Navigating the disease landscape: knowledge representations for contextualizing molecular signatures

Mansoor Saqi*, Artem Lysenko*, Yi-Ke Guo, Tatsuhiko Tsunoda and Charles Auffray

Corresponding author: Artem Lysenko, Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan. E-mail: artem.lysenko@riken.jp
*These authors contributed equally to this work.

Briefings in Bioinformatics, 2018, 1–15. doi: 10.1093/bib/bby025. Review Paper.
Submitted: 9 November 2017; Received (in revised form): 5 February 2018.
© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bby025/4978142 by guest on 08 January 2019.

Abstract

Large amounts of data emerging from experiments in molecular medicine are leading to the identification of molecular signatures associated with disease subtypes. The contextualization of these patterns is important for obtaining mechanistic insight into the aberrant processes associated with a disease, and this typically involves the integration of multiple heterogeneous types of data. In this review, we discuss knowledge representations that can be useful to explore the biological context of molecular signatures, in particular three main approaches, namely, pathway mapping approaches, molecular network centric approaches and approaches that represent biological statements as knowledge graphs. We discuss the utility of each of these paradigms, illustrate how they can be leveraged with selected practical examples and identify ongoing challenges for this field of research.

Key words: precision medicine; molecular medicine; multi-omics; disease modeling; integrated knowledge networks

Mansoor Saqi is a Research Fellow at the Data Science Institute at Imperial College London. His research interests are in translational medicine informatics, in particular data integration, and analysis of biological networks.

Artem Lysenko is a Postdoctoral Researcher at RIKEN Laboratory for Medical Science Mathematics. His research is in the areas of biomedical machine learning, applied data science and biological network analysis.

Yi-Ke Guo is the Director of the Data Science Institute at Imperial College London. He has been working on technology and platforms for scientific data analysis since the mid-1990s. His research focuses on knowledge discovery, data mining and large-scale data management. He is a Senior Member of the IEEE and a Fellow of the British Computer Society.

Tatsuhiko Tsunoda is a Full Professor at Tokyo Medical and Dental University, and a Chief Scientist and the Director of Laboratory for Medical Science Mathematics at RIKEN. His research interest is in development and applications of mathematical models and algorithms for genome analysis, multiomics and precision medicine.

Charles Auffray is the Founding Director of the European Institute for Systems Biology and Medicine (EISBM), Lyon. He develops a systems approach to complex diseases, integrating functional genomics, mathematical, physical and computational concepts and tools. He has participated in or coordinated multiple European Union projects.



Introduction

Owing to technological advances allowing rapid and low-cost profiling of biological systems, multiple types of omics data are now routinely collected from patient cohorts in studies of human diseases. These data can lead to a new taxonomy of disease [1]. Diseases that were previously considered to be single homogeneous conditions may in fact be collections of several disease subtypes. Identification of subtypes allows targeting of the underlying molecular processes involved in the particular form of disease associated with the subtype, and can lead to more personalized therapeutic strategies. A major challenge for translational medicine informatics is the effective exploitation of these data types to develop a more complete picture of the disease, in particular, a description of how changes at the molecular level are associated with the disease mechanism and disease pathophysiology. The molecular profiles by themselves do not, in general, offer immediate insight into the mechanism of disease or the underlying causes, and may be of limited utility for suggesting targets for therapeutic intervention. Putting these molecular patterns into a broader biological context represents a useful approach for understanding the underlying themes involved in the disease pathology, and this involves integrating the molecular profiles with other data types, including pathways, cellular and physiological data.

Together with data warehousing and data analytics, the contextualization of data emerging from high-throughput experiments is an important component of a translational informatics pipeline. The contextualization of experimental data is facilitated by mapping the data to background knowledge, which can include information at multiple levels of granularity. An effective representation of the disease needs to relate disease-specific information to background knowledge, so as to help researchers identify how dysfunctional proteins, pathways or other molecular processes lead to the cellular or physiological changes contributing to its aetiology.

Here, we review efforts to represent the context of disease-implicated genes, and we suggest that they can be divided into three broad themes, namely, pathway-centric approaches, molecular network centric approaches and approaches that represent biological statements as a knowledge graph (Table 1). We describe the advantages and drawbacks of the different representations. We do not discuss the details by which the genes are mapped to pathways or networks (for a review of approaches to data interpretation, see, for example, [2]).

Difficulties with representations of disease mechanism

The aim of contextualization is typically to obtain insight into disease mechanism. However, defining disease mechanism is not straightforward. The most appropriate representation of the context depends to some extent on what data are available and what questions are being asked. Disease mechanisms can be represented at different levels of granularity. At high granularity, a disease mechanism would describe all known temporal steps, for example, the detailed steps in a signalling pathway, so that the downstream consequences of proteins aberrant in the disease condition can be followed. However, information about the disease at lower resolution, for example, describing only indirect or correlative relationships, can also give mechanistic insight. Examples of such low-granularity descriptions of disease include statements like 'STAT6 activation is linked to mucous metaplasia' or 'Fluticasone up-regulates expression of FKBP51 in asthmatics' (see [3]), or 'gene ADAM33, implicated in asthma, is also implicated in chronic obstructive pulmonary disease and essential hypertension' [4].

Hofmann-Apitius [5] describes mechanism as a causal relationship graph that involves multiple levels of biological organization. Recently, in the Big Mechanism Project (BMP) [6], efforts have been initiated toward building mechanistic models of large, complex biological systems, such as those involved in cancer, using large-scale text mining (TM) followed by model construction using a variety of frameworks. Cohen [6] describes mechanism in terms of a model M, which maps an input x to an output y, $M\colon \hat{f}(x) \Rightarrow y$, where the real mechanism f is approximated by $\hat{f}$. An important part of the goal of the BMP is not only to develop a collection of such models but also to automate the process of their construction. Therefore, a large part of the project's efforts is devoted to the development of sophisticated TM systems that can automatically capture such mechanistic models directly from the literature [6]. The conceptualization of a mechanistic model put forward by Cohen [6] implies that such a model would be made of parts that reflect some corresponding real components. This view provides a natural way of combining the models (as such parts can be mechanistic models themselves), though it also follows that at some level a simplification will need to be made; e.g. for understanding a regulatory cascade, it may not be desirable to model the process at the level of individual atoms of the protein molecules involved. In the scientific literature, such an abstraction may be set at different levels, as human disease may be studied, for example, at physiological and molecular levels, with causal mechanistic relationships being found at both of them. One of the ongoing challenges in achieving a mechanistic understanding of disease is to develop tools and formalisms that can capture, model and reconcile the representations of these different perspectives.

Table 1. Approaches to knowledge representation for contextualizing disease biomarkers

| Approach | Examples of formats/frameworks | Advantages | Drawbacks |
|---|---|---|---|
| Pathway-centric | SBML, SBGN, BioPax | Ease of navigation (e.g. using NaviCell, Google Maps API) | Difficult to represent disease context |
| Integrated molecular networks | GeneMania, STRING | Easy-to-use resources and tools | Difficult to represent disease context, although connecting layers of information can provide some context |
| Knowledge graphs | openBEL, RDF, Malacards, BioXM™ | Agility of graph databases; semantic web approaches offer a federated solution; openBEL framework captures context | Lack of formal ontology in graph database representations. Semantic Web solutions do have a formal ontology but lack agility of graph databases such as Neo4j |

There is much information that may help to suggest disease mechanisms to be included in an integrated representation of disease: for example, information from disease-associated single-nucleotide polymorphisms (SNPs), information from protein–protein interactions and sequence similarity, information on the effects profiled in mutational studies in model organisms, and the relationships of molecular variations with physiological and cellular changes. The suitable representation of this information in a manner that can be efficiently stored, envisioned and queried is important for obtaining a mechanistic understanding of disease.

What is the most appropriate way to represent disease-related information in a computable format that can be useful for hypothesis generation and for obtaining mechanistic insight? To some extent, this depends on what questions are being asked of the data. Here, we describe three approaches to representation of contextual information, namely, pathway-centric approaches, molecular network centric approaches and information graphs (Figure 1). We contend that data integration is a key component of all three approaches.

Network and graph-based representations are highly effective for integration, query, exchange and visualization of numerous possible relationships and interactions between biological entities. Owing to the well-studied formal mathematics of graphs, they also often serve as a key data structure for algorithmic, machine learning and statistical approaches for analysis of these data. Construction of biological networks from high-throughput experimental data is an essential first step that enables many of the types of analysis discussed below. While we recognize the importance of this research, in this review our focus will be primarily on downstream applications of biological and knowledge networks for molecular signature contextualization; for completeness, we recommend the following works, which cover this closely related topic in detail [7–10]. For the convenience of readers, we have included a list of abbreviations used throughout the article in Table 2.

To provide a practical illustration of how different types of contextualization approaches can be used, we have included examples based on a severe asthma differential expression gene signature and a manually curated core asthma gene signature acquired from the DisGeNET database [11]. The differential expression signature was computed from the study by Voraphani and Gladwin [12] (GSE43696), which profiled transcriptomics of moderate and severe asthma as well as healthy controls. A full description of all analyses performed is given in the Supplementary Methods. These examples were developed purely to illustrate the approaches discussed in this review, and other examples relevant to asthma could have been chosen.

Contextualization using integrated networks

Networks or graphs are sets of nodes and edges, where a node represents a concept and an edge represents a relationship. Biological concepts can correspond to real physical entities like molecules, genes and proteins, or represent more abstract entities like typed groups (pathways, functional categories) and diseases. The relationships represent a meaningful association between these concepts, such as 'protein A interacts with protein B', 'enzyme participates in pathway' or 'protein is a transcription factor'. In many network studies in systems medicine, the nodes are restricted to a particular single type (such as protein), and such graphs are termed homogeneous. In this interpretation, an edge can represent a single relationship type, such as a protein–protein interaction, or it can represent multiple types of evidence that associate the two protein nodes. For example, in the STRING protein–protein interaction (PPI) database [13], evidence types from multiple sources are used to suggest functional associations between proteins. The evidence types include genome location, co-occurrence in the scientific literature and reported associations in other databases. Mechanistic insight into diseases can be obtained by mapping disease-associated genes (e.g. identified from omics data analyses) to such networks. The network neighbourhood of the disease genes provides a context and can suggest new candidates, as well as common themes at the level of molecular processes and pathways that may be associated with disease pathogenesis. GeneMania [14] is a Cytoscape App that enables the construction of network neighbourhoods from a set of seed genes by integration of a number of constituent networks (e.g. PPI, shared protein domains, co-expression).
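The idea of a network neighbourhood around a set of seed genes can be illustrated with a minimal sketch. The code below is not the procedure used by STRING or GeneMania; it assumes a homogeneous, undirected network given as a plain edge list, and the function names and toy node labels are hypothetical.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def first_shell(adjacency, seeds):
    """Return direct neighbours of the seed nodes, excluding the seeds."""
    seeds = set(seeds)
    shell = set()
    for gene in seeds:
        shell |= adjacency.get(gene, set())
    return shell - seeds


# Toy, made-up interactions; not taken from any real database.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
network = build_network(toy_edges)
neighbourhood = first_shell(network, ["A", "C"])
print(sorted(neighbourhood))
```

In a real analysis the edge list would come from an interaction resource, and edges would usually carry evidence types or confidence scores rather than being unweighted.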

Figure 1. Examples of representations for contextualizing disease-associated genes. A pathway centric view of the network neighbourhood of ALOX5 from ReactomeViz using the Cytoscape plug-in (left). A molecular network centric view using the GeneMania Cytoscape plug-in (middle). A heterogeneous network including proteins, pathways and diseases, constructed and displayed using Neo4j (right).

Network representation can be leveraged to discover high-level patterns that underlie the organization of complex biological systems and, more specifically, to identify patterns relating to the development of disease. One prominent early example of this was the work of Barabasi and co-workers [15], which represented disease–gene relationships as a bipartite graph from which monopartite disease–disease and gene–gene graphs could be obtained (the human–disease network and the disease–gene network), giving information about disease comorbidities and thereby contextualising known disease genes. By combining the disease–gene network with known PPI networks, it was demonstrated that the corresponding disease-associated proteins for a given disease have a greater tendency to interact in a PPI network. Likewise, proximity of genes in the network can be a strong indicator of their involvement in similar processes. Genes associated with a given disease tend to be closer together in the interactome, and overlapping disease neighbourhoods are related to greater comorbidity [16].
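The projection of a bipartite disease–gene graph onto a monopartite disease–disease graph can be sketched as follows. This is only an illustration of the general technique, under the assumption that two diseases are linked when they share at least one gene and that the edge weight is the number of shared genes; the labels `d1`, `g1` and so on are placeholders, not real associations.

```python
from collections import defaultdict
from itertools import combinations


def project_diseases(disease_gene_pairs):
    """Project (disease, gene) pairs onto weighted disease-disease edges.

    The weight of an edge is the number of genes shared by the two diseases.
    """
    gene_to_diseases = defaultdict(set)
    for disease, gene in disease_gene_pairs:
        gene_to_diseases[gene].add(disease)

    weights = defaultdict(int)
    for diseases in gene_to_diseases.values():
        # Sorting gives each undirected edge one canonical key.
        for d1, d2 in combinations(sorted(diseases), 2):
            weights[(d1, d2)] += 1
    return dict(weights)


# Placeholder associations for illustration only.
toy_pairs = [
    ("d1", "g1"), ("d2", "g1"),
    ("d1", "g2"), ("d2", "g2"), ("d3", "g2"),
]
disease_network = project_diseases(toy_pairs)
print(disease_network)
```

Swapping the two elements of each pair yields the corresponding gene–gene projection, in which two genes are linked when they are associated with the same disease.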

The network can also be analysed to identify notable topological features, many of which have been associated with biologically meaningful properties. One such feature is the modular organization of the network. Modules are groups of nodes that are more densely connected to other nodes within the group than to nodes outside it [17]. In PPI networks, modules have been shown to correspond to functionally similar groups of proteins, which in some cases may be relevant to disease processes. One such example was identified by [18], who, using genes from seven stress-related diseases together with the STRING network, identified a common subnetwork that may provide insights into the comorbidity of these diseases. An early example is the network model for asthma [19] that combined information from gene co-expression, taken from five public gene expression studies, with PPI networks and information from annotation sources. The resulting 'global map' was then used to explore common themes in asthma pathogenesis.
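The definition of a module given above can be made concrete by counting edges. The sketch below is not a module detection algorithm; it only checks, for a candidate node set, how many edges fall inside the set and how many cross its boundary. The function name and toy edges are hypothetical.

```python
def module_edge_counts(edges, module):
    """Count edges inside a candidate module and edges crossing its boundary."""
    module = set(module)
    internal = 0
    boundary = 0
    for a, b in edges:
        members = (a in module) + (b in module)
        if members == 2:
            internal += 1
        elif members == 1:
            boundary += 1
    return internal, boundary


# Toy network: a triangle A-B-C attached to a short tail C-D-E.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
internal, boundary = module_edge_counts(toy_edges, {"A", "B", "C"})
print(internal, boundary)
```

A candidate set with many more internal than boundary edges is consistent with the informal definition; published methods typically turn this intuition into a score that is optimized over the whole network.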

Table 2. List of abbreviations

| Abbreviation | Expanded name | Comment |
|---|---|---|
| API | Application programming interface | A set of interfaces, tools and functions used for creation of applications |
| BEL | Biological Expression Language | A curation language for structured capturing of data about biological systems and experiments |
| BELIEF | BEL Information Extraction workFlow system | A framework for automated parsing of information into BEL format |
| BioPAX | Biological Pathways Exchange language | A standard that formally defines biological pathway conceptualization in OWL format |
| BMP | Big Mechanism Project | A project by the US Department of Defense for automated construction of mechanistic cancer models from scientific literature |
| COPD | Chronic Obstructive Pulmonary Disease | A lung disease characterized by impeded breathing and phenotypically similar to asthma |
| Cypher | | A query language for the Neo4j graph database |
| DBMS | Database management system | Software providing capabilities to mediate storage, query and manipulation of data |
| EBI | European Bioinformatics Institute | |
| EFO | Experimental Factor Ontology | |
| GO | Gene Ontology | One of the most widely used ontologies for representing functions of biological entities |
| GWAS | Genome-wide association study | An observational study that relates germ-line variations between individuals to phenotypes |
| IR | Information Retrieval | A process of extraction of relevant information from some wider superset |
| Kappa | | A rule-based declarative language used mostly in the molecular biology domain |
| Neo4j | | Graph-based database solution |
| OBO | Open Biomedical Ontologies initiative | OBO format is an alternative language for authoring ontologies, mainly used in biological sciences |
| OWL | Web Ontology Language | Currently the most widely used language for authoring ontologies |
| PD | Process Description language | An extension of the SBGN standard with additional capabilities to represent temporal aspects of biological processes |
| PPI | Protein–protein interaction | |
| RDF | Resource Description Framework | Core data exchange format for the Semantic Web |
| SBGN | Systems Biology Graphical Notation | A standard for graphical representation for the biological and biochemical domain |
| SBML | Systems Biology Markup Language | An XML-based format for storing and exchanging biological models |
| SBO | Systems Biology Ontology | |
| SNP | Single Nucleotide Polymorphism | A single-base variant or mutation in a genomic sequence |
| SQL | Structured Query Language | Family of similar query languages used for data management in relational databases |
| TM | Text Mining | A process of extraction of structured data from free text |
| TMO | Translational Medicine Ontology | |
| URL | Uniform Resource Locator | Web address resolving to a resource on the Web |
| XML | Extensible Markup Language | Popular meta-language for document markup |

A powerful example of exploiting the network neighbourhood of known disease genes to get mechanistic insight is given in [20]. Using a set of 129 asthma-implicated ('seed') genes that map to the human interactome, together with a novel algorithm [16] for module detection, Sharma et al. [20] identified a potential asthma disease module containing 441 genes, including 91 seeds. A large number (162) of pathways had at least half of their genes in the putative disease module. However, a strategy involving prioritizing genes on the basis of their asthma relevance, using gene expression and GWAS information, enabled a ranking of pathways and subsequent identification of pathways not traditionally associated with asthma. The plausibility of one of these (the GAB1 signalosome) was shown experimentally. The authors also explored the differential expression of the GAB1 gene in other immune-related diseases. This study illustrates the use of both pathways and networks with other supporting evidence (e.g. gene expression, GWAS) to contextualize disease genes and provide new avenues for understanding disease mechanisms. It highlights the need for knowledge representations that can capture multiple levels of information, such as the conditions in which a gene may be up-regulated.
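The criterion of a pathway having 'at least half of its genes' in a disease module amounts to a simple set overlap. The sketch below shows that calculation only; it is not the prioritization strategy of Sharma et al., and the pathway names, gene labels and the `min_fraction` parameter are hypothetical.

```python
def pathways_in_module(pathways, module, min_fraction=0.5):
    """Return pathways whose fraction of member genes in the module is
    at least min_fraction, mapped to that fraction."""
    module = set(module)
    hits = {}
    for name, genes in pathways.items():
        genes = set(genes)
        if not genes:
            continue  # skip empty pathways to avoid division by zero
        fraction = len(genes & module) / len(genes)
        if fraction >= min_fraction:
            hits[name] = fraction
    return hits


# Placeholder pathways and module membership for illustration only.
toy_pathways = {
    "P1": ["A", "B", "C", "D"],
    "P2": ["A", "X", "Y"],
    "P3": ["B", "C"],
}
toy_module = {"A", "B", "C"}
selected = pathways_in_module(toy_pathways, toy_module)
print(selected)
```

As the text notes, an overlap threshold alone can select a large number of pathways, which is why additional evidence such as gene expression or GWAS data is needed to rank them.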

One approach to representation of the complex networks that are associated with disease is to decompose the information into layers, with each layer representing information at a different level of biological organization. Radonjic and co-workers [21, 22] used network-based methods to represent and integrate information at three levels, namely, differentially expressed biological processes, transcription factors whose target proteins are differentially expressed, and physiological information such as energy intake and weight. They explored adaptation of white adipose tissue to a high-fat diet, a biological system that may give clues as to which molecular processes are involved in obesity and type 2 diabetes. Gustafsson et al. [17] discussed the concept of multilayer disease modules as a possible framework to generate predictive markers that capture multiple types of disease-related information (PPI, symptoms, etc.). The integration of the layers and the inclusion of additional contextual information associated with the nodes and relationships of the network (e.g. cell-type specificities) are important for mechanistic insight into disease. Network-based data fusion approaches to integration of multiple types of biological information relevant to disease are described by [23, 24], and these offer the potential to predict new disease-associated genes as well as to explore disease–disease relationships.

The network-based integration of different types of information can suggest underlying mechanisms of disease. Huan and co-workers [25] performed an integrated analysis to explore mechanisms of blood pressure regulation. They used weighted gene co-expression network analysis (WGCNA) [26] to identify co-expressed modules that showed correlation to blood pressure measurements, integrated SNP data as well as PPI networks, and used Bayesian analysis to identify key drivers of the biological process, one of which was selected for further experimental study. The identification of associations between diseases can suggest a common mechanistic basis [27] and can be used predictively [28]. Molecular network approaches to representing disease landscapes offer particular views rather than a unified framework. In some ways, they are similar to information overlay methods in data analytics. Points of correspondence between the different views (networks) need to be found to get an integrated picture that will be useful for suggesting mechanism.

To illustrate these contextualization strategies, 22 genes from the differential expression signature (derived from the GSE43696 study, as outlined in the Supplementary Methods) were mapped to the protein association network from the STRING database. First, the results were explored using the Web-based interactive visualization offered by the STRING database (Figure 2). As proteins realize their function through complex sets of interactions, the implications of gene expression changes can often be better understood by assessing their impact on the wider interactome. The top panel shows disease signature genes as well as 50 genes from the first shell identified as most relevant to this input. The analysis suggested several high-confidence modules, most of which contained prominent asthma-associated signalling pathways (HIF [29], TGF-beta [30], cytokine–cytokine receptor interaction [31] and circadian clock-related [32]). Several high-degree nodes connecting these multiple modules are also known to be implicated in asthma: tyrosine kinase FYN is involved in inflammation [33], IL1B has been shown to be genetically associated with asthma susceptibility [34], and the node with the highest degree (SMAD4) is involved in airway smooth muscle hyperplasia [35].

Given that the number of identified human PPIs is now in the millions, visual exploration is typically limited by the sheer amount of relevant information that can be usefully displayed. Approaches like network diffusion and graph random walks offer a mathematically robust way to select the most relevant nodes, even in large graphs. These methods are effective for disease analysis, as disease-related genes are commonly located close to each other in the network. In this example, we have demonstrated the applicability of this key principle and also identified the 10 most relevant (closest to the signature) genes by the diffusion state distance metric [36]. This analysis highlighted the potential relevance of the PTEN gene, known to be prominently involved in airway hyperresponsiveness and inflammation [37], as well as a small module with multiple members of the PI3K/Akt pathway that promotes asthma-related airway remodelling [38].
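To give a feel for how random-walk methods rank nodes by proximity to a signature, the sketch below implements a plain random walk with restart by power iteration. This is a swapped-in, simpler technique, not the diffusion state distance metric [36] used for the analysis above; the restart probability, iteration count and toy graph are all assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Score nodes by their visiting probability under a random walk that
    jumps back to the seed set with probability `restart` at every step.

    Assumes an undirected graph in which every node has at least one
    neighbour and every seed is a key of `adjacency`.
    """
    nodes = sorted(adjacency)
    seeds = set(seeds)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(iterations):
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            # Spread the non-restarting probability mass evenly to neighbours.
            share = (1.0 - restart) * scores[n] / len(adjacency[n])
            for m in adjacency[n]:
                updated[m] += share
        scores = updated
    return scores


# Toy path graph A - B - C - D with a single seed at A.
toy_adjacency = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
rwr_scores = random_walk_with_restart(toy_adjacency, ["A"])
ranking = sorted(rwr_scores, key=rwr_scores.get, reverse=True)
print(ranking)
```

Nodes closer to the seed receive higher scores, which is the property that makes such methods useful for selecting a small, displayable subset of a very large interactome.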

Contextualization by pathway centric representations

Understanding the mechanism of disease often starts with identification of the aberrant molecular pathways that may be associated with the disease. Molecular pathway models aim to capture sequences of actions or events of molecular 'actors'. Some examples of commonly modelled events include biochemical reactions, state changes and movements between different cellular compartments. Mapping known disease genes to known pathways is a well-established methodology for contextualization, as pathways capture an intermediate level of organization between molecular entities and phenotype. The identification of hallmark pathways believed to be associated with disease conditions has been used to create pathway centric disease maps [39], which are useful for visual exploration and for the overlay of data from, for example, gene expression studies. Typically, such approaches involve identification of disease-related pathways by careful manual curation of the literature. AlzPathway [40] is a collection of signalling pathways in Alzheimer's disease. The map is represented in the Systems Biology Graphical Notation (SBGN) [41] process description (PD) language and was also made available in other formats. The resource also offers a Web interface that allows users to review and comment on annotations, thus facilitating community curation. Another example of a pathway-based disease map is the Parkinson's disease map [42], which builds on an understanding of the hallmark metabolic, regulatory and signalling pathways of the disease [39]. The Parkinson's disease map can be accessed using the Minerva Web server, which has advanced functionality for visualization (using the Google Maps API) that facilitates manual exploration [43]. The Atlas of Cancer Signalling Networks [44] uses a similar approach, where several types of biological entities, like phenotype, ion and drug, can be represented on a zoomable map in SBGN format. Clicking on an entity brings up a box with its description and relevant annotations.

Ideally, a computational framework for contextualizing disease-implicated genes should allow annotation by members of the research community, and some selected frameworks offer this important functionality. WikiPathways [45] contains a set of open, evolving, updateable disease-associated pathways. PathWhiz [46] is a Web-based tool that allows users to draw pathways so as to include contextual information such as cells, tissues and organs. The pathway maps are designed to be visually appealing to facilitate manual exploration. PathWhiz includes metabolic and signalling pathways, as well as a number of disease-associated pathways. Recently, the PathWay Collage tool [47] has been developed, which allows the visualization of fragments of pathways (defined by the user so as to reduce the complexity of full pathway representations) onto which omics data can be overlaid.

[Figure 2, panels A–C. Panel B: y-axis ‘Diffusion state distance’; groups ‘Random’, ‘DisGeNET signature’, ‘Severe Asthma signature’ and ‘Severe Asthma vs DisGeNET’ under the title ‘Distances between members of gene signatures’, plus ‘Ten closest genes to severe asthma gene expression signature’. Panel C genes: ILK, FLT3, PTEN, YWHAB, YWHAQ, YWHAE, MATK, YWHAZ, SFN, YWHAG.]

Figure 2. Analysis of severe asthma differential expression gene signature in the context of STRING protein association network. (A) Fifty genes in the first shell of the 22-gene signature: (a–c) high-degree module interconnectivity genes, (1) calmodulins and HIF-1 signalling pathway, (2) TGF-beta signalling pathway, (3) cytokine–cytokine receptor interaction pathway and (4) circadian rhythm-related genes. (B) Diffusion state distance of gene sets; this measure is derived from similarity of random walk profiles for each pair of nodes. In grey: random pairs of nodes. Distances between members of asthma signatures to each other are shown in blue (the severe asthma versus DisGeNET category shows distances between members of the two sets). Finally, the 10 closest genes to the differential expression signature set are shown in green. (C) Ten closest genes to the severe asthma signature, visualized in the STRING database viewer.

At present, there are two leading exchange standards adopted for representing biochemical pathway networks. The first one is Systems Biology Markup Language (SBML) [48], which was primarily intended to support development and sharing of quantitative biochemical models. Therefore, SBML has been designed to be generic, with the main building blocks being species (quantifiable physical entities), processes (which define how entities are manipulated) and compartments (which provide low-level context for processes and entities). As the SBML format aims to be generic, further subcategories of these concepts are not defined as part of the standard, and instead, particular emphasis is placed on enabling mathematical definition of the model (e.g. in the form of ordinary differential equations). Some of the necessary biological underpinnings are added via a related project, the SBGN [41] standard, which defines how different SBML entities can be represented graphically. In SBML, fine-grained annotation or specification of complex contexts is possible through special fields set aside for user data. However, as these fields are not bound by the standard, they may not be supported by all of the tools and therefore generally would not be used in evaluation of the model.
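The three SBML building blocks named above (species, processes and compartments) and their link to an ordinary differential equation can be sketched in a few lines of Python. This is only an illustration of the concepts, not the SBML format or any SBML library: the class names, the single reaction, the rate constant `k` and the explicit Euler integrator are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Compartment:
    name: str
    volume: float  # arbitrary units; unused by the toy kinetics below


@dataclass
class Species:
    name: str
    compartment: str
    amount: float


@dataclass
class Reaction:
    name: str
    reactants: List[str]
    products: List[str]
    rate: Callable[[Dict[str, float]], float]  # kinetic law: amounts -> flux


def euler_step(amounts, reactions, dt):
    """Advance d(amount)/dt = sum of reaction fluxes by one explicit Euler step."""
    delta = {name: 0.0 for name in amounts}
    for reaction in reactions:
        flux = reaction.rate(amounts)
        for name in reaction.reactants:
            delta[name] -= flux
        for name in reaction.products:
            delta[name] += flux
    return {name: amounts[name] + dt * delta[name] for name in amounts}


# Hypothetical one-reaction model: A -> B with first-order kinetics.
cytosol = Compartment("cytosol", 1.0)
species = [Species("A", cytosol.name, 10.0), Species("B", cytosol.name, 0.0)]
k = 0.1  # hypothetical rate constant
conversion = Reaction("A_to_B", ["A"], ["B"], lambda a: k * a["A"])

amounts = {s.name: s.amount for s in species}
for _ in range(100):  # integrate to t = 10 with dt = 0.1
    amounts = euler_step(amounts, [conversion], dt=0.1)
```

The point of the sketch is that the kinetic law is the only place where quantitative information enters; without it, the same species and reactions describe a pathway only qualitatively.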

The SBGN PD language [41] was designed to give an unambiguous representation of the mechanism of action within a biological pathway that can be interpreted visually to facilitate manual exploration. The PD language can represent all the steps in, for example, a reaction, showing how a biological entity changes in a temporal manner. However, fully exploiting the PD representation for applications like numerical modelling would require additional data associated with component steps (such as rate constants), which often are not readily available.

Another prominent data exchange standard is BioPAX [49], which was primarily designed for unambiguous sharing of biochemical pathway data and therefore offers a much wider selection of domain-specific terms to characterize and categorize both interactors and interactions. As the Web Ontology Language (OWL) forms the core of this standard, all of the terms are organized hierarchically, and relationships between them are formally defined. The BioPAX standard aims to facilitate data sharing and integration via Semantic Web technologies and the Resource Description Framework (RDF). If an RDF representation is used, the format can be easily extended to incorporate complex context information; however, BioPAX currently does not aim to offer the means of defining experimental or biomedical context for pathways, as, by design, other standards/ontologies can be brought in to model them.

Both SBML and BioPAX are predicated on the developments of the past decade, when several key technologies and design principles were introduced that greatly influenced how biomedical data-sharing standards are implemented. One of these developments was the increasing adoption of formal ontologies, which allowed development of well-structured controlled vocabularies for different domains and definition of cross-mappings between them. For example, the SBML standard allows annotation of models with Systems Biology Ontology (SBO) terms, which are also used in SBGN and can be mapped on to BioPAX via SBPAX [50]. Therefore, in principle, it is possible to integrate models across all of these standards on a qualitative level. On a syntactic level, different formalisms can be integrated by introducing support for a low-level common language like RDF, which is representation-agnostic and allows all types of data to be linked via common identifiers. Both of these solutions enable modern standards to be designed in a modular way, where different specialized representations can be developed by different communities of domain experts and then combined together, or even extended for the needs of a particular applied project.

In recent years, community efforts have been increasingly devoted to improving inter-compatibility across all main biomedical standards. These efforts include development of converters, controlled vocabulary mappings and better support of different standards both by data providers and analytical tool developers. Greater compatibility also means that smaller models (e.g. from individual studies) can now be meaningfully shared and combined. One prominent resource that aims to facilitate this process is NDEx [51]. NDEx allows researchers to share their models in a variety of different formats, which can then be used directly by using the NDEx website like a data warehouse or integrated into a common representation by using the NDEx Cytoscape [52] plug-in.

Pathway-based representations enable relevant methods to incorporate the directionality and interaction type (e.g. up- or down-regulation), and in this way, they can capture causal associations underlying different disease mechanisms. To illustrate this application, we have used the ‘Tied Diffusion through Interacting Events’ method [53], which can identify significantly implicated pathways connecting two sets of genes (in this case, the DisGeNET asthma signature and our severe asthma signature). The diffusion model used by this method can take into account magnitude of effects, direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome found to be significant for these two sets was visualized in Cytoscape (Figure 3). It is possible to see that two interleukin genes (both involved in airway inflammation [54]) were found to be particularly critical, with the lower panel showing key associations involved.
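The idea of finding interconnectors between two gene sets by diffusion can be illustrated with a much-simplified sketch. It is not the implementation of the method in [53]: it ignores edge direction, interaction type and effect magnitude, uses a plain random walk with restart on an undirected toy graph, and scores each node by the smaller of its two diffusion scores. The node names, the `restart` value and the use of the minimum as a linker score are assumptions made for the example.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=200):
    """Approximate stationary visiting probabilities of a walker restarting at `seeds`.

    Every node in `adjacency` must have at least one neighbour.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * prob[n] / len(adjacency[n])
            for neighbour in adjacency[n]:
                nxt[neighbour] += share
        prob = nxt
    return prob


# Hypothetical undirected path g1 - g2 - g3 - g4 - g5.
adjacency = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3", "g5"],
    "g5": ["g4"],
}

from_set_a = random_walk_with_restart(adjacency, seeds={"g1"})
from_set_b = random_walk_with_restart(adjacency, seeds={"g5"})

# A node scores highly only if it is well reached from both seed sets.
linker = {n: min(from_set_a[n], from_set_b[n]) for n in adjacency}
best_linker = max(linker, key=linker.get)
```

On this toy path the middle node receives the highest linker score, because it is the only node that is reasonably close to both seeds.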

Contextualization by knowledge graph representations

There are many specialist sources of disease-related information, such as databases of mutations and their consequences. In addition, there is vast information in unstructured formats in the scientific literature, and much of it is contextual, for example, statements that describe correlations between gene products, or between small molecules and gene products, that are observed in particular cell types. Information in the literature may also describe weak associations only seen in certain conditions, or for which there may be limited evidence, that could nevertheless still be useful for hypothesis generation.

The Biological Expression Language (BEL) [55] represents statements in biology, typically correlative and causative statements, as well as the context in which these statements apply. A biological statement is represented as a triple (subject, predicate and object), where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object. A BEL term can be the abundance of a biological entity, or it can be a biological process such as one described in the Gene Ontology (GO) [56], or even a given disease condition. Each entity is described in an existing namespace (e.g. from databases like ChEBI [57]) or in particular user-defined namespace(s). There is a fixed set of causal, correlative and other relationships that link these entities. The BEL compiler checks that the statements are syntactically correct and carries out the integration and alignment of the data by establishing equivalences between identical concepts.

The knowledge captured by BEL statements can be represented as a graph. The object can be a BEL term or a BEL statement. Each statement can be associated with metadata that contextualizes it, for example, qualifying it to be true only in specific tissues. The BEL knowledge network can facilitate data interpretation using a reverse causal reasoning approach [58]. For example, gene expression data can be mapped to smaller subnetworks that represent cause and effect models. The extent to which the models can explain the measurements of differentially expressed genes can then be assessed and potentially yield insight into the mechanism.
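A minimal sketch may help to make the triple-plus-context structure and the idea behind reverse causal reasoning concrete. The code below is not a BEL parser or the statistic used in [58]: the `Statement` class, the `EX` namespace, the entity names and the simple concordance count are assumptions made for illustration, and only the relationship names `increases` and `decreases` are taken from BEL.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Statement:
    subject: str
    relation: str
    obj: Union[str, "Statement"]  # the object may itself be a statement
    context: Dict[str, str] = field(default_factory=dict)


SIGN = {"increases": +1, "decreases": -1}


def concordance(statements, upstream, observed, context=None):
    """Count measurements that agree or disagree with `upstream` being increased."""
    agree = disagree = 0
    for s in statements:
        if s.subject != upstream or s.relation not in SIGN:
            continue
        if not isinstance(s.obj, str) or s.obj not in observed:
            continue  # nested statements and unmeasured targets are skipped
        if context and any(s.context.get(k) != v for k, v in context.items()):
            continue  # statement is not asserted for this context
        if SIGN[s.relation] == observed[s.obj]:
            agree += 1
        else:
            disagree += 1
    return agree, disagree


# Hypothetical knowledge base; "EX" is a made-up namespace.
knowledge = [
    Statement("p(EX:A)", "increases", "r(EX:B)", {"tissue": "lung"}),
    Statement("p(EX:A)", "decreases", "r(EX:C)", {"tissue": "lung"}),
    Statement("p(EX:A)", "increases", "r(EX:D)", {"tissue": "liver"}),
    Statement("p(EX:E)", "increases", Statement("p(EX:A)", "increases", "r(EX:B)")),
]

# Hypothetical differential expression calls: +1 up-regulated, -1 down-regulated.
observed = {"r(EX:B)": +1, "r(EX:C)": +1, "r(EX:D)": +1}

all_contexts = concordance(knowledge, "p(EX:A)", observed)
lung_only = concordance(knowledge, "p(EX:A)", observed, context={"tissue": "lung"})
```

Restricting the evaluation to one tissue changes which statements count as evidence, which is the practical effect of attaching context metadata to each statement.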

Hofmann-Apitius [5] posits that an understanding of disease mechanisms involves integrating information at multiple biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer’s disease [59–62]. They used the BEL language to represent information from the literature, together with relevant experimental data sets, to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge ‘grey zone’, which includes both established and emerging connections between entities. This approach can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, TM will be increasingly important for supporting this process [63]. Existing examples include the BEL Information Extraction workFlow (BELIEF) system and BELSmile [64]. These frameworks work by using automated TM to detect BEL concepts and relationships and then allowing the curators to explore and annotate their biological contexts to create final high-quality networks.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible forms: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.

Figure 3. Significant pathways identified using the Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and the severe asthma differential expression signature were considered in the context of the Reactome regulatory pathway network. In both panels, colour intensity indicates weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of ‘component’ type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called the ‘COPD knowledge base’ [67–70]. The authors of the COPD knowledge base do acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The Malacards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and Malacards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries could be challenging to realize in relational databases because of the cost of computing statements across different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in cases where links only become apparent once multiple sources have been integrated but may not be immediately obvious because of the sheer amounts of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is the Neo4j DBMS. Neo4j is a Java-based solution, which also offers its own graph query language (Cypher), which is conceptually similar to SQL.
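What a traversal-type query across entities of diverse types does can be shown with a small in-memory sketch. This is plain Python rather than Neo4j or Cypher, and the node names, node types and the example pattern (gene, pathway, gene, compound) are hypothetical. In a relational schema, each hop of the same pattern would typically correspond to one more table join.

```python
from collections import defaultdict


def build_adjacency(pairs):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def follow_metapath(adjacency, node_type, start, metapath):
    """Return all simple paths from `start` whose node types follow `metapath`."""
    if node_type[start] != metapath[0]:
        return []
    paths = [[start]]
    for wanted in metapath[1:]:
        paths = [
            path + [nxt]
            for path in paths
            for nxt in sorted(adjacency.get(path[-1], ()))
            if node_type[nxt] == wanted and nxt not in path
        ]
    return paths


# Hypothetical integrated graph drawn from several sources.
node_type = {
    "gene_1": "Gene", "gene_2": "Gene", "gene_3": "Gene",
    "pathway_1": "Pathway", "compound_1": "Compound", "disease_1": "Disease",
}
adjacency = build_adjacency([
    ("gene_1", "pathway_1"),
    ("gene_2", "pathway_1"),
    ("gene_3", "pathway_1"),
    ("compound_1", "gene_2"),
    ("disease_1", "gene_2"),
])

# Which compounds target another gene in a pathway shared with gene_1?
hits = follow_metapath(adjacency, node_type, "gene_1",
                       ["Gene", "Pathway", "Gene", "Compound"])
```

The link between `gene_1` and `compound_1` is not stated by any single edge; it only appears once the pathway membership and drug target edges are held in the same graph.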

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where the subject and object are entities with a unique identifier, and the predicate defines the type of relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity and that can, in turn, link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
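The practical consequence of globally unique identifiers is that triples from independent sources can be merged without any schema alignment. The sketch below shows this with plain Python tuples rather than an RDF library; the `http://example.org/` namespace, the predicate names and the two toy sources are assumptions made for the example.

```python
EX = "http://example.org/"  # hypothetical namespace, not a real resource

PARTICIPATES_IN = EX + "vocab/participatesIn"
ASSOCIATED_WITH = EX + "vocab/associatedWith"

# Two hypothetical providers publishing (subject, predicate, object) triples.
source_a = {
    (EX + "gene/G1", PARTICIPATES_IN, EX + "pathway/P1"),
    (EX + "gene/G2", PARTICIPATES_IN, EX + "pathway/P2"),
}
source_b = {
    (EX + "pathway/P1", ASSOCIATED_WITH, EX + "disease/D1"),
    (EX + "gene/G1", PARTICIPATES_IN, EX + "pathway/P1"),  # duplicate of source_a
}

# Shared identifiers make integration a set union; duplicates collapse.
merged = source_a | source_b


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])
    )


def genes_linked_to(triples, disease):
    """Genes that participate in any pathway associated with `disease`."""
    pathways = {s for s, _, _ in match(triples, p=ASSOCIATED_WITH, o=disease)}
    return sorted(s for s, _, o in match(triples, p=PARTICIPATES_IN) if o in pathways)


linked = genes_linked_to(merged, EX + "disease/D1")
```

Neither source alone answers the gene-to-disease question; the answer exists only in the merged set, because both sources refer to the pathway by the same identifier.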

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard, useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, usually an additional set of standards will be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include the EBI RDF platform [83] (which covers such important databases as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (which aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (which uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include Experimental Factor Ontology (EFO) [91], SBO [92], Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used the Neo4j-based Hetionet v1.0 resource [95], which integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, the severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of query-specific degree of drug nodes (i.e. number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
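The ranking criterion used above, the query-specific degree of a drug node, is simply the number of distinct returned proteins that the drug links to. A minimal sketch of that count is given below; the drug and protein names are placeholders and do not reproduce the Hetionet result, and the function name is an assumption made for the example.

```python
from collections import defaultdict


def rank_by_query_degree(links, returned_proteins):
    """Rank drugs by the number of distinct returned proteins they link to."""
    targets = defaultdict(set)
    for drug, protein in links:
        if protein in returned_proteins:
            targets[drug].add(protein)
    # Highest degree first; ties are broken alphabetically for a stable order.
    return sorted(
        ((drug, len(proteins)) for drug, proteins in targets.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Placeholder drug-protein links, including one repeated link.
links = [
    ("drug_a", "protein_1"),
    ("drug_a", "protein_2"),
    ("drug_a", "protein_2"),
    ("drug_b", "protein_2"),
    ("drug_c", "protein_9"),
]
returned = {"protein_1", "protein_2", "protein_3"}

ranking = rank_by_query_degree(links, returned)
```

Because the degree is computed only over proteins returned by the query, a drug with many targets elsewhere in the network (such as `drug_c` here) does not appear in the ranking at all.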

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data would typically require both different levels of preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are construction of co-expression networks [98, 99] from transcriptomic data and sequence homology graphs [100, 101]. However, currently, manual interpretation and publishing of inferred models as free text in scientific papers remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them, or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using the Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases. (B) Query to explore the connections between the drug niclosamide, asthma and the differential expression signature, linked via relevant proteins and pathways.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In the simplest form, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiled data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
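The simplest co-occurrence approach mentioned above can be sketched as counting how often two vocabulary terms appear in the same document; the counts then serve as edge weights of a literature-derived network. This is a bare illustration only: the documents and term names are placeholders, matching is by exact lowercase token, and real resources such as those cited add entity normalization and statistical scoring that are not shown here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(documents, vocabulary):
    """Count, for each pair of vocabulary terms, the documents mentioning both."""
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        present = sorted(term for term in vocabulary if term in tokens)
        counts.update(combinations(present, 2))
    return counts


# Placeholder abstracts and terms, not real literature.
documents = [
    "mirna_x is elevated in tumour_y samples",
    "tumour_y and mirna_x were measured together",
    "mirna_z was unchanged in tumour_y",
]
vocabulary = {"mirna_x", "mirna_z", "tumour_y"}

edges = cooccurrence(documents, vocabulary)
```

Each key of `edges` is an alphabetically ordered pair of terms and each value is the number of supporting documents, so the result can be loaded directly as a weighted edge list.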

Extraction of information from text is contingent on unambiguous identification of entities and relationship types, from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions, and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way this has been addressed is through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge for the first time included a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and achieved the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, the designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
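For readers less familiar with the F-measure quoted above, the sketch below computes the standard balanced F-measure (harmonic mean of precision and recall) for a set of extracted statements against a gold standard. It assumes exact matching of whole statements; the challenge itself may have scored partial matches differently, and the statement strings are placeholders.

```python
def f_measure(predicted, gold):
    """Balanced F-measure of a predicted set against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder statements: 4 extracted, 5 in the gold standard, 2 shared.
predicted = {"s1", "s2", "x1", "x2"}
gold = {"s1", "s2", "s3", "s4", "s5"}

score = f_measure(predicted, gold)  # precision 0.5, recall 0.4
```

A low F-measure can therefore arise from missed statements, spurious statements or both, which is why scores in the range reported for this task still leave substantial work for curators.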

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text, in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question, therefore, becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). Although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former would go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly more complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. The future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and ultimately clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.


17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.


59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting. Boston, USA: Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.


99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.


challenge for translational medicine informatics is the effective exploitation of these data types to develop a more complete picture of the disease, in particular, a description of how changes at the molecular level are associated with the disease mechanism and disease pathophysiology. The molecular profiles by themselves do not, in general, offer immediate insight into the mechanism of disease or the underlying causes, and may be of limited utility for suggesting targets for therapeutic intervention. Putting these molecular patterns into a broader biological context represents a useful approach for understanding the underlying themes involved in the disease pathology, and this involves integrating the molecular profiles with other data types, including pathways, cellular and physiological data.

Together with data warehousing and data analytics, the contextualization of data emerging from high-throughput experiments is an important component of a translational informatics pipeline. The contextualization of experimental data is facilitated by mapping the data to background knowledge, which can include information at multiple levels of granularity. An effective representation of the disease needs to relate disease-specific information to background knowledge, so as to help researchers identify how dysfunctional proteins, pathways or other molecular processes lead to the cellular or physiological changes contributing to its aetiology.

Here, we review efforts to represent the context of disease-implicated genes, and we suggest that they can be divided into three broad themes, namely, pathway-centric, molecular network centric and approaches that represent biological statements as a knowledge graph (Table 1). We describe the advantages and drawbacks of the different representations. We do not discuss the details by which the genes are mapped to pathways or networks (for review of approaches to data interpretation see, for example, [2]).

Difficulties with representations of disease mechanism

The aim of contextualization is typically to obtain insight into disease mechanism. However, defining disease mechanism is not straightforward. The most appropriate representation of the context depends, to some extent, on what data are available and what questions are being asked. Disease mechanisms can be represented at different levels of granularity. At high granularity, a disease mechanism would describe all known temporal steps, for example, the detailed steps in a signalling pathway, so that the downstream consequences of proteins aberrant in the disease condition can be followed. However, information about the disease at lower resolution, for example, describing only indirect relationships or correlative relationships, can also give mechanistic insight. Examples of such low-level granularity descriptions of disease include statements like ‘STAT6 activation is linked to mucous metaplasia’ or ‘Fluticasone up-regulates expression of FKBP51 in asthmatics’ (see [3]), or ‘gene ADAM33 implicated in asthma is also implicated in chronic obstructive pulmonary disease and essential hypertension’ [4].
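Coarse-grained statements of this kind can be held in a computable form as subject, relation and object triples. The Python sketch below is illustrative only: the `Statement` type, the free-text relation labels and the helper `statements_about` are hypothetical names chosen here, not part of any format discussed in this review. The statements and their citations are those quoted in the text above.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    """One coarse-grained biological statement (hypothetical structure)."""
    subject: str
    relation: str  # free-text label; an assumption, not a controlled vocabulary
    obj: str
    source: str    # citation marker as used in the text


# The three example statements quoted in the text, split into triples.
STATEMENTS: List[Statement] = [
    Statement("STAT6 activation", "is linked to", "mucous metaplasia", "[3]"),
    Statement("Fluticasone", "up-regulates expression of", "FKBP51", "[3]"),
    Statement("ADAM33", "is implicated in", "asthma", "[4]"),
    Statement("ADAM33", "is implicated in",
              "chronic obstructive pulmonary disease", "[4]"),
    Statement("ADAM33", "is implicated in", "essential hypertension", "[4]"),
]


def statements_about(entity: str) -> List[Statement]:
    """Return every statement that mentions `entity` as subject or object."""
    return [s for s in STATEMENTS if entity in (s.subject, s.obj)]


if __name__ == "__main__":
    for s in statements_about("ADAM33"):
        print(s.subject, s.relation, s.obj, s.source)
```

Keeping the source alongside each triple is a design choice made for this sketch: it lets a statement be traced back to its evidence once many such statements are merged into a single graph.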

Hofmann-Apitius et al. [5] describe mechanism as a causal relationship graph that involves multiple levels of biological organization. Recently, in the Big Mechanism Project (BMP) [6], efforts have been initiated toward building mechanistic models of large, complex biological systems, such as those involved in cancer, using large-scale text mining (TM) followed by model construction using a variety of frameworks. Cohen [6] describes mechanism in terms of a model M, which maps an input x to an output y, $M: \hat{f}(x) \Rightarrow y$, where the real mechanism $f$ is approximated by $\hat{f}$. An important part of the goal of the BMP is not only to develop a collection of such models but also to automate the process of their construction. Therefore, a large part of the project’s efforts is devoted to development of sophisticated TM systems that can automatically capture such mechanistic models directly from the literature [6]. The conceptualization of a mechanistic model put forward by Cohen [6] implies that such a model would be made of parts that reflect some corresponding real components. This view provides a natural way of combining the models (as such parts can be mechanistic models themselves), though it also follows that at some level a simplification will need to be made, e.g. for understanding the regulatory cascade, it may not really be desirable to model the process at the level of individual atoms of the protein molecules involved. Potentially, in the scientific literature, such an abstraction may be set at different levels, as human disease may be, for example, studied at physiological and molecular levels, with causal mechanistic relationships being found at both of them. One of the ongoing challenges in achieving the mechanistic understanding of disease is to develop tools and formalisms that can capture, model and reconcile the representations of these different perspectives.
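The idea that a mechanistic model can be assembled from parts that are themselves models can be illustrated with ordinary function composition. This is a toy sketch under stated assumptions: the two component models, their names and their numerical behaviour are invented for illustration and do not come from Cohen [6] or the BMP.

```python
from typing import Callable

# A model is treated here as an approximation f_hat mapping an input x to an output y.
Model = Callable[[float], float]


def compose(upstream: Model, downstream: Model) -> Model:
    """Combine two models so that the output of `upstream` feeds `downstream`."""
    return lambda x: downstream(upstream(x))


# Hypothetical component models (illustrative behaviour only).
def receptor_activation(ligand: float) -> float:
    """Saturating response: fraction of receptor activated at a ligand level."""
    return ligand / (1.0 + ligand)


def gene_expression(activation: float) -> float:
    """Switch-like response: expression is on above an activation threshold."""
    return 1.0 if activation > 0.5 else 0.0


# The combined model is itself a Model and could be composed further.
cascade: Model = compose(receptor_activation, gene_expression)

if __name__ == "__main__":
    print(cascade(0.5), cascade(3.0))
```

Because the composed `cascade` has the same type as its parts, the level of abstraction can be chosen freely: a single component can later be swapped for a more detailed sub-model without changing the rest.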

Table 1. Approaches to knowledge representation for contextualizing disease biomarkers

| Approach | Examples of formats/frameworks | Advantages | Drawbacks |
|---|---|---|---|
| Pathway-centric | SBML, SBGN, BioPax | Ease of navigation (e.g. using NaviCell, Google Maps API) | Difficult to represent disease context |
| Integrated molecular networks | GeneMania, STRING | Easy-to-use resources and tools | Difficult to represent disease context, although connecting layers of information can provide some context |
| Knowledge graphs | openBEL, RDF, Malacards, BioXM™ | Agility of graph databases; semantic web approaches offer a federated solution; openBEL framework captures context | Lack of formal ontology in graph database representations. Semantic Web solutions do have a formal ontology but lack agility of graph databases such as Neo4j |

There is much information that may help to suggest disease mechanisms to be included in an integrated representation of disease. Examples include information from disease-associated single-nucleotide polymorphisms (SNPs), information from protein–protein interactions and sequence similarity, information on the effects profiled in mutational studies in model organisms and the relationships of molecular variations with physiological and cellular changes. The suitable representation of this information, in a manner that can be efficiently stored, envisioned and queried, is important for obtaining a mechanistic understanding of disease.

What is the most appropriate way to represent disease-related information in a computable format that can be useful for hypothesis generation and for obtaining mechanistic insight? To some extent, this depends on what questions are being asked of the data. Here, we describe three approaches to representation of contextual information, namely, pathway-centric approaches, molecular network centric approaches and information graphs (Figure 1). We contend that data integration is a key component of all three approaches.

Network and graph-based representations are highly effective for integration, query, exchange and visualization of numerous possible relationships and interactions between biological entities. Owing to the well-studied formal mathematics of graphs, they also often serve as a key data structure for algorithmic, machine learning and statistical approaches for analysis of these data. Construction of biological networks from high-throughput experimental data is an essential first step that enables many types of analysis discussed below. While we recognize the importance of this research, in this review, our focus will be primarily on downstream applications of biological and knowledge networks for molecular signature contextualization, but for completeness, we would like to recommend the following works, which cover this closely related topic in detail [7–10]. For convenience of readers, we have included a list of abbreviations used throughout the article in Table 2.

To provide a practical illustration of how different types of contextualization approaches can be used, we have included examples based on a severe asthma differential expression gene signature and a manually curated core asthma gene signature acquired from the DisGeNET database [11]. The differential expression signature was computed from the study by Voraphani et al. [12] (GSE43696), which profiled transcriptomics of moderate and severe asthma, as well as healthy controls. A full description of all analysis performed is outlined in detail in the Supplementary Methods. These examples were developed purely to illustrate the approaches discussed in this review, and potentially other examples relevant to asthma could have been chosen.

Contextualization using integrated networks

Networks or graphs are sets of nodes and edges, where a node represents a concept and an edge represents a relationship. Biological concepts can correspond to real physical entities, like molecules, genes and proteins, or represent more abstract entities, like typed groups (pathways, functional categories) and diseases. The relationships represent a meaningful association between these concepts, such as ‘protein A interacts with protein B’, ‘enzyme participates in pathway’ or ‘protein is a transcription factor’. In many network studies in systems medicine, the nodes are restricted to a particular single type (such as protein), and such graphs are termed homogeneous. In this interpretation, an edge can represent a single relationship type, such as a protein–protein interaction, or it can represent multiple types of evidence that associate the two protein nodes. For example, in the STRING protein–protein interaction (PPI) database [13], evidence types from multiple sources are used to suggest functional associations between proteins. The evidence types include genome location, co-occurrence in the scientific literature and reported associations in other databases. Mechanistic insight into diseases can be obtained by mapping disease-associated genes (e.g. identified from omics data analyses) to such networks. The network neighbourhood of the disease genes provides a context and can suggest new candidates, as well as common themes at the level of molecular processes and pathways that may be associated with disease pathogenesis. GeneMania [14] is a Cytoscape App that enables the construction of network neighbourhoods from a set of seed genes by integration of a number of constituent networks (e.g. PPI, shared protein domains, co-expression).
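The notion of a network neighbourhood around a set of seed genes can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by GeneMania or STRING: the edge list is invented, the partner names are placeholders, and the function names `build_adjacency` and `neighbourhood` are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Toy undirected, homogeneous network. ALOX5 is used as the seed because it
# appears in Figure 1; its partners here are placeholders, not real interactions.
EDGES: List[Tuple[str, str]] = [
    ("ALOX5", "PARTNER_1"),
    ("PARTNER_1", "PARTNER_2"),
    ("PARTNER_2", "PARTNER_3"),
    ("PARTNER_4", "PARTNER_5"),
]


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn an undirected edge list into a node -> neighbours mapping."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def neighbourhood(adjacency: Dict[str, Set[str]],
                  seeds: Iterable[str],
                  radius: int = 1) -> Set[str]:
    """Return all nodes within `radius` edges of any seed, seeds included.

    Seeds that are absent from the network are silently dropped.
    """
    seen = {s for s in seeds if s in adjacency}
    frontier = set(seen)
    for _ in range(radius):
        # Expand one step outward, keeping only nodes not visited before.
        frontier = {n for node in frontier for n in adjacency[node]} - seen
        seen |= frontier
    return seen


if __name__ == "__main__":
    network = build_adjacency(EDGES)
    print(sorted(neighbourhood(network, ["ALOX5"], radius=2)))
```

In practice, edges would carry an evidence type or confidence score, and the expansion would usually be restricted to edges above a chosen threshold; that refinement is omitted here to keep the sketch short.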

Figure 1. Examples of representations for contextualizing disease-associated genes. A pathway-centric view of the network neighbourhood of ALOX5 from ReactomeViz using the Cytoscape plug-in (left). A molecular network centric view using the GeneMANIA Cytoscape plug-in (middle). A heterogeneous network including proteins, pathways and diseases, constructed and displayed using Neo4j (right).

Network representation can be leveraged to discover high-level patterns that underlie the organization of complex biological systems and, more specifically, to identify patterns relating to the development of disease. One prominent early example of this was the work of Barabasi and co-workers [15], which represented disease–gene relationships as a bipartite graph from which monopartite disease–disease and gene–gene graphs could be obtained (the human–disease network and the disease–gene network), giving information about disease comorbidities and thereby contextualising known disease genes. By combining the disease–gene network with known PPI networks, it was demonstrated that the corresponding disease-associated proteins for a given disease have a greater tendency to interact in a PPI network. Likewise, proximity of genes in the network can be a strong indicator of their involvement in similar processes. Genes associated with a given disease tend to be closer together in the interactome, and overlapping disease neighbourhoods are related to greater comorbidity [16].
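The projection of a bipartite disease–gene graph onto its two monopartite graphs can be sketched as follows. This is an illustrative reading of the general technique, not the procedure used in [15]; the disease and gene labels are placeholders, and weighting edges by the number of shared partners is an assumption made here for simplicity.

```python
from collections import defaultdict
from itertools import combinations


def project(bipartite_pairs):
    """Project (disease, gene) pairs onto disease-disease and gene-gene graphs.

    Two diseases are linked if they share at least one gene, and two genes
    are linked if they share at least one disease. Edge weights count the
    shared partners.
    """
    genes_of = defaultdict(set)
    diseases_of = defaultdict(set)
    for disease, gene in bipartite_pairs:
        genes_of[disease].add(gene)
        diseases_of[gene].add(disease)

    disease_graph = {}
    for d1, d2 in combinations(sorted(genes_of), 2):
        shared = genes_of[d1] & genes_of[d2]
        if shared:
            disease_graph[(d1, d2)] = len(shared)

    gene_graph = {}
    for g1, g2 in combinations(sorted(diseases_of), 2):
        shared = diseases_of[g1] & diseases_of[g2]
        if shared:
            gene_graph[(g1, g2)] = len(shared)

    return disease_graph, gene_graph


# Placeholder associations (assumption: not real disease-gene data).
pairs = [("D1", "G1"), ("D1", "G2"), ("D2", "G2"), ("D2", "G3"), ("D3", "G4")]
disease_graph, gene_graph = project(pairs)
```

In this toy input, D1 and D2 become neighbours because they share G2, which is the kind of link that can hint at comorbidity.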

The network can also be analysed to identify notable topological features, many of which have been associated with biologically meaningful properties. One such feature is the modular organization of the network. Modules are groups of nodes that are more interconnected with other nodes within the region than with nodes outside the region [17]. In PPI networks, modules have been shown to correspond to functionally similar groups of proteins, which in some cases may be relevant to disease processes. One such example was identified by [18], who, using genes from seven stress-related diseases together with the STRING network, identified a common subnetwork that may provide insights into the comorbidity of these diseases. An early example is the network model for asthma [19] that combined information from gene co-expression, taken from five public gene expression studies, with PPI networks and information from annotation sources. The resulting ‘global map’ was then used to explore common themes in asthma pathogenesis.
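The defining property of a module, more connections inside than across its boundary, can be checked with a very small sketch. The graph and the candidate module below are invented for illustration, and the function name `module_edge_counts` is hypothetical; real module detection algorithms optimize related quantities over the whole network rather than testing a single given node set.

```python
def module_edge_counts(edges, module):
    """Count edges with both ends inside the module (internal) and edges
    with exactly one end inside it (external)."""
    module = set(module)
    internal = external = 0
    for a, b in edges:
        ends_inside = (a in module) + (b in module)
        if ends_inside == 2:
            internal += 1
        elif ends_inside == 1:
            external += 1
    return internal, external


# Two made-up triangles joined by a single bridge edge (C-D).
toy_edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"),
    ("D", "E"), ("D", "F"), ("E", "F"),
]
internal, external = module_edge_counts(toy_edges, {"A", "B", "C"})
```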

Table 2. List of abbreviations

Abbreviation | Expanded name | Comment
API | Application programming interface | A set of interfaces, tools and functions used for creation of applications
BEL | Biological Expression Language | A curation language for structured capturing of data about biological systems and experiments
BELIEF | BEL Information Extraction workFlow system | A framework for automated parsing of information into BEL format
BioPAX | Biological Pathways Exchange language | A standard that formally defines biological pathway conceptualization in OWL format
BMP | Big Mechanism Project | A project by the US Department of Defense for automated construction of mechanistic cancer models from scientific literature
COPD | Chronic Obstructive Pulmonary Disease | A lung disease characterized by impeded breathing and phenotypically similar to asthma
Cypher | | A query language for the Neo4j graph database
DBMS | Database management system | Software providing capabilities to mediate storage, query and manipulation of data
EBI | European Bioinformatics Institute |
EFO | Experimental Factor Ontology |
GO | Gene Ontology | One of the most widely used ontologies for representing functions of biological entities
GWAS | Genome-wide association study | An observational study that relates germ-line variations between individuals to phenotypes
IR | Information Retrieval | A process of extraction of relevant information from some wider superset
Kappa | | A rule-based declarative language used mostly in the molecular biology domain
Neo4j | | Graph-based database solution
OBO | Open Biomedical Ontologies initiative | OBO format is an alternative language for authoring ontologies, mainly used in biological sciences
OWL | Web Ontology Language | Currently the most widely used language for authoring ontologies
PD | Process Description language | An extension of the SBGN standard with additional capabilities to represent temporal aspects of biological processes
PPI | Protein–protein interaction |
RDF | Resource Description Framework | Core data exchange format for the Semantic Web
SBGN | Systems Biology Graphical Notation | A standard for graphical representation for the biological and biochemical domain
SBML | Systems Biology Markup Language | An XML-based format for storing and exchanging biological models
SBO | Systems Biology Ontology |
SNP | Single Nucleotide Polymorphism | A single-base variant or mutation in a genomic sequence
SQL | Structured Query Language | Family of similar query languages used for data management in relational databases
TM | Text Mining | A process of extraction of structured data from free text
TMO | Translational Medicine Ontology |
URL | Uniform Resource Locator | Web address resolving to a resource on the Web
XML | Extensible Markup Language | Popular meta-language for document markup

A powerful example of exploiting the network neighbourhood of known disease genes to gain mechanistic insight is given in [20]. Using a set of 129 asthma-implicated (‘seed’) genes that map to the human interactome, together with a novel algorithm [16] for module detection, Sharma et al. [20] identified a potential asthma disease module containing 441 genes, including 91 seeds. A large number (162) of pathways had at least half of their genes in the putative disease module. However, a strategy involving prioritizing genes on the basis of their asthma relevance, using gene expression and GWAS information, enabled a ranking of pathways and subsequent identification of pathways not traditionally associated with asthma. The plausibility of one of these (the GAB1 signalosome) was shown experimentally. The authors also explored the differential expression of the GAB1 gene in other immune-related diseases. This study illustrates the use of both pathways and networks with other supporting evidence (e.g. gene expression, GWAS) to contextualize disease genes and provide new avenues for understanding disease mechanisms. It highlights the need for knowledge representations that can capture multiple levels of information, such as the conditions in which a gene may be up-regulated.

One approach to the representation of the complex networks that are associated with disease is to decompose the information into layers, with each layer representing information at a different level of biological organization. Radonjic and co-workers [21, 22] used network-based methods to represent and integrate information at three levels, namely, differentially expressed biological processes, transcription factors whose target proteins are differentially expressed and physiological information such as energy intake and weight. They explored adaptation of white adipose tissue to a high-fat diet, a biological system that may give clues as to which molecular processes are involved in obesity and type 2 diabetes. Gustafsson et al. [17] discussed the concept of multilayer disease modules as a possible framework to generate predictive markers that capture multiple types of disease-related information (PPI, symptoms, etc.). The integration of the layers and the inclusion of additional contextual information associated with the nodes and relationships of the network (e.g. cell-type specificities) are important for mechanistic insight into disease. Network-based data fusion approaches to the integration of multiple types of biological information relevant to disease are described in [23, 24], and these offer the potential to predict new disease-associated genes as well as to explore disease–disease relationships.

The network-based integration of different types of information can suggest underlying mechanisms of disease. Huan and co-workers [25] performed an integrated analysis to explore mechanisms of blood pressure regulation. They used weighted gene co-expression network analysis (WGCNA) [26] to identify co-expressed modules that showed correlation with blood pressure measurements, integrated SNP data as well as PPI networks and used Bayesian analysis to identify key drivers of the biological process, one of which was selected for further experimental study. The identification of associations between diseases can suggest a common mechanistic basis [27] and can be used predictively [28]. Molecular network approaches to representing disease landscapes offer particular views rather than a unified framework. In some ways, they are similar to information overlay methods in data analytics. Points of correspondence between the different views (networks) need to be found to get an integrated picture that will be useful for suggesting mechanism.

To illustrate these contextualization strategies, 22 genes from the differential expression signature (derived from the GSE43696 study, as outlined in Supplementary Methods) were mapped to the protein association network from the STRING database. First, the results were explored using the Web-based interactive visualization offered by the STRING database (Figure 2). As proteins realize their function through complex sets of interactions, the implications of gene expression changes can often be better understood by assessing their impact on the wider interactome. The top panel shows disease signature genes as well as 50 genes from the first shell identified as most relevant to this input. The analysis suggested several high-confidence modules, most of which contained prominent asthma-associated signalling pathways (HIF [29], TGF-beta [30], cytokine–cytokine receptor interaction [31] and circadian clock-related [32]). Several high-degree nodes connecting these multiple modules are also known to be implicated in asthma: tyrosine kinase FYN is involved in inflammation [33], IL1B has been shown to be genetically associated with asthma susceptibility [34] and the node with the highest degree (SMAD4) is involved in airway smooth muscle hyperplasia [35].

Given that the number of identified human PPIs is now in the millions, visual exploration is typically limited by the sheer amount of relevant information that can be usefully displayed. Approaches like network diffusion and graph random walks offer a mathematically robust way to select the most relevant nodes, even in large graphs. These methods are effective for disease analysis, as disease-related genes are commonly located close to each other in the network. In this example, we have demonstrated the applicability of this key principle and also identified the 10 most relevant (closest to the signature) genes by the diffusion state distance metric [36]. This analysis highlighted the potential relevance of the PTEN gene, known to be prominently involved in airway hyperresponsiveness and inflammation [37], as well as a small module with multiple members of the PI3K/Akt pathway that promotes asthma-related airway remodelling [38].
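A random-walk-based distance of this kind can be sketched in a few lines. The version below is a simplified, fixed-length variant written under the assumption that the distance between two nodes is the L1 difference between their vectors of expected visit counts over a k-step random walk; it is meant as an illustration of the principle only and is not the implementation used for the analysis reported here. The graph is a made-up example in which every node has at least one neighbour.

```python
def visit_profiles(adjacency, steps):
    """For every start node, compute the expected number of visits to each
    node during a `steps`-step uniform random walk (the start counts once)."""
    nodes = sorted(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    profiles = {}
    for start in nodes:
        prob = [0.0] * len(nodes)
        prob[index[start]] = 1.0
        visits = prob[:]
        for _ in range(steps):
            nxt = [0.0] * len(nodes)
            for node in nodes:
                p = prob[index[node]]
                if p == 0.0:
                    continue
                # Spread the probability mass evenly over the neighbours.
                share = p / len(adjacency[node])
                for neighbour in adjacency[node]:
                    nxt[index[neighbour]] += share
            prob = nxt
            visits = [v + q for v, q in zip(visits, prob)]
        profiles[start] = visits
    return profiles


def diffusion_state_distance(profiles, a, b):
    """L1 distance between the visit profiles of nodes a and b."""
    return sum(abs(x - y) for x, y in zip(profiles[a], profiles[b]))


# Made-up undirected graph: a triangle (A, B, C) with a tail C-D-E.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
profiles = visit_profiles(toy, steps=3)
d_ab = diffusion_state_distance(profiles, "A", "B")
d_ae = diffusion_state_distance(profiles, "A", "E")
```

In this toy graph, A and B occupy equivalent positions in the triangle, so their profiles are much more similar to each other than to that of the peripheral node E. Ranking all nodes by their distance to a set of signature genes is one way to select a small, relevant neighbourhood.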

Contextualization by pathway centric representations

Understanding the mechanism of disease often starts with the identification of the aberrant molecular pathways that may be associated with the disease. Molecular pathway models aim to capture sequences of actions or events of molecular ‘actors’. Some examples of commonly modelled events include biochemical reactions, state changes and movements between different cellular compartments. Mapping known disease genes to known pathways is a well-established methodology for contextualization, as pathways capture an intermediate level of organization between molecular entities and phenotype. The identification of hallmark pathways believed to be associated with disease conditions has been used to create pathway-centric disease maps [39], which are useful for visual exploration and for the overlay of data from, for example, gene expression studies. Typically, such approaches involve identification of disease-related pathways by careful manual curation of the literature. AlzPathway [40] is a collection of signalling pathways in Alzheimer’s disease. The map is represented in the Systems Biology Graphical Notation (SBGN) [41] process description (PD) language and was also made available in other formats. The resource also offers a Web interface that allows users to review and comment on annotations, thus facilitating community curation. Another example of a pathway-based disease map is the Parkinson’s disease map [42], which builds on an understanding of the hallmark metabolic, regulatory and signalling pathways of the disease [39]. The Parkinson’s disease map can be accessed using the Minerva Web server, which has advanced functionality for visualization (using the Google Maps API) that facilitates manual exploration [43]. The Atlas of Cancer Signalling Networks [44] uses a similar approach, where several types of biological entities, like phenotype, ion and drug, can be represented on a zoomable map in SBGN format. Clicking on an entity brings up a box with its description and relevant annotations.

Ideally, a computational framework for contextualizing disease-implicated genes should allow annotation by members of the research community, and some selected frameworks offer this important functionality. WikiPathways [45] contains a set of open, evolving, updateable disease-associated pathways. PathWhiz [46] is a Web-based tool that allows users to draw pathways so as to include contextual information such as cells, tissues and organs. The pathway maps are designed to be visually appealing to facilitate manual exploration. PathWhiz includes metabolic and signalling pathways, as well as a number of disease-associated pathways. Recently, the PathWay Collage tool [47] has been developed that allows the visualization of fragments of pathways (defined by the user, so as to reduce the complexity of full pathway representations) onto which omics data can be overlaid.

[Figure 2 image not reproduced. Panel B plots diffusion state distance (y-axis) for four groups: Random, DisGeNET signature, Severe Asthma signature and Severe Asthma vs DisGeNET; panel titles are ‘Distances between members of gene signatures’ and ‘Ten closest genes to severe asthma gene expression signature’. Panel C shows ILK, FLT3, PTEN, YWHAB, YWHAQ, YWHAE, MATK, YWHAZ, SFN and YWHAG.]

Figure 2. Analysis of the severe asthma differential expression gene signature in the context of the STRING protein association network. (A) Fifty genes in the first shell of the 22-gene signature: (a–c) high-degree module interconnectivity genes; (1) calmodulins and HIF-1 signalling pathway; (2) TGF-beta signalling pathway; (3) cytokine–cytokine receptor interaction pathway; and (4) circadian rhythm-related genes. (B) Diffusion state distance of gene sets; this measure is derived from the similarity of random walk profiles for each pair of nodes. Random pairs of nodes are shown in grey. Distances between members of asthma signatures to each other are shown in blue (the severe asthma versus DisGeNET category shows distances between members of the two sets). Finally, the 10 closest genes to the differential expression signature set are shown in green. (C) Ten closest genes to the severe asthma signature, visualized in the STRING database viewer.

At present, there are two leading exchange standards adopted for representing biochemical pathway networks. The first one is the Systems Biology Markup Language (SBML) [48], which was primarily intended to support the development and sharing of quantitative biochemical models. Therefore, SBML has been designed to be generic, with the main building blocks being species (quantifiable physical entities), processes (which define how entities are manipulated) and compartments (which provide low-level context for processes and entities). As the SBML format aims to be generic, further subcategories of these concepts are not defined as part of the standard, and instead, particular emphasis is placed on enabling mathematical definition of the model (e.g. in the form of ordinary differential equations). Some of the necessary biological underpinnings are added via a related project, the SBGN [41] standard, which defines how different SBML entities can be represented graphically. In SBML, fine-grained annotation or specification of complex contexts is possible through special fields set aside for user data. However, as these fields are not bound by the standard, they may not be supported by all of the tools and therefore generally would not be used in evaluation of the model.
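The three generic building blocks and their link to a mathematical model can be illustrated with a toy example. The sketch below is plain Python, not SBML syntax and not the API of any SBML library; the class names, the compartment label, the single first-order reaction and its rate constant are all assumptions chosen for illustration. It integrates d[A]/dt = -k[A] with explicit Euler steps.

```python
from dataclasses import dataclass


@dataclass
class Species:
    name: str
    compartment: str   # low-level context for the species
    amount: float      # quantifiable physical entity


@dataclass
class Reaction:
    reactant: str
    product: str
    rate_constant: float  # first-order mass-action constant (hypothetical)


def euler_step(species, reactions, dt):
    """Advance all amounts by one explicit Euler step of the rate equations."""
    deltas = {name: 0.0 for name in species}
    for reaction in reactions:
        flux = reaction.rate_constant * species[reaction.reactant].amount * dt
        deltas[reaction.reactant] -= flux
        deltas[reaction.product] += flux
    for name, delta in deltas.items():
        species[name].amount += delta


species = {
    "A": Species("A", "cytosol", 10.0),
    "B": Species("B", "cytosol", 0.0),
}
reactions = [Reaction("A", "B", rate_constant=0.5)]

# Simulate one time unit in 100 small steps.
for _ in range(100):
    euler_step(species, reactions, dt=0.01)
```

Note that nothing in this model says what A and B are biologically; that kind of meaning has to be supplied through annotations, which is the gap that the ontology terms discussed later are meant to fill.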

The SBGN PD language [41] was designed to give an unambiguous representation of the mechanism of action within a biological pathway that can be interpreted visually to facilitate manual exploration. The PD language can represent all the steps in, for example, a reaction, showing how a biological entity changes in a temporal manner. However, fully exploiting the PD representation for applications like numerical modelling would require additional data associated with component steps (such as rate constants), which often are not readily available.

Another prominent data exchange standard is BioPAX [49], which was primarily designed for unambiguous sharing of biochemical pathway data and therefore offers a much wider selection of domain-specific terms to characterize and categorize both interactors and interactions. As the Web Ontology Language (OWL) forms the core of this standard, all of the terms are organized hierarchically, and relationships between them are formally defined. The BioPAX standard aims to facilitate data sharing and integration via Semantic Web technologies and the Resource Description Framework (RDF). If an RDF representation is used, the format can be easily extended to incorporate complex context information; however, BioPAX currently does not aim to offer the means of defining experimental or biomedical context for pathways, as by design, other standards/ontologies can be brought in to model them.

Both SBML and BioPAX are predicated on the developments of the past decade, when several key technologies and design principles were introduced that greatly influenced how biomedical data-sharing standards are implemented. One of these developments was the increasing adoption of formal ontologies, which allowed the development of well-structured controlled vocabularies for different domains and the definition of cross-mappings between them. For example, the SBML standard allows annotation of models with Systems Biology Ontology (SBO) terms, which are also used in SBGN and can be mapped on to BioPAX via SBPAX [50]. Therefore, in principle, it is possible to integrate models across all of these standards on a qualitative level. On a syntactic level, different formalisms can be integrated by introducing support for a low-level common language like RDF, which is representation-agnostic and allows all types of data to be linked via common identifiers. Both of these solutions enable modern standards to be designed in a modular way, where different specialized representations can be developed by different communities of domain experts and then combined together, or even extended for the needs of a particular applied project.

In recent years, community efforts have been increasingly devoted to improving inter-compatibility across all main biomedical standards. These efforts include the development of converters, controlled vocabulary mappings and better support of different standards by both data providers and analytical tool developers. Greater compatibility also means that smaller models (e.g. from individual studies) can now be meaningfully shared and combined. One prominent resource that aims to facilitate this process is NDEx [51]. NDEx allows researchers to share their models in a variety of different formats, which can then be used directly, by using the NDEx website like a data warehouse, or integrated into a common representation by using the NDEx Cytoscape [52] plug-in.

Pathway-based representations enable relevant methods to incorporate the directionality and interaction type (e.g. up- or down-regulation), and in this way, they can capture causal associations underlying different disease mechanisms. To illustrate this application, we have used the ‘Tied Diffusion through Interacting Events’ method [53], which can identify significantly implicated pathways connecting two sets of genes (in this case, the DisGeNET asthma signature and our severe asthma signature). The diffusion model used by this method can take into account the magnitude of effects and the direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome found to be significant for these two sets was visualized in Cytoscape (Figure 3). It is possible to see that two interleukin genes (both involved in airway inflammation [54]) were found to be particularly critical, with the lower panel showing the key associations involved.

Contextualization by knowledge graph representations

There are many specialist sources of disease-related information, such as databases of mutations and their consequences. In addition, there is vast information in unstructured formats in the scientific literature, and much of it is contextual, for example, statements that describe correlations between gene products, or between small molecules and gene products, that are observed in particular cell types. Information in the literature may also describe weak associations only seen in certain conditions, or for which there may be limited evidence, that could nevertheless still be useful for hypothesis generation.

The Biological Expression Language (BEL) [55] represents statements in biology, typically correlative and causative statements, as well as the context in which these statements apply. A biological statement is represented as a triple (subject, predicate and object), where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object. A BEL term can be the abundance of a biological entity, or it can be a biological process such as one described in the Gene Ontology (GO) [56] or even a given disease condition. Each entity is described in an existing namespace (e.g. from databases like ChEBI [57]) or in particular user-defined namespace(s). There is a fixed set of causal, correlative and other relationships that link these entities. The BEL compiler checks that the statements are syntactically correct and carries out the integration and alignment of the data by establishing equivalences between identical concepts.

The knowledge captured by BEL statements can be represented as a graph. The object can be a BEL term or a BEL statement. Each statement can be associated with metadata that contextualizes it, for example, qualifying it to be true only in specific tissues. The BEL knowledge network can facilitate data interpretation using a reverse causal reasoning approach [58]. For example, gene expression data can be mapped to smaller subnetworks that represent cause and effect models. The extent to which the models can explain the measurements of differentially expressed genes can then be assessed and potentially yield insight into the mechanism.
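The combination of context-qualified triples and a cause-and-effect check against expression data can be sketched as follows. This is a deliberately simplified illustration: the statements use placeholder entity names rather than real BEL syntax or namespaces, the `Statement` class and both helper functions are hypothetical, and the scoring is a plain fraction of correctly predicted directions rather than the statistics used in published reverse causal reasoning methods [58].

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    subject: str
    predicate: str                                 # "increases" or "decreases"
    obj: str
    context: dict = field(default_factory=dict)    # e.g. {"tissue": "lung"}


def applies_in(statement, **required):
    """True if the statement's context matches every required key/value."""
    return all(statement.context.get(k) == v for k, v in required.items())


def explained_fraction(statements, upstream, observed):
    """Fraction of observed changes whose direction agrees with what the
    statements predict when `upstream` is increased.

    `observed` maps entity -> +1 (up-regulated) or -1 (down-regulated);
    entities without a prediction are ignored.
    """
    sign = {"increases": 1, "decreases": -1}
    predicted = {
        s.obj: sign[s.predicate]
        for s in statements
        if s.subject == upstream and s.predicate in sign
    }
    scored = [entity for entity in observed if entity in predicted]
    if not scored:
        return 0.0
    agree = sum(1 for entity in scored if predicted[entity] == observed[entity])
    return agree / len(scored)


# Placeholder knowledge (assumption: invented for illustration only).
knowledge = [
    Statement("X", "increases", "Y", {"tissue": "lung"}),
    Statement("X", "decreases", "Z", {"tissue": "lung"}),
    Statement("X", "increases", "W", {"tissue": "liver"}),
]
lung_model = [s for s in knowledge if applies_in(s, tissue="lung")]
score = explained_fraction(lung_model, "X", {"Y": 1, "Z": 1, "Q": -1})
```

Filtering by context before scoring is the point of the example: the same upstream hypothesis may explain the data well in one tissue and poorly in another.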

Hofmann-Apitius [5] posits that an understanding of disease mechanisms involves integrating information at multiple biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer’s disease [59–62]. They used the BEL language to represent information from the literature together with relevant experimental data sets to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge ‘grey zone’, which includes both established and emerging connections between entities. This approach can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, TM will be increasingly important for supporting this process [63]. Existing examples include the BEL Information Extraction workFlow (BELIEF) system and BELSmile [64]. These frameworks work by using automated TM to detect BEL concepts and relationships and then allowing the curators to explore and annotate their biological contexts to create final high-quality networks.

Figure 3. Significant pathways identified using the Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and the severe asthma differential expression signature were considered in the context of the Reactome regulatory pathway network. In both panels, colour intensity indicates the weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of ‘component’ type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible form: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called the ‘COPD knowledge base’ [67–70]. The authors of the COPD knowledge base acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The Malacards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and Malacards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for the management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries could be challenging to realize in relational databases because of the cost of computing statements across different tables (e.g. multi-table joins). Traversal queries can be an important way of identifying new relationships in integrated data, particularly in cases where links only become apparent once multiple sources have been integrated but may not be immediately obvious because of the sheer amount of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like the ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is the Neo4j DBMS. Neo4j is a Java-based solution that also offers its own graph query language (Cypher), which is conceptually similar to SQL.

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, the Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where the subject and object are entities with a unique identifier and the predicate defines the type of relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity and that can, in turn, link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
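The three-part statement structure and the role of shared identifiers can be illustrated without any Semantic Web tooling. The sketch below stores triples as plain tuples of URL strings and answers pattern queries with wildcards; the `http://example.org/` namespace and every identifier under it are placeholders invented for this example, not terms from a real vocabulary, and the `match` helper is hypothetical.

```python
EX = "http://example.org/"  # placeholder namespace (assumption)

triples = {
    (EX + "gene/G1", EX + "rel/participatesIn", EX + "pathway/P1"),
    (EX + "gene/G2", EX + "rel/participatesIn", EX + "pathway/P1"),
    (EX + "pathway/P1", EX + "rel/associatedWith", EX + "disease/D1"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return, in sorted order, all triples matching the pattern.
    A value of None acts as a wildcard for that position."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Which entities participate in pathway P1?
members = [
    s for s, _, _ in match(
        triples,
        predicate=EX + "rel/participatesIn",
        obj=EX + "pathway/P1",
    )
]

# Because identifiers are shared, statements from another source about
# the same pathway can be merged by simple set union.
other_source = {(EX + "pathway/P1", EX + "rel/label", "toy pathway")}
merged = triples | other_source
```

The merge step is where globally unique identifiers pay off: no schema alignment is needed for the two sources to describe the same node.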

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard, useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, an additional set of standards will usually be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include the EBI RDF platform [83] (which covers important databases such as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (which aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (which uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include the Experimental Factor Ontology (EFO) [91], SBO [92], the Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used the Neo4j-based Hetionet v1.0 resource [95], which integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, the severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of query-specific degree of drug nodes (i.e. number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
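The logic of such a multi-hop query (disease to pathways, pathways to proteins, proteins to drugs, then ranking drugs by query-specific degree) can be sketched over a toy heterogeneous graph. All node names and relation labels below are invented; they do not reproduce the Hetionet schema or its contents, and the function `drug_degrees` is a hypothetical stand-in for what would be a single traversal query in a graph database.

```python
from collections import Counter

# (source, relation, target) edges of a made-up heterogeneous graph.
edges = [
    ("P1", "participates_in", "PW1"),
    ("P2", "participates_in", "PW1"),
    ("P3", "participates_in", "PW2"),
    ("PW1", "associated_with", "asthma"),
    ("drugA", "targets", "P1"),
    ("drugA", "targets", "P2"),
    ("drugB", "targets", "P2"),
    ("drugC", "targets", "P3"),
]


def drug_degrees(edges, disease):
    """Count, per drug, the links to proteins that sit in pathways
    associated with the given disease."""
    pathways = {s for s, r, t in edges if r == "associated_with" and t == disease}
    proteins = {s for s, r, t in edges if r == "participates_in" and t in pathways}
    return Counter(s for s, r, t in edges if r == "targets" and t in proteins)


ranking = drug_degrees(edges, "asthma").most_common()
```

Here drugC is absent from the ranking because its only target lies in a pathway that is not linked to the disease, which mirrors how the query-specific degree differs from a drug's overall degree in the graph.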

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data typically require different levels of both preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are construction of co-expression networks [98, 99] from transcriptomic data and sequence homology graphs [100, 101]. However, manual interpretation and publishing of inferred models as free text in scientific papers currently remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.
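Co-expression networks are a useful example of a representation that can be derived without human interpretation. The sketch below assumes the simplest possible construction, namely linking two genes when the absolute Pearson correlation of their expression profiles exceeds a threshold. The gene names, expression values and the 0.9 threshold are hypothetical, and the cited resources may use different similarity measures and quality controls.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(expr, threshold=0.9):
    """Return (gene, gene, r) for every pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles across four samples.
EXPR = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.1, 5.9, 8.0],
    "gene_c": [3.0, 1.0, 4.0, 1.5],
}

edges = coexpression_edges(EXPR)
```

With these toy values only `gene_a` and `gene_b` are linked, since their profiles rise together while `gene_c` is only weakly correlated with either.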

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases; (B) query to explore the connections between niclosamide drug, asthma and differential expression signature linked via relevant proteins and pathways.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In the simplest case, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiles data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
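The co-occurrence idea can be sketched as follows, assuming that concept mentions have already been recognized in each document. The documents, concept names and the minimum count of two are hypothetical, and real resources typically score pairs with a statistical measure rather than a raw count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-document sets of recognized concepts.
DOCS = [
    {"gene_a", "disease_x"},
    {"gene_a", "disease_x", "gene_b"},
    {"gene_b", "disease_y"},
]


def cooccurrence_network(docs, min_count=2):
    """Link concept pairs that appear together in at least min_count documents."""
    counts = Counter()
    for concepts in docs:
        # Sorting makes each unordered pair map to a single key.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


network = cooccurrence_network(DOCS)
```

Here only the `disease_x`/`gene_a` pair survives the cut-off, which illustrates both the appeal of the method (no relationship extraction is needed) and its limitation (the edge says nothing about how the two concepts are related).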

Extraction of information from text is contingent on unambiguous identification of entities and relationship types, from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].
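A minimal sketch of how a controlled vocabulary resolves alternative names to a common identifier is shown below. The identifiers (`EX:0001`, `EX:0002`) are hypothetical and do not belong to any real ontology, and simple case-insensitive lookup is assumed, whereas production systems also handle spelling variants and ambiguity.

```python
# Hypothetical controlled vocabulary: identifier -> accepted names.
VOCAB = {
    "EX:0001": ["interleukin-1 beta", "IL-1b", "IL1B"],
    "EX:0002": ["asthma", "bronchial asthma"],
}


def build_index(vocab):
    """Invert the vocabulary into a lower-cased name -> identifier lookup."""
    return {
        name.lower(): ident
        for ident, names in vocab.items()
        for name in names
    }


def normalize(mention, index):
    """Map a text mention to its identifier, or None if it is unknown."""
    return index.get(mention.strip().lower())


INDEX = build_index(VOCAB)
```

Once mentions are mapped to shared identifiers, statements extracted from different papers can be merged into one network even when the authors used different names for the same entity.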

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].
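The dependency between workflow stages can be illustrated with a toy two-step pipeline, in which relationship extraction consumes the output of named entity recognition. Everything here is a deliberately naive assumption made for illustration: the lexicon, the trigger words and the example sentence are hypothetical, and real tools use far more sophisticated models for both steps.

```python
import re

# Hypothetical dictionary-based entity lexicon and relation trigger words.
LEXICON = {"niclosamide": "Drug", "tmem16a": "Protein"}
TRIGGERS = {"inhibits", "antagonizes"}


def recognize_entities(sentence):
    """Step 1: dictionary-based named entity recognition."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [(tok, LEXICON[tok.lower()]) for tok in tokens if tok.lower() in LEXICON]


def extract_relations(sentence, entities):
    """Step 2: relate each drug to each protein if a trigger word is present."""
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", sentence)}
    found = sorted(words & TRIGGERS)
    if not found:
        return []
    drugs = [name for name, kind in entities if kind == "Drug"]
    proteins = [name for name, kind in entities if kind == "Protein"]
    return [(d, found[0], p) for d in drugs for p in proteins]


def pipeline(sentence):
    """Chain the two steps; the second depends on the output of the first."""
    return extract_relations(sentence, recognize_entities(sentence))


relations = pipeline("Niclosamide inhibits TMEM16A.")
```

Because each stage only sees what the previous one emitted, an agreed interchange format between stages matters as much as the quality of any single tool.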

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge included, for the first time, a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and obtained the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, the designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
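For readers unfamiliar with the metric, the F-measure is the (optionally weighted) harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives and false negatives. The counts in the example are invented purely to show how a score of 0.2 can arise and are not the actual figures from the challenge.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only: 20 correct statements, 60 spurious, 100 missed.
score = f_measure(tp=20, fp=60, fn=100)
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high F-measure by extracting very few statements with high precision or very many with low precision.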

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text, in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question therefore becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). Although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. The future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and ultimately clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.


17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B, and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.


59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting. Boston, USA: Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.


99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing. Big Island, Hawaii: NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.


single-nucleotide polymorphisms (SNPs), information from protein–protein interactions and sequence similarity, information on the effects profiled in mutational studies in model organisms and the relationships of molecular variations with physiological and cellular changes. The suitable representation of this information in a manner that can be efficiently stored, visualized and queried is important for obtaining a mechanistic understanding of disease.

What is the most appropriate way to represent disease-related information in a computable format that can be useful for hypothesis generation and for obtaining mechanistic insight? To some extent, this depends on what questions are being asked of the data. Here, we describe three approaches to representation of contextual information, namely, pathway-centric approaches, molecular network centric approaches and knowledge graphs (Figure 1). We contend that data integration is a key component of all three approaches.

Network and graph-based representations are highly effective for integration, query, exchange and visualization of numerous possible relationships and interactions between biological entities. Owing to the well-studied formal mathematics of graphs, they also often serve as a key data structure for algorithmic, machine learning and statistical approaches for analysis of these data. Construction of biological networks from high-throughput experimental data is an essential first step that enables many types of analysis discussed below. While we recognize the importance of this research, in this review our focus will be primarily on downstream applications of biological and knowledge networks for molecular signature contextualization; for completeness, we recommend the following works, which cover this closely related topic in detail [7–10]. For the convenience of readers, we have included a list of abbreviations used throughout the article in Table 2.

To provide a practical illustration of how different types of contextualization approaches can be used, we have included examples based on a severe asthma differential expression gene signature and a manually curated core asthma gene signature acquired from the DisGeNET database [11]. The differential expression signature was computed from the study by Voraphani and Gladwin [12] (GSE43696), which profiled transcriptomics of moderate and severe asthma as well as healthy controls. A full description of all analyses performed is given in the Supplementary Methods. These examples were developed purely to illustrate the approaches discussed in this review, and other examples relevant to asthma could equally have been chosen.

Contextualization using integrated networks

Networks, or graphs, are sets of nodes and edges, where a node represents a concept and an edge represents a relationship. Biological concepts can correspond to real physical entities like molecules, genes and proteins, or represent more abstract entities like typed groups (pathways, functional categories) and diseases. The relationships represent a meaningful association between these concepts, such as ‘protein A interacts with protein B’, ‘enzyme participates in pathway’ or ‘protein is a transcription factor’. In many network studies in systems medicine, the nodes are restricted to a particular single type (such as protein), and such graphs are termed homogeneous. In this interpretation, an edge can represent a single relationship type, such as a protein–protein interaction, or it can represent multiple types of evidence that associate the two protein nodes. For example, in the STRING protein–protein interaction (PPI) database [13], evidence types from multiple sources are used to suggest functional associations between proteins. The evidence types include genome location, co-occurrence in the scientific literature and reported associations in other databases. Mechanistic insight into diseases can be obtained by mapping disease-associated genes (e.g. identified from omics data analyses) to such networks. The network neighbourhood of the disease genes provides a context and can suggest new candidates, as well as common themes at the level of molecular processes and pathways that may be associated with disease pathogenesis. GeneMANIA [14] is a Cytoscape App that enables the construction of network neighbourhoods from a set of seed genes by integration of a number of constituent networks (e.g. PPI, shared protein domains, co-expression).
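The idea of a homogeneous network and of a seed-gene neighbourhood can be made concrete with a minimal sketch. The node names, the edge list and the function names below are hypothetical and chosen only for illustration; this is not how STRING or GeneMANIA build their networks.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def first_shell(adjacency, seeds):
    """Return nodes directly connected to any seed, excluding the seeds."""
    seeds = set(seeds)
    shell = set()
    for seed in seeds:
        shell |= adjacency.get(seed, set())
    return shell - seeds


# Hypothetical toy protein-protein edges, for illustration only.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
graph = build_graph(toy_edges)
neighbours = first_shell(graph, ["A", "C"])
```

Here the first shell of the seeds A and C contains B and D; real tools typically also weight edges by evidence type, which this sketch omits.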

Figure 1. Examples of representations for contextualizing disease-associated genes. A pathway centric view of the network neighbourhood of ALOX5 from ReactomeViz using the Cytoscape plug-in (left). A molecular network centric view using the GeneMANIA Cytoscape plug-in (middle). A heterogeneous network including proteins, pathways and diseases, constructed and displayed using Neo4j (right).

Network representation can be leveraged to discover high-level patterns that underlie the organization of complex biological systems and, more specifically, to identify patterns relating to development of disease. One prominent early example of this was the work of Barabási and co-workers [15], which represented disease–gene relationships as a bipartite graph from which monopartite disease–disease and gene–gene graphs could be obtained (the human disease network and the disease gene network), giving information about disease comorbidities and thereby contextualizing known disease genes. By combining the disease gene network with known PPI networks, it was demonstrated that the disease-associated proteins for a given disease have a greater tendency to interact in a PPI network. Likewise, proximity of genes in the network can be a strong indicator of their involvement in similar processes. Genes associated with a given disease tend to be closer together in the interactome, and overlapping disease neighbourhoods are related to greater comorbidity [16].
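The projection of a bipartite disease–gene graph onto monopartite disease–disease and gene–gene graphs can be sketched as follows. The association list and the function name are hypothetical, and weighting edges by the number of shared partners is an assumption made here for illustration rather than a detail taken from [15].

```python
from collections import defaultdict
from itertools import combinations


def project(associations):
    """Project (disease, gene) pairs onto disease-disease and gene-gene graphs.

    Two diseases are linked if they share a gene; two genes are linked if
    they share a disease. Edge weights count the shared partners.
    """
    genes_of = defaultdict(set)
    diseases_of = defaultdict(set)
    for disease, gene in associations:
        genes_of[disease].add(gene)
        diseases_of[gene].add(disease)

    disease_graph = defaultdict(int)
    for diseases in diseases_of.values():
        for pair in combinations(sorted(diseases), 2):
            disease_graph[pair] += 1

    gene_graph = defaultdict(int)
    for genes in genes_of.values():
        for pair in combinations(sorted(genes), 2):
            gene_graph[pair] += 1

    return dict(disease_graph), dict(gene_graph)


# Hypothetical associations, for illustration only.
toy = [("d1", "g1"), ("d1", "g2"), ("d2", "g2"), ("d2", "g3"), ("d3", "g4")]
disease_graph, gene_graph = project(toy)
```

In this toy input, d1 and d2 become linked because they share g2, while d3 stays isolated in the disease–disease projection.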

The network can also be analysed to identify notable topological features, many of which have been associated with biologically meaningful properties. One such feature is the modular organization of the network. Modules are groups of nodes that are more densely connected to other nodes within the group than to nodes outside it [17]. In PPI networks, modules have been shown to correspond to functionally similar groups of proteins, which in some cases may be relevant to disease processes. One such example was identified by [18], who, using genes from seven stress-related diseases together with the STRING network, identified a common subnetwork that may provide insights into comorbidity of these diseases. An early example is the network model for asthma [19] that combined information from gene co-expression taken from five public gene expression studies with PPI networks and information from annotation sources. The resulting ‘global map’ was then used to explore common themes in asthma pathogenesis.
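The informal definition of a module given above can be checked on a toy graph by comparing edges inside a candidate group with edges crossing its boundary. The simple "more internal than boundary edges" criterion and the function names are assumptions for illustration; published module detection methods rely on statistical scores rather than this raw count.

```python
def edge_counts(edges, module):
    """Count edges inside a candidate module and edges crossing its boundary."""
    module = set(module)
    internal = sum(1 for a, b in edges if a in module and b in module)
    boundary = sum(1 for a, b in edges if (a in module) != (b in module))
    return internal, boundary


def is_module(edges, module):
    """Assumed criterion: strictly more internal than boundary edges."""
    internal, boundary = edge_counts(edges, module)
    return internal > boundary


# Hypothetical graph: two triangles joined by the single bridge c-d.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
triangle = is_module(toy_edges, {"a", "b", "c"})
bridge = is_module(toy_edges, {"c", "d"})
```

The triangle a, b, c passes the check (three internal edges against one boundary edge), whereas the bridge pair c, d does not.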

A powerful example of exploiting the network neighbourhood of known disease genes to get mechanistic insight is given in [20]. Using a set of 129 asthma-implicated (‘seed’) genes that map to the human interactome, together with a novel algorithm [16] for module detection, Sharma et al. [20] identified a potential asthma disease module containing 441 genes, including 91 seeds. A large number (162) of pathways had at least half of their genes in the putative disease module. However, a strategy involving prioritizing genes on the basis of their asthma relevance, using gene expression and GWAS information, enabled a ranking of pathways and subsequent identification of pathways not traditionally associated with asthma. The plausibility of one of these (the GAB1 signalosome) was shown experimentally. The authors also explored the differential expression of the GAB1 gene in other immune-related diseases. This study illustrates the use of both pathways and networks with other supporting evidence (e.g. gene expression, GWAS) to contextualize disease genes and provide new avenues for understanding disease mechanisms. It highlights the need for knowledge representations that can capture multiple levels of information, such as the conditions in which a gene may be up-regulated.

Table 2. List of abbreviations

Abbreviation | Expanded name | Comment
API | Application programming interface | A set of interfaces, tools and functions used for creation of applications
BEL | Biological Expression Language | A curation language for structured capturing of data about biological systems and experiments
BELIEF | BEL Information Extraction workFlow system | A framework for automated parsing of information into BEL format
BioPAX | Biological Pathways Exchange language | A standard that formally defines biological pathway conceptualization in OWL format
BMP | Big Mechanism Project | A project by US Department of Defense for automated construction of mechanistic cancer models from scientific literature
COPD | Chronic Obstructive Pulmonary Disease | A lung disease characterized by impeded breathing and phenotypically similar to asthma
Cypher | | A query language for Neo4j graph database
DBMS | Database management system | A software providing capabilities to mediate storage, query and manipulation of data
EBI | European Bioinformatics Institute |
EFO | Experimental Factor Ontology |
GO | Gene Ontology | One of the most widely used ontologies for representing functions of biological entities
GWAS | Genome-wide association study | An observational study that relates germ-line variations between individuals to phenotypes
IR | Information Retrieval | A process of extraction of relevant information from some wider superset
Kappa | | A rule-based declarative language used mostly in molecular biology domain
Neo4j | | Graph-based database solution
OBO | Open Biomedical Ontologies initiative | OBO format is an alternative language for authoring ontologies, mainly used in biological sciences
OWL | Web Ontology Language | Currently most widely used language for authoring ontologies
PD | Process Description language | An extension of SBGN standard with additional capabilities to represent temporal aspects of biological processes
PPI | Protein–protein interaction |
RDF | Resource Description Framework | Core data exchange format for the Semantic Web
SBGN | Systems Biology Graphical Notation | A standard for graphical representation for biological and biochemical domain
SBML | Systems Biology Markup Language | An XML-based format for storing and exchanging biological models
SBO | Systems Biology Ontology |
SNP | Single Nucleotide Polymorphism | A single-base variant or mutation in a genomic sequence
SQL | Structured Query Language | Family of similar query languages used for data management in relational databases
TM | Text Mining | A process of extraction of structured data from free text
TMO | Translational Medicine Ontology |
URL | Uniform Resource Locator | Web address resolving to a resource on the Web
XML | Extensible Markup Language | Popular meta-language for document markup

One approach to representation of the complex networks that are associated with disease is to decompose the information into layers, with each layer representing information at a different level of biological organization. Radonjic and co-workers [21, 22] used network-based methods to represent and integrate information at three levels, namely, differentially expressed biological processes, transcription factors whose target proteins are differentially expressed, and physiological information such as energy intake and weight. They explored adaptation of white adipose tissue to a high-fat diet, a biological system that may give clues as to which molecular processes are involved in obesity and type 2 diabetes. Gustafsson et al. [17] discussed the concept of multilayer disease modules as a possible framework to generate predictive markers that capture multiple types of disease-related information (PPI, symptoms, etc.). The integration of the layers and the inclusion of additional contextual information associated with the nodes and relationships of the network (e.g. cell-type specificities) are important for mechanistic insight into disease. Network-based data fusion approaches to integration of multiple types of biological information relevant to disease are described in [23, 24], and these offer the potential to predict new disease-associated genes as well as to explore disease–disease relationships.

The network-based integration of different types of information can suggest underlying mechanisms of disease. Huan and co-workers [25] performed an integrated analysis to explore mechanisms of blood pressure regulation. They used weighted gene co-expression network analysis (WGCNA) [26] to identify co-expressed modules that showed correlation to blood pressure measurements, integrated SNP data as well as PPI networks, and used Bayesian analysis to identify key drivers of the biological process, one of which was selected for further experimental study. The identification of associations between diseases can suggest a common mechanistic basis [27] and can be used predictively [28]. Molecular network approaches to representing disease landscapes offer particular views rather than a unified framework. In some ways, they are similar to information overlay methods in data analytics. Points of correspondence between the different views (networks) need to be found to get an integrated picture that will be useful for suggesting mechanism.

To illustrate these contextualization strategies, 22 genes from the differential expression signature (derived from the GSE43696 study, as outlined in Supplementary Methods) were mapped to the protein association network from the STRING database. First, the results were explored using the Web-based interactive visualization offered by the STRING database (Figure 2). As proteins realize their function through complex sets of interactions, implications of gene expression changes can often be better understood by assessing their impact on the wider interactome. The top panel shows disease signature genes as well as 50 genes from the first shell identified as most relevant to this input. The analysis suggested several high-confidence modules, most of which contained prominent asthma-associated signalling pathways (HIF [29], TGF-beta [30], cytokine–cytokine receptor interaction [31] and circadian clock-related [32]). Several high-degree nodes connecting these multiple modules are also known to be implicated in asthma: tyrosine kinase FYN is involved in inflammation [33], IL1B has been shown to be genetically associated with asthma susceptibility [34] and the node with the highest degree (SMAD4) is involved in airway smooth muscle hyperplasia [35].

Given that the number of identified human PPIs is now in the millions, visual exploration is typically limited by the sheer amount of relevant information that can be usefully displayed. Approaches like network diffusion and graph random walks offer a mathematically robust way to select the most relevant nodes, even in large graphs. These methods are effective for disease analysis, as disease-related genes are commonly located close to each other in the network. In this example, we have demonstrated the applicability of this key principle and also identified the 10 most relevant (closest to the signature) genes by the diffusion state distance metric [36]. This analysis highlighted the potential relevance of the PTEN gene, known to be prominently involved in airway hyperresponsiveness and inflammation [37], as well as a small module with multiple members of the PI3K/Akt pathway that promotes asthma-related airway remodelling [38].
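A random-walk distance of the kind described above can be sketched on a toy graph. The version below is a truncated variant in which each node is profiled by the expected number of visits made by a walk of a fixed number of steps, and two nodes are compared by the L1 distance between their profiles. The step count, the choice to count the starting position as a visit, the toy graph and the function names are all assumptions for illustration; this is not the analysis pipeline used for Figure 2.

```python
def visit_profile(adjacency, start, steps):
    """Expected visits to every node by a random walk of `steps` steps.

    The starting position (step 0) is counted as a visit. Every node is
    assumed to have at least one neighbour.
    """
    prob = {node: 0.0 for node in adjacency}
    prob[start] = 1.0
    visits = dict(prob)
    for _ in range(steps):
        nxt = {node: 0.0 for node in adjacency}
        for node, p in prob.items():
            if p == 0.0:
                continue
            share = p / len(adjacency[node])
            for neighbour in adjacency[node]:
                nxt[neighbour] += share
        prob = nxt
        for node in adjacency:
            visits[node] += prob[node]
    return visits


def dsd(adjacency, u, v, steps=5):
    """L1 distance between the visit profiles of u and v."""
    pu = visit_profile(adjacency, u, steps)
    pv = visit_profile(adjacency, v, steps)
    return sum(abs(pu[n] - pv[n]) for n in pu)


def closest(adjacency, seeds, top=10, steps=5):
    """Rank non-seed nodes by mean distance to the seed set."""
    seeds = set(seeds)
    scores = {
        node: sum(dsd(adjacency, node, s, steps) for s in seeds) / len(seeds)
        for node in adjacency
        if node not in seeds
    }
    return sorted(scores, key=scores.get)[:top]


# Hypothetical graph: two triangles joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
nearest = closest(toy, ["a", "b"], top=1)
```

With a and b as seeds, the nearest non-seed node is c, the third member of their triangle, which mirrors the principle that related genes sit close together in the network. Recomputing profiles for every pair is quadratic in work and is only practical for small graphs; a matrix-based formulation would be needed at interactome scale.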

Contextualization by pathway centric representations

Understanding the mechanism of disease often starts with identification of the aberrant molecular pathways that may be associated with the disease. Molecular pathway models aim to capture sequences of actions or events of molecular ‘actors’. Some examples of commonly modelled events include biochemical reactions, state changes and movements between different cellular compartments. Mapping known disease genes to known pathways is a well-established methodology for contextualization, as pathways capture an intermediate level of organization between molecular entities and phenotype. The identification of hallmark pathways believed to be associated with disease conditions has been used to create pathway centric disease maps [39], which are useful for visual exploration and for the overlay of data from, for example, gene expression studies. Typically, such approaches involve identification of disease-related pathways by careful manual curation of the literature. AlzPathway [40] is a collection of signalling pathways in Alzheimer’s disease. The map is represented in the Systems Biology Graphical Notation (SBGN) [41] process description (PD) language and was also made available in other formats. The resource also offers a Web interface that allows users to review and comment on annotations, thus facilitating community curation. Another example of a pathway-based disease map is the Parkinson’s disease map [42], which builds on an understanding of the hallmark metabolic, regulatory and signalling pathways of the disease [39]. The Parkinson’s disease map can be accessed using the Minerva Web server, which has advanced functionality for visualization (using the Google Maps API) that facilitates manual exploration [43]. The Atlas of Cancer Signalling Networks [44] uses a similar approach, where several types of biological entities, like phenotype, ion and drug, can be represented on a zoomable map in SBGN format. Clicking on an entity brings up a box with its description and relevant annotations.

Ideally, a computational framework for contextualizing disease-implicated genes should allow annotation by members of the research community, and some selected frameworks offer this important functionality. WikiPathways [45] contains a set of open, evolving, updateable disease-associated pathways. PathWhiz [46] is a Web-based tool that allows users to draw pathways so as to include contextual information such as cells, tissues and organs. The pathway maps are designed to be visually appealing to facilitate manual exploration. PathWhiz includes metabolic and signalling pathways as well as a number of disease-associated pathways. Recently, the PathWay Collage tool [47] has been developed, which allows the visualization of fragments of pathways (defined by the user so as to reduce the complexity of full pathway representations) onto which omics data can be overlaid.

Figure 2. Analysis of the severe asthma differential expression gene signature in the context of the STRING protein association network. (A) Fifty genes in the first shell of the 22-gene signature: (a–c) high-degree module interconnectivity genes, (1) calmodulins and HIF-1 signalling pathway, (2) TGF-beta signalling pathway, (3) cytokine–cytokine receptor interaction pathway and (4) circadian rhythm-related genes. (B) Diffusion state distance of gene sets; this measure is derived from similarity of random walk profiles for each pair of nodes. In grey: random pairs of nodes. Distances between members of asthma signatures to each other are shown in blue (the severe asthma versus DisGeNET category shows distances between members of the two sets). Finally, the 10 closest genes to the differential expression signature set are shown in green. (C) Ten closest genes to the severe asthma signature (ILK, FLT3, PTEN, YWHAB, YWHAQ, YWHAE, MATK, YWHAZ, SFN and YWHAG), visualized in the STRING database viewer.

At present, there are two leading exchange standards adopted for representing biochemical pathway networks. The first one is Systems Biology Markup Language (SBML) [48], which was primarily intended to support development and sharing of quantitative biochemical models. Therefore, SBML has been designed to be generic, with the main building blocks being species (quantifiable physical entities), processes (which define how entities are manipulated) and compartments (which provide low-level context for processes and entities). As the SBML format aims to be generic, further subcategories of these concepts are not defined as part of the standard, and instead, particular emphasis is placed on enabling mathematical definition of the model (e.g. in the form of ordinary differential equations). Some of the necessary biological underpinnings are added via a related project, the SBGN [41] standard, which defines how different SBML entities can be represented graphically. In SBML, fine-grained annotation or specification of complex contexts is possible through special fields set aside for user data. However, as these fields are not bound by the standard, they may not be supported by all of the tools and therefore generally would not be used in evaluation of the model.
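The three building blocks named above (species, processes and compartments) and their link to an ordinary differential equation can be illustrated with a small sketch. The classes below are hypothetical and are neither SBML syntax nor the API of any SBML library; the single first-order conversion of A into B, the rate constant and the forward Euler integrator are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Species:
    name: str
    compartment: str  # low-level context only; not used by the integrator
    amount: float


@dataclass
class Process:
    reactant: str
    product: str
    rate_constant: float  # assumed first-order mass-action constant


def simulate(species, processes, dt, steps):
    """Integrate d[reactant]/dt = -k * [reactant] with forward Euler."""
    state = {s.name: s.amount for s in species}
    for _ in range(steps):
        delta = {name: 0.0 for name in state}
        for p in processes:
            flux = p.rate_constant * state[p.reactant] * dt
            delta[p.reactant] -= flux
            delta[p.product] += flux
        for name in state:
            state[name] += delta[name]
    return state


# Hypothetical model: A is converted to B inside one compartment.
model_species = [Species("A", "cytosol", 10.0), Species("B", "cytosol", 0.0)]
model_processes = [Process("A", "B", 0.5)]
final = simulate(model_species, model_processes, dt=0.01, steps=1000)
```

After ten time units almost all of A has been converted to B while the total amount is conserved. The sketch also shows why rate constants matter: without them the structure of the model can be drawn but not simulated.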

The SBGN PD language [41] was designed to give an unambiguous representation of mechanism of action within a biological pathway that can be interpreted visually to facilitate manual exploration. The PD language can represent all the steps in, for example, a reaction, showing how a biological entity changes in a temporal manner. However, fully exploiting the PD representation for applications like numerical modelling would require additional data associated with component steps (such as rate constants), which often are not readily available.

Another prominent data exchange standard is BioPAX [49], which was primarily designed for unambiguous sharing of biochemical pathway data and therefore offers a much wider selection of domain-specific terms to characterize and categorize both interactors and interactions. As Web Ontology Language (OWL) forms the core of this standard, all of the terms are organized hierarchically, and relationships between them are formally defined. The BioPAX standard aims to facilitate data sharing and integration via Semantic Web technologies and Resource Description Framework (RDF). If an RDF representation is used, the format can be easily extended to incorporate complex context information; however, BioPAX currently does not aim to offer the means of defining experimental or biomedical context for pathways, as by design other standards/ontologies can be brought in to model them.

Both SBML and BioPAX are predicated on the developments of the past decade, when several key technologies and design principles were introduced that greatly influenced how biomedical data-sharing standards are implemented. One of these developments was increasing adoption of formal ontologies that allowed development of well-structured controlled vocabularies for different domains and definition of cross-mappings between them. For example, the SBML standard allows annotation of models with Systems Biology Ontology (SBO) terms, which are also used in SBGN and can be mapped on to BioPAX via SBPAX [50]. Therefore, in principle, it is possible to integrate models across all of these standards on a qualitative level. On a syntactic level, different formalisms can be integrated by introducing support for a low-level common language like RDF, which is representation-agnostic and allows all types of data to be linked via common identifiers. Both of these solutions enable modern standards to be designed in a modular way, where different specialized representations can be developed by different communities of domain experts and then combined together, or even extended for the needs of a particular applied project.

In recent years, community efforts have been increasingly devoted to improving inter-compatibility across all main biomedical standards. These efforts include development of converters, controlled vocabulary mappings and better support of different standards by both data providers and analytical tool developers. Greater compatibility also means that smaller models (e.g. from individual studies) can now be meaningfully shared and combined. One prominent resource that aims to facilitate this process is NDEx [51]. NDEx allows researchers to share their models in a variety of different formats, which can then be used directly by using the NDEx website like a data warehouse or integrated into a common representation by using the NDEx Cytoscape [52] plug-in.

Pathway-based representations enable relevant methods to incorporate the directionality and interaction type (e.g. up- or down-regulation), and in this way, they can capture causal associations underlying different disease mechanisms. To illustrate this application, we have used the ‘Tied Diffusion through Interacting Events’ method [53], which can identify significantly implicated pathways connecting two sets of genes (in this case, the DisGeNET asthma signature and our severe asthma signature). The diffusion model used by this method can take into account magnitude of effects, direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome found to be significant for these two sets was visualized in Cytoscape (Figure 3). It is possible to see that two interleukin genes (both involved in airway inflammation [54]) were found to be particularly critical, with the lower panel showing the key associations involved.

Contextualization by knowledge graph representations

There are many specialist sources of disease-related information, such as databases of mutations and their consequences. In addition, there is vast information in unstructured formats in the scientific literature, and much of it is contextual, for example, statements that describe correlations between gene products, or between small molecules and gene products, that are observed in particular cell types. Information in the literature may also describe weak associations only seen in certain conditions, or for which there may be limited evidence, that could nevertheless still be useful for hypothesis generation.

The Biological Expression Language (BEL) [55] represents statements in biology, typically correlative and causative statements, as well as the context in which these statements apply. A biological statement is represented as a triple (subject, predicate and object), where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object. A BEL term can be the abundance of a biological entity, or it can be a biological process, such as one described in the Gene Ontology (GO) [56], or even a given disease condition. Each entity is described in an existing namespace (e.g. from databases like ChEBI [57]) or in particular user-defined namespace(s). There is a fixed set of causal, correlative and other relationships that link these entities. The BEL compiler checks that the statements are syntactically correct and carries out the integration and alignment of the data by establishing equivalences between identical concepts.
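The triple structure described above, with namespaced terms, a fixed relationship vocabulary and attached context, can be mirrored in a small data structure. This is a sketch only: the class and function names are hypothetical, the namespace "EX" and the identifiers are placeholders, the relationship set is an illustrative subset, and none of this reproduces BEL syntax or the behaviour of the BEL compiler.

```python
from dataclasses import dataclass, field

# Illustrative subset of relationship types; the real vocabulary is larger.
RELATIONSHIPS = {
    "increases", "decreases", "positiveCorrelation", "negativeCorrelation",
}


@dataclass(frozen=True)
class Term:
    function: str    # e.g. "abundance", "process", "pathology"
    namespace: str   # where the identifier is defined
    identifier: str


@dataclass
class Statement:
    subject: Term
    predicate: str
    obj: Term
    context: dict = field(default_factory=dict)

    def is_valid(self):
        """Accept only predicates from the fixed relationship set."""
        return self.predicate in RELATIONSHIPS


def in_context(statements, key, value):
    """Keep statements annotated with the given context value."""
    return [s for s in statements if s.context.get(key) == value]


# Placeholder statements, for illustration only.
statements = [
    Statement(Term("abundance", "EX", "geneA"), "increases",
              Term("process", "EX", "processP"), {"tissue": "lung"}),
    Statement(Term("abundance", "EX", "geneB"), "decreases",
              Term("process", "EX", "processP"), {"tissue": "liver"}),
    Statement(Term("abundance", "EX", "geneC"), "bindsTo",
              Term("abundance", "EX", "geneA")),
]
valid = [s for s in statements if s.is_valid()]
lung_only = in_context(valid, "tissue", "lung")
```

The third statement is rejected because its predicate is outside the assumed vocabulary, and the context filter then keeps only the statement annotated as holding in lung tissue.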

The knowledge captured by BEL statements can be represented as a graph. The object can be a BEL term or a BEL statement. Each statement can be associated with metadata that contextualizes it, for example, qualifying it to be true only in specific tissues. The BEL knowledge network can facilitate data interpretation using a reverse causal reasoning approach [58]. For example, gene expression data can be mapped to smaller subnetworks that represent cause and effect models. The extent to which the models can explain the measurements of differentially expressed genes can then be assessed and potentially yield insight into the mechanism.
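The assessment of how well a cause and effect model explains expression measurements can be illustrated with a simple concordance count. This is a deliberately simplified sketch, assuming signed predictions and signed observations; the function name and the toy values are hypothetical, and the published reverse causal reasoning approach [58] uses formal statistics rather than raw counts.

```python
def concordance(predicted, observed):
    """Count observations that agree or disagree with a causal model.

    predicted: gene -> +1/-1, the expected change if the upstream cause is active
    observed:  gene -> +1/-1/0, the measured change (0 means not significant)
    Genes without a significant measurement are ignored.
    """
    correct = contradictory = 0
    for gene, sign in predicted.items():
        measured = observed.get(gene, 0)
        if measured == 0:
            continue
        if measured == sign:
            correct += 1
        else:
            contradictory += 1
    return correct, contradictory


# Hypothetical model and measurements, for illustration only.
model = {"g1": +1, "g2": +1, "g3": -1, "g4": -1}
data = {"g1": +1, "g2": -1, "g3": -1, "g5": +1}
score = concordance(model, data)
```

Here two measurements support the model and one contradicts it; g4 is unmeasured and g5 lies outside the model, so neither contributes.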

Hofmann-Apitius [5] posits that an understanding of disease mechanisms involves integrating information at multiple biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer’s disease [59–62]. They used the BEL language to represent information from the literature together with relevant experimental data sets to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge ‘grey zone’, which includes both established and emerging connections between entities. This approach can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, TM will be increasingly important for supporting this process [63]. Existing examples include the BEL Information Extraction workFlow (BELIEF) system and BELSmile [64]. These frameworks work by using automated TM to detect BEL concepts and relationships and then allowing the curators to explore and annotate their biological contexts to create final high-quality networks.

Figure 3. Significant pathways identified using the Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and severe differential expression signature were considered in the context of Reactome regulatory pathway network. In both panels, colour intensity indicates weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of ‘component’ type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible forms: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.
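The structure of a nanopublication (one assertion, its provenance and metadata about the publication itself) can be sketched as a small data structure. This is a minimal in-memory illustration and not an RDF serialization; the `http://example.org/` namespace, the field names such as `derived_from` and all identifiers are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """A single subject-predicate-object statement."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Nanopublication:
    """An assertion bundled with provenance and publication metadata."""
    assertion: Assertion
    provenance: dict = field(default_factory=dict)        # how the assertion arose
    publication_info: dict = field(default_factory=dict)  # who published it, and when


def assertions_from(nanopubs, source):
    """Trace provenance: return every assertion derived from a given source."""
    return [n.assertion for n in nanopubs
            if n.provenance.get("derived_from") == source]


EX = "http://example.org/"  # hypothetical namespace for all identifiers below

nanopubs = [
    Nanopublication(
        Assertion(EX + "gene/G1", EX + "vocab/associatedWith", EX + "disease/D1"),
        provenance={"derived_from": EX + "study/S1", "method": "manual curation"},
        publication_info={"creator": EX + "person/curator1", "created": "2018-01-01"},
    ),
    Nanopublication(
        Assertion(EX + "gene/G2", EX + "vocab/associatedWith", EX + "disease/D1"),
        provenance={"derived_from": EX + "study/S2", "method": "text mining"},
        publication_info={"creator": EX + "person/curator2", "created": "2018-01-02"},
    ),
]

traced = assertions_from(nanopubs, EX + "study/S1")
```

Because provenance is attached to each individual assertion rather than to a whole data set, it can be queried at the level of a single statement.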

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called ‘COPD knowledge base’ [67–70]. The authors of COPD knowledge base do acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The Malacards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and Malacards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries could be challenging to realize in relational databases because of the cost of computing statements across different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in the cases where links only become apparent once multiple sources have been integrated, but may not be immediately obvious because of the sheer amounts of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is Neo4J DBMS. Neo4J is a Java-based solution, which also offers its own graph query language (Cypher), which is conceptually similar to SQL.
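To make the notion of a traversal-type query concrete, the following sketch walks a small typed graph held in memory and reports every node of a requested type reachable within a fixed number of hops, together with the path that links it to the start node. It is a plain breadth-first search in Python, not Cypher and not the Neo4j API; the node identifiers, node types and relationship names are hypothetical.

```python
from collections import deque

# Hypothetical toy graph: node -> type, and typed edges between nodes.
node_types = {
    "gene:G1": "Gene", "gene:G2": "Gene",
    "pathway:P1": "Pathway", "disease:D1": "Disease", "compound:C1": "Compound",
}
edges = [
    ("gene:G1", "PARTICIPATES_IN", "pathway:P1"),
    ("gene:G2", "PARTICIPATES_IN", "pathway:P1"),
    ("gene:G2", "ASSOCIATED_WITH", "disease:D1"),
    ("compound:C1", "TARGETS", "gene:G2"),
]

# Build an undirected adjacency list so that edges can be walked both ways.
adjacency = {}
for source, _, target in edges:
    adjacency.setdefault(source, []).append(target)
    adjacency.setdefault(target, []).append(source)


def traverse(start, target_type, max_hops):
    """Breadth-first traversal returning {node: path} for every node of
    target_type that lies within max_hops of start."""
    found = {}
    seen = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and node_types[node] == target_type:
            found[node] = path
        if len(path) - 1 == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return found


diseases_3_hops = traverse("gene:G1", "Disease", max_hops=3)
diseases_2_hops = traverse("gene:G1", "Disease", max_hops=2)
```

In this toy graph the link between `gene:G1` and `disease:D1` only exists through a shared pathway and a second gene, the kind of indirect relationship that becomes visible once several sources are integrated.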

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where a subject and object are entities with a unique identifier, and predicate defines the type of a relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity and that can, in turn, link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
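The three-part statement structure can be illustrated with a set of plain Python tuples and a wildcard pattern match. This is a sketch of the data model only and does not use an RDF library; the `rdf:type` and `rdfs:label` identifiers are the standard W3C ones, whereas the `http://example.org/` entities and the `associatedWith` predicate are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/"  # hypothetical namespace

# Each statement is a (subject, predicate, object) triple.
triples = {
    (EX + "gene/G1", RDF_TYPE, EX + "vocab/Gene"),
    (EX + "gene/G1", EX + "vocab/associatedWith", EX + "disease/D1"),
    (EX + "disease/D1", RDF_TYPE, EX + "vocab/Disease"),
    (EX + "disease/D1", RDFS_LABEL, "example disease"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def follow(triples, start, predicate):
    """Follow one predicate from start and collect labels of linked entities."""
    labels = []
    for _, _, target in match(triples, s=start, p=predicate):
        labels += [o for _, _, o in match(triples, s=target, p=RDFS_LABEL)]
    return labels


linked_labels = follow(triples, EX + "gene/G1", EX + "vocab/associatedWith")
```

Because subjects and objects share the same identifier space, the object of one statement can serve as the subject of the next, which is what allows statements from independent sources to be chained.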

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, usually an additional set of standards will be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include EBI RDF platform [83] (covers such important databases as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include Experimental Factor Ontology (EFO) [91], SBO [92], Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used a Neo4j-based Hetionet v1.0 resource [95] that integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of query-specific degree of drug nodes (i.e. number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
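The ranking step (query-specific degree of drug nodes) amounts to counting, for each drug, the distinct proteins it is linked to among the rows returned by a query. A minimal sketch is given below; the rows are hypothetical placeholders and are not results from Hetionet, and the function name `rank_by_degree` is introduced here for illustration only.

```python
from collections import defaultdict


def rank_by_degree(rows, top_n=3):
    """Rank drugs by the number of distinct returned proteins they link to.

    rows is an iterable of (drug, protein) pairs, as might be returned by a
    graph query; ties are broken alphabetically by drug identifier.
    """
    targets = defaultdict(set)
    for drug, protein in rows:
        targets[drug].add(protein)   # a set ignores duplicate result rows
    ranked = sorted(targets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(drug, len(proteins)) for drug, proteins in ranked[:top_n]]


# Hypothetical query output, including one duplicated row.
rows = [
    ("drug:A", "protein:P1"), ("drug:A", "protein:P2"), ("drug:A", "protein:P3"),
    ("drug:B", "protein:P1"), ("drug:B", "protein:P2"),
    ("drug:C", "protein:P3"),
    ("drug:A", "protein:P1"),
    ("drug:D", "protein:P2"),
]

top = rank_by_degree(rows)
```

Note that this degree is specific to the query: it reflects only links to the proteins returned, not the total number of targets recorded for a drug in the database.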

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data would typically require both different levels of preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are construction of co-expression networks [98, 99] from transcriptomic data and sequence homology graphs [100, 101]. However, currently, manual interpretation and publishing of inferred models as free text in scientific papers remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases. (B) Query to explore the connections between niclosamide drug, asthma and differential expression signature linked via relevant proteins and pathways.

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In its simplest form, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiled data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
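In its most basic form, a co-occurrence network can be built by counting how often two concepts are mentioned in the same document and keeping the pairs that exceed a threshold. The sketch below assumes that concept recognition has already been done, so that each document is reduced to a set of concept names; the documents, the concept names and the `min_count` threshold are hypothetical, and resources such as those cited above apply more elaborate statistical scoring than a raw count.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(documents, min_count=2):
    """Return {(concept_a, concept_b): count} for concept pairs that are
    mentioned together in at least min_count documents."""
    counts = Counter()
    for concepts in documents:
        # Sort so that each unordered pair always has the same key.
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical documents, each reduced to the concepts recognised in it.
documents = [
    {"miR-X", "cancer-A"},
    {"miR-X", "cancer-A", "gene-Y"},
    {"gene-Y", "cancer-B"},
]

network = cooccurrence_network(documents)
```

Each retained pair becomes an edge whose weight is the number of supporting documents, which is why such networks capture association but say nothing about the type or direction of the relationship.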

Extraction of information from text is contingent on unambiguous identification of entities and relationship types, from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].
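The dependency between workflow stages can be sketched with two deliberately naive components: a dictionary-based entity recognizer that maps synonyms to shared identifiers, and a relation extractor that consumes its output. Real systems use far more sophisticated methods for both steps; the lexicon, the identifiers, the trigger verbs and the rule that the first entity is the subject are all assumptions made for this example.

```python
import re

# Hypothetical lexicon: alternative names resolve to one shared identifier.
LEXICON = {
    "protein a": "ID:0001",
    "prot-a": "ID:0001",
    "protein b": "ID:0002",
}
TRIGGER = re.compile(r"\b(inhibits|activates|binds)\b")


def recognise_entities(sentence):
    """Stage 1: return identifiers of known entities in order of appearance."""
    text = sentence.lower()
    hits = []
    for name, identifier in LEXICON.items():
        position = text.find(name)
        if position >= 0:
            hits.append((position, identifier))
    return [identifier for _, identifier in sorted(hits)]


def extract_relation(sentence):
    """Stage 2: build a (subject, relation, object) statement from the
    entities found by stage 1 and a trigger verb, if both are present."""
    entities = recognise_entities(sentence)
    trigger = TRIGGER.search(sentence.lower())
    if len(entities) == 2 and trigger:
        return (entities[0], trigger.group(1), entities[1])
    return None


def run_workflow(sentences):
    """Chain the stages over a list of sentences, dropping empty results."""
    relations = (extract_relation(s) for s in sentences)
    return [r for r in relations if r is not None]


relations = run_workflow([
    "Prot-A inhibits protein B in lung tissue.",
    "Protein A was measured in all samples.",
])
```

The second stage can only be as good as the first: a synonym missing from the lexicon silently removes a statement, which is one reason why shared vocabularies and interchange formats matter for such pipelines.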

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge included for the first time a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and got the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
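For readers less familiar with the metric, the F-measure is conventionally the harmonic mean of precision and recall. The sketch below assumes this balanced form and exact matching of whole statements against a gold standard; the challenge itself may have applied additional evaluation levels, and the statement labels used here are hypothetical.

```python
def f_measure(predicted, gold):
    """Return (precision, recall, F-measure) for a set of predicted
    statements compared with a gold-standard set, using exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: four gold statements, three predictions, two correct.
gold = {"s1", "s2", "s3", "s4"}
predicted = {"s1", "s2", "s9"}
precision, recall, f1 = f_measure(predicted, gold)
```

Because a complete statement only counts as correct when every part of it matches, scores for full-statement extraction tend to be much lower than those for recognizing the individual entities or relationships.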

It is clear that at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question, therefore, becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). It is obvious that although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former would go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly more complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. The future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced from the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and, ultimately, clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.

17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology: a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78 Hoksza D Jelınek J Using Neo4j for mining protein graphs acase study In Proceedings of the 26th International Workshop

on Database and Expert Systems Applications (DEXA) 2015IEEE Valencia Spain 2015 230ndash4

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novère N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting, Boston, USA. Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.


99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer's disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.


development of disease. One prominent early example of this was from the work of Barabási and co-workers [15], which represented disease–gene relationships as a bipartite graph, from which monopartite disease–disease and gene–gene graphs could be obtained (the human–disease network and the disease–gene network), giving information about disease comorbidities and thereby contextualising known disease genes. By combining the disease–gene network with known PPI networks, it was demonstrated that the corresponding disease-associated proteins for a given disease have a greater tendency to interact in a PPI network. Likewise, proximity of genes in the network can be a strong indicator of their involvement in similar processes. Genes associated with a given disease tend to be closer together in the interactome, and overlapping disease neighbourhoods are related to greater comorbidity [16].
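The projection from a bipartite disease–gene graph to monopartite graphs can be sketched in a few lines. The following is only a minimal illustration, not the procedure used in [15]: the disease and gene names are hypothetical placeholders, and edge weights are assumed here to be simple counts of shared partners.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy disease-gene associations (placeholders, not curated data).
disease_genes = {
    "disease_A": {"G1", "G2"},
    "disease_B": {"G2", "G3"},
    "disease_C": {"G4"},
}

def project_diseases(assoc):
    """Link two diseases when they share a gene; weight = number of shared genes."""
    edges = {}
    for d1, d2 in combinations(sorted(assoc), 2):
        shared = assoc[d1] & assoc[d2]
        if shared:
            edges[(d1, d2)] = len(shared)
    return edges

def project_genes(assoc):
    """Link two genes when they share a disease; weight = number of shared diseases."""
    counts = defaultdict(int)
    for genes in assoc.values():
        for g1, g2 in combinations(sorted(genes), 2):
            counts[(g1, g2)] += 1
    return dict(counts)

disease_network = project_diseases(disease_genes)
gene_network = project_genes(disease_genes)
```

In this toy case, disease_A and disease_B become neighbours because they share G2, while disease_C remains isolated.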

The network can also be analysed to identify notable topological features, many of which have been associated with biologically meaningful properties. One such feature is the modular organization of the network. Modules are groups of nodes that are more interconnected to other nodes within the region than to other nodes outside the region [17]. In PPI networks, modules have been shown to correspond to functionally similar groups of proteins, which in some cases may be relevant to disease processes. One such example was identified by [18], who, using genes from seven stress-related diseases together with the STRING network, identified a common subnetwork that may provide insights into comorbidity of these diseases. An early example is the network model for asthma [19] that combined information from gene co-expression, taken from five public gene expression studies, with PPI networks and information from annotation sources. The resulting 'global map' was then used to explore common themes in asthma pathogenesis.

A powerful example of exploiting the network neighbourhood of known disease genes to get mechanistic insight is given in [20]. Using a set of 129 asthma-implicated ('seed') genes that map to the human interactome, together with a novel algorithm [16] for module detection, Sharma et al. [20] identified a potential asthma disease module containing 441 genes, including 91 seeds. A large number (162) of pathways had at least half of their genes in the putative disease module. However, a strategy

Table 2. List of abbreviations

Abbreviation | Expanded name | Comment
API | Application programming interface | A set of interfaces, tools and functions used for creation of applications
BEL | Biological Expression Language | A curation language for structured capturing of data about biological systems and experiments
BELIEF | BEL Information Extraction workFlow system | A framework for automated parsing of information into BEL format
BioPAX | Biological Pathways Exchange language | A standard that formally defines biological pathway conceptualization in OWL format
BMP | Big Mechanism Project | A project by US Department of Defense for automated construction of mechanistic cancer models from scientific literature
COPD | Chronic Obstructive Pulmonary Disease | A lung disease characterized by impeded breathing and phenotypically similar to asthma
Cypher | | A query language for Neo4j graph database
DBMS | Database management system | A software providing capabilities to mediate storage, query and manipulation of data
EBI | European Bioinformatics Institute |
EFO | Experimental Factor Ontology |
GO | Gene Ontology | One of the most widely used ontologies for representing functions of biological entities
GWAS | Genome-wide association study | An observational study that relates germ-line variations between individuals to phenotypes
IR | Information Retrieval | A process of extraction of relevant information from some wider superset
Kappa | | A rule-based declarative language used mostly in molecular biology domain
Neo4j | | Graph-based database solution
OBO | Open Biomedical Ontologies initiative | OBO format is an alternative language for authoring ontologies, mainly used in biological sciences
OWL | Web Ontology Language | Currently most widely used language for authoring ontologies
PD | Process Description language | An extension of SBGN standard with additional capabilities to represent temporal aspects of biological processes
PPI | Protein-protein interaction |
RDF | Resource Description Framework | Core data exchange format for the Semantic Web
SBGN | Systems Biology Graphical Notation | A standard for graphical representation for biological and biochemical domain
SBML | Systems Biology Markup Language | An XML-based format for storing and exchanging biological models
SBO | Systems Biology Ontology |
SNP | Single Nucleotide Polymorphism | A single-base variant or mutation in a genomic sequence
SQL | Structured Query Language | Family of similar query languages used for data management in relational databases
TM | Text Mining | A process of extraction of structured data from free text
TMO | Translational Medicine Ontology |
URL | Uniform Resource Locator | Web address resolving to a resource on the Web
XML | Extensible Markup Language | Popular meta-language for document markup


involving prioritizing genes on the basis of their asthma relevance, using gene expression and GWAS information, enabled a ranking of pathways and subsequent identification of pathways not traditionally associated with asthma. The plausibility of one of these (the GAB1 signalosome) was shown experimentally. The authors also explored the differential expression of the GAB1 gene in other immune-related diseases. This study illustrates the use of both pathways and networks with other supporting evidence (e.g. gene expression, GWAS) to contextualize disease genes and provide new avenues for understanding disease mechanisms. It highlights the need for knowledge representations that can capture multiple levels of information, such as the conditions in which a gene may be up-regulated.

One approach to representation of the complex networks that are associated with disease is to decompose the information into layers, with each layer representing information at different levels of biological organization. Radonjic and co-workers [21, 22] used network-based methods to represent and integrate information at three levels, namely, differentially expressed biological processes, transcription factors whose target proteins are differentially expressed and physiological information, such as energy intake and weight. They explored adaptation of white adipose tissue to a high-fat diet, a biological system that may give clues as to which molecular processes are involved in obesity and type 2 diabetes. Gustafsson et al. [17] discussed the concept of multilayer disease modules as a possible framework to generate predictive markers that capture multiple types of disease-related information (PPI, symptoms, etc.). The integration of the layers and the inclusion of additional contextual information associated with the nodes and relationships of the network (e.g. cell-type specificities) are important for mechanistic insight into disease. Network-based data fusion approaches to integration of multiple types of biological information relevant to disease are described by [23, 24], and these offer the potential to predict new disease-associated genes as well as to explore disease–disease relationships.

The network-based integration of different types of information can suggest underlying mechanisms of disease. Huan and co-workers [25] performed an integrated analysis to explore mechanisms of blood pressure regulation. They used weighted gene co-expression network analysis (WGCNA) [26] to identify co-expressed modules that showed correlation to blood pressure measurements, integrated SNP data as well as PPI networks and used Bayesian analysis to identify key drivers of the biological process, one of which was selected for further experimental study. The identification of associations between diseases can suggest a common mechanistic basis [27] and can be used predictively [28]. Molecular network approaches to representing disease landscapes offer particular views rather than a unified framework. In some ways, they are similar to information overlay methods in data analytics. Points of correspondence between the different views (networks) need to be found to get an integrated picture that will be useful for suggesting mechanism.

To illustrate these contextualization strategies, 22 genes from the differential expression signature (derived from the GSE43696 study, as outlined in Supplementary Methods) were mapped to the protein association network from the STRING database. First, the results were explored using the Web-based interactive visualization offered by the STRING database (Figure 2). As proteins realize their function through complex sets of interactions, implications of gene expression changes can often be better understood by assessing their impact on the wider interactome. The top panel shows disease signature genes as well as 50 genes from the first shell identified as most relevant to this input. The analysis has suggested several high-confidence modules, most of which contained prominent asthma-associated signalling pathways (HIF [29], TGF-beta [30], cytokine–cytokine receptor interaction [31] and circadian clock-related [32]). Several high-degree nodes connecting these multiple modules are also known to be implicated in asthma: tyrosine kinase FYN is involved in inflammation [33], IL1B has been shown to be genetically associated with asthma susceptibility [34] and the node with the highest degree (SMAD4) is involved in airway smooth muscle hyperplasia [35].

Given that the number of identified human PPIs is now in the millions, visual exploration is typically limited by the sheer amount of relevant information that can be usefully displayed. Approaches like network diffusion and graph random walks offer a mathematically robust way to select the most relevant nodes, even in large graphs. These methods are effective for disease analysis, as disease-related genes are commonly located close to each other in the network. In this example, we have demonstrated the applicability of this key principle and also identified the 10 most relevant (closest to the signature) genes by the diffusion state distance metric [36]. This analysis highlighted the potential relevance of the PTEN gene, known to be prominently involved in airway hyperresponsiveness and inflammation [37], as well as a small module with multiple members of the PI3K/Akt pathway that promotes asthma-related airway remodelling [38].
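The general idea of ranking nodes by their random-walk proximity to a signature can be illustrated with a random walk with restart. This is a simpler relative of the diffusion state distance [36], not that metric itself; the toy network, the node names and the restart probability of 0.3 are assumptions made purely for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Approximate steady-state visiting probabilities by power iteration.

    adjacency: dict mapping each node to the set of its neighbours (undirected).
    seeds: set of nodes the walker jumps back to with probability `restart`.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iterations):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                continue
            # Spread the remaining probability mass evenly over the neighbours.
            share = (1.0 - restart) * p[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        p = nxt
    return p

# Toy undirected path network; all node names are hypothetical placeholders.
edges = [("SEED", "A"), ("A", "B"), ("B", "C")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

scores = random_walk_with_restart(adjacency, seeds={"SEED"})
ranked = sorted((n for n in scores if n != "SEED"), key=scores.get, reverse=True)
```

On this path graph, the non-seed nodes are ranked in order of their distance from the seed, which is the behaviour that makes such scores useful for selecting candidate genes near a signature.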

Contextualization by pathway centric representations

Understanding the mechanism of disease often starts with identification of the aberrant molecular pathways that may be associated with the disease. Molecular pathway models aim to capture sequences of actions or events of molecular 'actors'. Some examples of commonly modelled events include biochemical reactions, state changes and movements between different cellular compartments. Mapping known disease genes to known pathways is a well-established methodology for contextualization, as pathways capture an intermediate level of organization between molecular entities and phenotype. The identification of hallmark pathways believed to be associated with disease conditions has been used to create pathway centric disease maps [39], which are useful for visual exploration and for the overlay of data from, for example, gene expression studies. Typically, such approaches involved identification of disease-related pathways by careful manual curation of the literature. AlzPathway [40] is a collection of signalling pathways in Alzheimer's disease. The map is represented in the Systems Biology Graphical Notation (SBGN) [41] process description (PD) language and was also made available in other formats. The resource also offers a Web interface that allows users to review and comment on annotations, thus facilitating community curation. Another example of a pathway-based disease map is the Parkinson's disease map [42], which builds on an understanding of the hallmark metabolic, regulatory and signalling pathways of the disease [39]. The Parkinson's disease map can be accessed using the Minerva Web server, which has advanced functionality for visualization (using the Google Maps API) that facilitates manual exploration [43]. The Atlas of Cancer Signalling Networks [44] uses a similar approach, where several types of biological entities, like phenotype, ion and drug, can be represented on a zoomable map in SBGN format. Clicking on an entity brings up a box with its description and relevant annotations.

Ideally, a computational framework for contextualizing disease-implicated genes should allow annotation by members of


the research community, and some selected frameworks offer this important functionality. WikiPathways [45] contains a set of open, evolving, updateable disease-associated pathways. PathWhiz [46] is a Web-based tool that allows users to draw pathways so as to include contextual information, such as cells, tissues and organs. The pathway maps are designed to be visually appealing to facilitate manual exploration. PathWhiz includes metabolic, signalling as well as a number of disease-associated pathways. Recently, the PathWay Collage tool [47] has been developed that allows the visualization of fragments of pathways (defined by the user so as to reduce the complexity of full pathway representations) onto which omics data can be overlaid.

At present, there are two leading exchange standards adopted for representing biochemical pathway networks. The first one is Systems Biology Markup Language (SBML) [48], which was primarily intended to support development and sharing of quantitative biochemical models. Therefore, SBML has been designed to be generic, with the main building blocks being species (quantifiable physical entities), processes (which define how entities are manipulated) and compartments (provide low-level context for processes and entities). As SBML format aims

[Figure 2: panel A, STRING network of the signature genes and first-shell genes; panel B, 'Distances between members of gene signatures' (y-axis: diffusion state distance; groups: Random, DisGeNET signature, Severe Asthma signature, Severe Asthma vs DisGeNET); panel C, 'Ten closest genes to severe asthma gene expression signature' (ILK, FLT3, PTEN, YWHAB, YWHAQ, YWHAE, MATK, YWHAZ, SFN, YWHAG).]

Figure 2. Analysis of severe asthma differential expression gene signature in the context of STRING protein association network. (A) Fifty genes in the first shell of the 22-gene signature: (a–c) high-degree module interconnectivity genes, (1) calmodulins and HIF-1 signalling pathway, (2) TGF-beta signalling pathway, (3) cytokine–cytokine receptor interaction pathway and (4) circadian rhythm-related genes. (B) Diffusion state distance of gene sets; this measure is derived from similarity of random walk profiles for each pair of nodes. In grey: random pairs of nodes. Distances between members of asthma signatures to each other are shown in blue (severe asthma versus DisGeNET category shows distances between members of the two sets). Finally, the 10 closest genes to the differential expression signature set are shown in green. (C) Ten closest genes to the severe asthma signature visualized in the STRING database viewer.


to be generic, further subcategories of these concepts are not defined as part of the standard, and instead, particular emphasis is placed on enabling mathematical definition of the model (e.g. in the form of ordinary differential equations). Some of the necessary biological underpinnings are added via a related project, the SBGN [41] standard, which defines how different SBML entities can be represented graphically. In SBML, fine-grained annotation or specification of complex contexts is possible through special fields set aside for user data. However, as these fields are not bound by the standard, they may not be supported by all of the tools and, therefore, generally would not be used in evaluation of the model.
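The three SBML building blocks, and the way a model built from them is evaluated as ordinary differential equations, can be mimicked with a small sketch. This is not SBML syntax or any SBML library: the class names, the species, the compartment label and the rate constant are all hypothetical, and the model is integrated with a plain forward Euler step for brevity.

```python
from dataclasses import dataclass

@dataclass
class Species:
    name: str
    compartment: str   # low-level context, e.g. a cellular compartment
    amount: float      # quantifiable physical entity

@dataclass
class Reaction:
    reactant: str
    product: str
    rate_constant: float  # first-order mass-action constant (assumed value)

def simulate(species, reactions, dt=0.01, steps=1000):
    """Integrate d[reactant]/dt = -k*[reactant], d[product]/dt = +k*[reactant]."""
    amounts = {s.name: s.amount for s in species}
    for _ in range(steps):
        change = {name: 0.0 for name in amounts}
        for r in reactions:
            flux = r.rate_constant * amounts[r.reactant]
            change[r.reactant] -= flux * dt
            change[r.product] += flux * dt
        for name in amounts:
            amounts[name] += change[name]
    return amounts

model_species = [Species("A", "cytosol", 1.0), Species("B", "cytosol", 0.0)]
model_reactions = [Reaction("A", "B", rate_constant=0.5)]
final = simulate(model_species, model_reactions)
```

The compartment field plays no role in the arithmetic here, which mirrors the point made above: context that is not part of the mathematical definition is generally not used when the model is evaluated.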

The SBGN PD language [41] was designed to give an unambiguous representation of mechanism of action within a biological pathway that can be interpreted visually to facilitate manual exploration. The PD language can represent all the steps in, for example, a reaction, showing how a biological entity changes in a temporal manner. However, to fully exploit the PD representation for applications like numerical modelling would require additional data associated with component steps (such as rate constants), which often are not readily available.

Another prominent data exchange standard is BioPAX [49], which was primarily designed for unambiguous sharing of biochemical pathway data and, therefore, offers a much wider selection of domain-specific terms to characterize and categorize both interactors and interactions. As the Web Ontology Language (OWL) forms the core of this standard, all of the terms are organized hierarchically, and relationships between them are formally defined. The BioPAX standard aims to facilitate data sharing and integration via Semantic Web technologies and the Resource Description Framework (RDF). If an RDF representation is used, the format can be easily extended to incorporate complex context information; however, BioPAX currently does not aim to offer the means of defining experimental or biomedical context for pathways, as by design other standards/ontologies can be brought in to model them.

Both SBML and BioPAX are predicated on the developments of the past decade, when several key technologies and design principles were introduced that greatly influenced how biomedical data-sharing standards are implemented. One of these developments was increasing adoption of formal ontologies that allowed development of well-structured controlled vocabularies for different domains and definition of cross-mappings between them. For example, the SBML standard allows annotation of models with Systems Biology Ontology (SBO) terms, which are also used in SBGN and can be mapped on to BioPAX via SBPAX [50]. Therefore, in principle, it is possible to integrate models across all of these standards on a qualitative level. On a syntactic level, different formalisms can be integrated by introducing support for a low-level common language like RDF, which is representation-agnostic and allows all types of data to be linked via common identifiers. Both of these solutions enable modern standards to be designed in a modular way, where different specialized representations can be developed by different communities of domain experts and then combined together, or even extended for the needs of a particular applied project.
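Linking via common identifiers can be pictured with plain subject, predicate and object triples. The sketch below uses Python sets rather than an RDF library, and every identifier in it (the `ex:` prefix, the gene, the pathway and the disease) is a made-up placeholder; it only shows that two independently produced data sets merge without any schema negotiation once they agree on an identifier.

```python
# Triples from a hypothetical pathway resource.
pathway_triples = {
    ("ex:geneX", "ex:participatesIn", "ex:pathway1"),
}

# Triples from a hypothetical disease annotation resource.
annotation_triples = {
    ("ex:geneX", "ex:associatedWith", "ex:diseaseY"),
}

# Because both sources use the same identifier for the gene,
# integration reduces to a set union.
merged = pathway_triples | annotation_triples

def describe(subject, triples):
    """Return all (predicate, object) pairs known about a subject."""
    return sorted((p, o) for s, p, o in triples if s == subject)

gene_facts = describe("ex:geneX", merged)
```

In practice, the harder problem is agreeing on the identifiers and vocabularies in the first place, which is the role of the ontologies and cross-mappings described above.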

In recent years, community efforts have been increasingly devoted to improving inter-compatibility across all main biomedical standards. These efforts include development of converters, controlled vocabulary mappings and better support of different standards both by data providers and analytical tool developers. Greater compatibility also means that smaller models (e.g. from individual studies) can now be meaningfully shared and combined. One prominent resource that aims to facilitate this process is NDEx [51]. NDEx allows researchers to share their models in a variety of different formats, which can then be used directly by using the NDEx website like a data warehouse or integrated into a common representation by using the NDEx Cytoscape [52] plug-in.

Pathway-based representations enable relevant methods to incorporate the directionality and interaction type (e.g. up- or down-regulation), and in this way, they can capture causal associations underlying different disease mechanisms. To illustrate this application, we have used the 'Tied Diffusion through Interacting Events' method [53], which can identify significantly implicated pathways connecting two sets of genes (in this case, the DisGeNET asthma signature and our severe asthma signature). The diffusion model used by this method can take into account magnitude of effects, direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome found to be significant for these two sets was visualized in Cytoscape (Figure 3). It is possible to see that two interleukin genes (both involved in airway inflammation [54]) were found to be particularly critical, with the lower panel showing the key associations involved.

Contextualization by knowledge graph representations

There are many specialist sources of disease-related information, such as databases of mutations and their consequences. In addition, there is vast information in unstructured formats in the scientific literature, and much of it is contextual, for example, statements that describe correlations between gene products or between small molecules and gene products that are observed in particular cell types. Information in the literature may also describe weak associations only seen in certain conditions, or for which there may be limited evidence, that could nevertheless still be useful for hypothesis generation.

The Biological Expression Language (BEL) [55] represents statements in biology, typically correlative and causative statements, as well as the context in which these statements apply. A biological statement is represented as a triple (subject, predicate and object), where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object. A BEL term can be the abundance of a biological entity, or it can be a biological process, such as one described in the Gene Ontology (GO) [56], or even a given disease condition. Each entity is described in an existing namespace (e.g. from databases like ChEBI [57]) or in particular user-defined namespace(s). There is a fixed set of causal, correlative and other relationships that link these entities. The BEL compiler checks that the statements are syntactically correct and carries out the integration and alignment of the data by establishing equivalences between identical concepts.

The knowledge captured by BEL statements can be represented as a graph. The object can be a BEL term or a BEL statement. Each statement can be associated with metadata that contextualizes it, for example, qualifying it to be true only in specific tissues. The BEL knowledge network can facilitate data interpretation using a reverse causal reasoning approach [58]. For example, gene expression data can be mapped to smaller subnetworks that represent cause and effect models. The extent to which the models can explain the measurements of differentially expressed genes can then be assessed and potentially yield insight into the mechanism.
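A contextualized triple and a much-simplified form of causal scoring can be sketched as follows. This is an illustration only and not the reverse causal reasoning algorithm of [58]: the statements are BEL-like strings that are not validated by any BEL tooling, the `EX` namespace and all entity names are hypothetical, and the score is just a count of correct and contradictory predictions under the hypothesis that the upstream entity is increased.

```python
from typing import NamedTuple

class Statement(NamedTuple):
    subject: str
    relation: str   # restricted to "increases" or "decreases" in this sketch
    obj: str
    context: dict   # metadata qualifying where the statement holds

statements = [
    Statement("p(EX:TF1)", "increases", "r(EX:GENE_A)", {"tissue": "lung"}),
    Statement("p(EX:TF1)", "decreases", "r(EX:GENE_B)", {"tissue": "lung"}),
    Statement("p(EX:TF1)", "increases", "r(EX:GENE_C)", {"tissue": "lung"}),
    Statement("p(EX:TF1)", "increases", "r(EX:GENE_D)", {"tissue": "blood"}),
]

# Observed direction of differential expression: +1 up, -1 down (toy values).
observed = {
    "r(EX:GENE_A)": 1,
    "r(EX:GENE_B)": -1,
    "r(EX:GENE_C)": -1,
    "r(EX:GENE_D)": 1,
}

def score_hypothesis(upstream, statements, observed, tissue=None):
    """Count predictions that agree or disagree with the measurements,
    assuming the upstream entity is increased."""
    sign = {"increases": 1, "decreases": -1}
    correct = contradictory = 0
    for s in statements:
        if s.subject != upstream or s.obj not in observed:
            continue
        if tissue is not None and s.context.get("tissue") != tissue:
            continue
        if sign[s.relation] == observed[s.obj]:
            correct += 1
        else:
            contradictory += 1
    return correct, contradictory

lung_score = score_hypothesis("p(EX:TF1)", statements, observed, tissue="lung")
any_tissue_score = score_hypothesis("p(EX:TF1)", statements, observed)
```

Restricting the statements to one tissue changes which edges take part in the score, which is the practical value of attaching context metadata to each statement.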

Hofmann-Apitius [5] posits that an understanding of disease mechanisms involves integrating information at multiple biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer's disease [59–62]. They used the BEL language to represent information from the literature, together with relevant experimental data sets, to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge 'grey zone', which includes both established and emerging connections between entities. This approach can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, TM will be increasingly important for supporting this process [63]. Already existing examples include the BEL Information Extraction workFlow (BELIEF) system and BELSmile [64]. These frameworks work by using automated TM to detect BEL concepts and relationships and then allowing the curators to explore and annotate their biological contexts to create final high-quality networks.

Figure 3. Significant pathways identified using Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and severe differential expression signature were considered in the context of Reactome regulatory pathway network. In both panels, colour intensity indicates weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of 'component' type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible forms: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.
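The layout of a single nanopublication (an assertion plus its provenance and metadata) can be pictured as three small groups of triples that share one identifier. The sketch below is an assumption-laden illustration in plain Python, not an implementation of the standard: the class, the `pubinfo` field name, the `#part` graph naming and every http://example.org/ identifier are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


@dataclass
class Nanopublication:
    """One assertion bundled with its provenance and publication metadata."""
    uri: str
    assertion: List[Triple] = field(default_factory=list)
    provenance: List[Triple] = field(default_factory=list)
    pubinfo: List[Triple] = field(default_factory=list)

    def quads(self):
        """Flatten into (subject, predicate, object, graph) tuples."""
        for part in ("assertion", "provenance", "pubinfo"):
            graph = f"{self.uri}#{part}"
            for t in getattr(self, part):
                yield (t.subject, t.predicate, t.obj, graph)


EX = "http://example.org/"  # placeholder namespace, not a real resource
np1 = Nanopublication(
    uri=EX + "np1",
    assertion=[Triple(EX + "GeneA", EX + "associatedWith", EX + "DiseaseX")],
    provenance=[Triple(EX + "np1#assertion", EX + "derivedFrom", EX + "paper42")],
    pubinfo=[Triple(EX + "np1", EX + "createdBy", EX + "curator1")],
)
for quad in np1.quads():
    print(quad)
```

Because the provenance triples point at the assertion graph itself, provenance is tracked per statement rather than per database, which is the granularity the standard aims for.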

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called 'COPD knowledge base' [67–70]. The authors of COPD knowledge base do acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The MalaCards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs and tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and MalaCards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries could be challenging to realize in relational databases because of the cost of computing statements across different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in cases where links only become apparent once multiple sources have been integrated, but may not be immediately obvious because of the sheer amounts of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is Neo4J DBMS. Neo4J is a Java-based solution, which also offers its own graph query language (Cypher), which is conceptually similar to SQL.
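A traversal-type query of the kind discussed above can be illustrated without a database. The following sketch uses a tiny in-memory graph with typed nodes and edges and follows a fixed sequence of hops; it only mimics the idea and is not Neo4J or Cypher. The class, the node and edge type labels and the entity names are hypothetical.

```python
from collections import defaultdict


class TypedGraph:
    """Tiny in-memory property graph: typed nodes, typed undirected edges."""

    def __init__(self):
        self.node_type = {}
        self.adjacent = defaultdict(list)  # node -> [(edge_type, neighbour)]

    def add_edge(self, src, src_type, edge_type, dst, dst_type):
        self.node_type[src] = src_type
        self.node_type[dst] = dst_type
        self.adjacent[src].append((edge_type, dst))
        self.adjacent[dst].append((edge_type, src))

    def traverse(self, start, steps):
        """Follow a fixed sequence of (edge_type, node_type) hops from start."""
        paths = [[start]]
        for edge_type, node_type in steps:
            extended = []
            for path in paths:
                for etype, nxt in self.adjacent[path[-1]]:
                    if (etype == edge_type
                            and self.node_type[nxt] == node_type
                            and nxt not in path):
                        extended.append(path + [nxt])
            paths = extended
        return paths


# Hypothetical entities integrated from different (imaginary) sources.
g = TypedGraph()
g.add_edge("GeneA", "Gene", "PARTICIPATES_IN", "PathwayP", "Pathway")
g.add_edge("GeneB", "Gene", "PARTICIPATES_IN", "PathwayP", "Pathway")
g.add_edge("DrugD", "Drug", "TARGETS", "GeneB", "Gene")

# Gene -> shared pathway -> other gene -> drug targeting that gene.
paths = g.traverse("GeneA", [("PARTICIPATES_IN", "Pathway"),
                             ("PARTICIPATES_IN", "Gene"),
                             ("TARGETS", "Drug")])
print(paths)
```

Each hop only inspects the neighbours of the current node, which is why such queries stay cheap as more node types are added, whereas a relational layout would need one join per hop.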

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where a subject and object are entities with a unique identifier, and predicate defines the type of a relationship between them. Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity that can, in turn, link it to other URL-identified entities, and finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
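The three-part statement structure and the practice of following identifiers from one entity to the next can be sketched with plain tuples. This is a minimal illustration rather than a real RDF toolkit: the `match` and `describe` helpers are invented for this example, and all identifiers live under the placeholder http://example.org/ namespace.

```python
EX = "http://example.org/"  # placeholder namespace, not a real resource

# Each statement is a (subject, predicate, object) tuple of identifiers.
triples = {
    (EX + "GeneA", EX + "encodes", EX + "ProteinA"),
    (EX + "ProteinA", EX + "participatesIn", EX + "PathwayP"),
    (EX + "ProteinA", EX + "type", EX + "Protein"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def describe(store, entity):
    """All statements about an entity, as a resolvable URL might serve them."""
    return match(store, subject=entity)


# Follow a link from one URL-identified entity to the next.
protein = match(triples, subject=EX + "GeneA", predicate=EX + "encodes")[0][2]
print(describe(triples, protein))
```

Because every part of a statement is a global identifier, triples from independent sources can be pooled into one set without any schema alignment step; interpreting them still requires shared ontologies.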

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, usually an additional set of standards will be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include EBI RDF platform [83] (covers such important databases as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include Experimental Factor Ontology (EFO) [91], SBO [92], Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used a Neo4j-based Hetionet v1.0 resource [95] that integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of query-specific degree of drug nodes (i.e. number of links to returned proteins), top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism found to improve lung function in human and mouse models [97].
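The ranking by query-specific degree used above amounts to counting, for each drug, how many of its links land on proteins returned by the query. The sketch below shows that counting step only; it does not query Hetionet, and the drug and protein labels are placeholders, not the results reported in the text.

```python
from collections import Counter


def rank_drugs(drug_targets, returned_proteins):
    """Rank drugs by the number of links to proteins returned by a query."""
    degree = Counter()
    for drug, protein in set(drug_targets):  # ignore duplicate edges
        if protein in returned_proteins:
            degree[drug] += 1
    # Descending degree, then name, so the order is deterministic.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical query output: proteins on the retrieved pathways.
returned = {"P1", "P2", "P3"}
# Hypothetical drug -> protein target edges.
edges = [("DrugA", "P1"), ("DrugA", "P2"), ("DrugA", "P3"),
         ("DrugB", "P1"), ("DrugB", "P9"),
         ("DrugC", "P2"), ("DrugC", "P3")]

print(rank_drugs(edges, returned))
```

Note that the degree is specific to the query: DrugB's link to P9 is ignored because P9 was not among the returned proteins.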

Approaches for extraction of contextual knowledge

Ultimately, all types of models in biomedical domain originate from data collected during relevant experiments, and such data would typically require both different levels of preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are construction of co-expression networks [98, 99] from transcriptomic data and sequence homology graphs [100, 101]. However, currently manual interpretation and publishing of inferred models as free text in scientific papers remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.
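Co-expression networks are one of the few cases where raw measurements map onto a network almost mechanically. As a minimal sketch, assuming a hard threshold on the absolute Pearson correlation (one simple choice among many, and not the weighted scheme of packages such as WGCNA [26]), the following code links genes whose profiles are strongly correlated. The expression values are made up for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def coexpression_edges(expression, threshold=0.9):
    """Link every gene pair whose |correlation| reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Made-up expression profiles across five samples.
expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [5, 4, 3, 2, 1],
    "g4": [1, 3, 2, 5, 4],
}
print(sorted(coexpression_edges(expression)))
```

The threshold is the only interpretive decision in this pipeline, which is why such networks need little manual curation compared with literature-derived models.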

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases. (B) Query to explore the connections between niclosamide drug, asthma and differential expression signature linked via relevant proteins and pathways.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In its simplest form, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiled data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
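The simplest co-occurrence approach mentioned above can be sketched as counting how often two concepts are mentioned in the same paper and keeping pairs seen at least a minimum number of times. This sketch assumes concept recognition has already been done, uses a raw count instead of a statistical association score, and operates on invented documents; the `min_support` parameter is an assumption of the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(documents, min_support=2):
    """Count concept pairs mentioned together; keep the frequent ones."""
    counts = Counter()
    for concepts in documents:
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Invented documents, each reduced to the set of concepts it mentions.
docs = [
    {"miR-X", "cancer type 1"},
    {"miR-X", "cancer type 1", "gene G"},
    {"gene G", "disease D"},
    {"miR-X", "gene G"},
]
print(cooccurrence_network(docs))
```

Raw counts favour frequently mentioned concepts, so practical resources typically normalize against how often each concept appears on its own.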

Extraction of information from text is contingent on unambiguous identification of entities and relationship types from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases, for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on output of a relationships identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].
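The dependency between workflow stages can be made concrete with a toy two-stage pipeline in which relationship identification consumes the output of named entity recognition. Both stages are naive stand-ins (dictionary lookup and trigger-verb matching) chosen for brevity, not the methods used by any tool cited here; the lexicon, identifiers, verbs and sentences are all invented.

```python
import re


def recognize_entities(sentence, lexicon):
    """Stage 1: dictionary-based entity recognition.

    Returns (start, end, identifier) spans sorted by position.
    """
    spans = []
    lowered = sentence.lower()
    for name, identifier in lexicon.items():
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        for m in re.finditer(pattern, lowered):
            spans.append((m.start(), m.end(), identifier))
    return sorted(spans)


def extract_relations(sentence, spans, verbs):
    """Stage 2: look for a trigger verb between adjacent entity mentions."""
    relations = []
    for (_, end1, id1), (start2, _, id2) in zip(spans, spans[1:]):
        between = sentence[end1:start2].lower()
        for verb, relation in verbs.items():
            if re.search(r"\b" + re.escape(verb) + r"\b", between):
                relations.append((id1, relation, id2))
                break
    return relations


def run_pipeline(sentences, lexicon, verbs):
    """Chain the stages: stage 2 depends entirely on stage 1 output."""
    statements = []
    for sentence in sentences:
        spans = recognize_entities(sentence, lexicon)
        statements.extend(extract_relations(sentence, spans, verbs))
    return statements


# Invented lexicon, trigger verbs and sentences.
lexicon = {"GeneA": "EX:A", "GeneB": "EX:B", "GeneC": "EX:C"}
verbs = {"activates": "increases", "inhibits": "decreases"}
sentences = ["GeneA activates GeneB in toy cells.",
             "GeneC strongly inhibits GeneA."]
print(run_pipeline(sentences, lexicon, verbs))
```

The only contract between the stages is the span format, which is the kind of interface that shared interchange formats are meant to standardize across independently developed tools.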

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge for the first time included a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and got the highest F-measure of 20 for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28 on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
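For readers less familiar with the metric behind these scores, the F-measure is the weighted harmonic mean of precision and recall, and it is commonly reported on a 0 to 100 scale. The sketch below computes it for extracted statements compared against a gold standard by exact match. Exact matching is a simplifying assumption and is not necessarily how the challenge scored submissions; the statements are placeholders.

```python
def f_measure(predicted, gold, beta=1.0):
    """F-measure of predicted statements against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Placeholder statements: (subject, relationship, object).
gold = {("A", "increases", "B"), ("B", "decreases", "C"),
        ("C", "increases", "D"), ("A", "association", "D")}
predicted = {("A", "increases", "B"), ("B", "decreases", "C"),
             ("C", "decreases", "D")}

# Two of three predictions are correct; two of four gold statements are found.
score = f_measure(predicted, gold)
print(round(100 * score, 1))
```

Because a statement only counts when subject, relationship and object are all correct, errors in any low-level subtask propagate into the final score, which helps explain why scores for full statement extraction are low.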

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is 'trapped' in scientific text in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question, therefore, becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into 'curation' and 'mechanistic' types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). It is obvious that although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former would go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly more complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA's Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. The future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and ultimately clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies' in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.
2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.
3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.
4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.
5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.
6. Cohen PR. DARPA's big mechanism program. Phys Biol 2015;12(4):045008.
7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.
8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.
9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.
10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.
11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.
12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.
13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.
14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.
15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.
16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.

17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.
18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.
19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.
20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.
21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.
22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.
23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.
24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.
25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.
26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.
27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.
28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.
29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.
30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.
31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.
32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.
33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.
34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B, and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.
35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.
36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.
37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.
38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.
39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson's disease. FEBS J 2013;280(23):5981–93.
40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer's disease. BMC Syst Biol 2012;6(1):52.
41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.
42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson's disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.
43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.
44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.
45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.
46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.
47. Paley S, O'Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.
48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.
49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.
50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.
51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.
52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.
53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.
54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.
55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.
56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.
57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.
58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology: a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting. Boston, USA: Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.

99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing. Big Island, Hawaii: NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.

involving prioritizing genes on the basis of their asthma relevance, using gene expression and GWAS information, enabled a ranking of pathways and subsequent identification of pathways not traditionally associated with asthma. The plausibility of one of these (the GAB1 signalosome) was shown experimentally. The authors also explored the differential expression of the GAB1 gene in other immune-related diseases. This study illustrates the use of both pathways and networks with other supporting evidence (e.g. gene expression, GWAS) to contextualize disease genes and provide new avenues for understanding disease mechanisms. It highlights the need for knowledge representations that can capture multiple levels of information, such as the conditions in which a gene may be up-regulated.

One approach to representing the complex networks that are associated with disease is to decompose the information into layers, with each layer representing information at a different level of biological organization. Radonjic and co-workers [21, 22] used network-based methods to represent and integrate information at three levels, namely, differentially expressed biological processes, transcription factors whose target proteins are differentially expressed, and physiological information such as energy intake and weight. They explored the adaptation of white adipose tissue to a high-fat diet, a biological system that may give clues as to which molecular processes are involved in obesity and type 2 diabetes. Gustafsson et al. [17] discussed the concept of multilayer disease modules as a possible framework to generate predictive markers that capture multiple types of disease-related information (PPI, symptoms, etc.). The integration of the layers and the inclusion of additional contextual information associated with the nodes and relationships of the network (e.g. cell-type specificities) are important for mechanistic insight into disease. Network-based data fusion approaches to the integration of multiple types of biological information relevant to disease are described in [23, 24], and these offer the potential to predict new disease-associated genes as well as to explore disease–disease relationships.

The network-based integration of different types of information can suggest underlying mechanisms of disease. Huan and co-workers [25] performed an integrated analysis to explore mechanisms of blood pressure regulation. They used weighted gene co-expression network analysis (WGCNA) [26] to identify co-expressed modules that showed correlation to blood pressure measurements, integrated SNP data as well as PPI networks, and used Bayesian analysis to identify key drivers of the biological process, one of which was selected for further experimental study. The identification of associations between diseases can suggest a common mechanistic basis [27] and can be used predictively [28]. Molecular network approaches to representing disease landscapes offer particular views rather than a unified framework. In some ways, they are similar to information overlay methods in data analytics. Points of correspondence between the different views (networks) need to be found to get an integrated picture that will be useful for suggesting mechanism.
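As a rough illustration of the co-expression step mentioned above, the sketch below builds a soft-thresholded adjacency of the kind used in weighted co-expression analysis: the absolute Pearson correlation between two expression profiles, raised to a power. The expression values, the gene names and the exponent `beta=6` are assumptions chosen for illustration, and this is not the WGCNA package itself.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def soft_threshold_adjacency(expression, beta=6):
    """Unsigned weighted adjacency: |correlation| ** beta for each gene pair."""
    adjacency = {}
    for g in expression:
        for h in expression:
            if g != h:
                r = pearson(expression[g], expression[h])
                adjacency[(g, h)] = abs(r) ** beta
    return adjacency


# Hypothetical expression profiles across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
adjacency = soft_threshold_adjacency(expression)
```

Raising the correlation to a power keeps strongly correlated pairs (here geneA and geneB) close to 1 while pushing weak correlations towards 0, which is what lets modules of co-expressed genes stand out before they are related to a trait such as blood pressure.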

To illustrate these contextualization strategies, 22 genes from the differential expression signature (derived from the GSE43696 study, as outlined in Supplementary Methods) were mapped to the protein association network from the STRING database. First, the results were explored using the Web-based interactive visualization offered by the STRING database (Figure 2). As proteins realize their function through complex sets of interactions, the implications of gene expression changes can often be better understood by assessing their impact on the wider interactome. The top panel shows the disease signature genes as well as 50 genes from the first shell identified as most relevant to this input. The analysis suggested several high-confidence modules, most of which contained prominent asthma-associated signalling pathways (HIF [29], TGF-beta [30], cytokine–cytokine receptor interaction [31] and circadian clock-related [32]). Several high-degree nodes connecting these multiple modules are also known to be implicated in asthma: tyrosine kinase FYN is involved in inflammation [33], IL1B has been shown to be genetically associated with asthma susceptibility [34], and the node with the highest degree (SMAD4) is involved in airway smooth muscle hyperplasia [35].

Given that the number of identified human PPIs is now in the millions, visual exploration is typically limited by the sheer amount of relevant information that can be usefully displayed. Approaches like network diffusion and graph random walks offer a mathematically robust way to select the most relevant nodes even in large graphs. These methods are effective for disease analysis, as disease-related genes are commonly located close to each other in the network. In this example, we have demonstrated the applicability of this key principle and also identified the 10 most relevant (closest to the signature) genes by the diffusion state distance metric [36]. This analysis highlighted the potential relevance of the PTEN gene, known to be prominently involved in airway hyperresponsiveness and inflammation [37], as well as a small module with multiple members of the PI3K/Akt pathway that promotes asthma-related airway remodelling [38].
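The idea behind the diffusion state distance can be sketched in a few lines. In this simplified reading of the metric, each node is described by the expected number of visits that a short random walk starting there makes to every node, and two nodes are close when those visit profiles are similar (L1 distance). The toy graph, the node names and the walk length of five steps are assumptions for illustration; this is not the implementation used to produce Figure 2.

```python
def transition_matrix(graph):
    """Row-stochastic random-walk matrix for an undirected graph."""
    nodes = sorted(graph)
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0.0] * size for _ in range(size)]
    for node, neighbours in graph.items():
        for other in neighbours:
            matrix[index[node]][index[other]] = 1.0 / len(neighbours)
    return index, matrix


def expected_visits(matrix, steps):
    """Sum of matrix powers 0..steps: expected visits to each node per start."""
    size = len(matrix)
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    visits = [row[:] for row in power]
    for _ in range(steps):
        power = [
            [sum(power[i][k] * matrix[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)
        ]
        for i in range(size):
            for j in range(size):
                visits[i][j] += power[i][j]
    return visits


def diffusion_state_distance(visits, index, u, v):
    """L1 distance between the visit profiles of nodes u and v."""
    return sum(abs(a - b) for a, b in zip(visits[index[u]], visits[index[v]]))


# Two triangles (A, B, C) and (D, E, F) joined by the bridge C-D.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
index, matrix = transition_matrix(graph)
visits = expected_visits(matrix, steps=5)
near = diffusion_state_distance(visits, index, "A", "B")
far = diffusion_state_distance(visits, index, "A", "F")
```

Nodes within the same triangle end up with a smaller distance than nodes on opposite sides of the bridge, which mirrors the principle that genes in the same network neighbourhood are treated as closer to a signature.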

Contextualization by pathway centric representations

Understanding the mechanism of disease often starts with identification of the aberrant molecular pathways that may be associated with the disease. Molecular pathway models aim to capture sequences of actions or events of molecular ‘actors’. Some examples of commonly modelled events include biochemical reactions, state changes and movements between different cellular compartments. Mapping known disease genes to known pathways is a well-established methodology for contextualization, as pathways capture an intermediate level of organization between molecular entities and phenotype. The identification of hallmark pathways believed to be associated with disease conditions has been used to create pathway centric disease maps [39], which are useful for visual exploration and for the overlay of data from, for example, gene expression studies. Typically, such approaches involve identification of disease-related pathways by careful manual curation of the literature. AlzPathway [40] is a collection of signalling pathways in Alzheimer’s disease. The map is represented in the Systems Biology Graphical Notation (SBGN) [41] process description (PD) language and was also made available in other formats. The resource also offers a Web interface that allows users to review and comment on annotations, thus facilitating community curation. Another example of a pathway-based disease map is the Parkinson’s disease map [42], which builds on an understanding of the hallmark metabolic, regulatory and signalling pathways of the disease [39]. The Parkinson’s disease map can be accessed using the Minerva Web server, which has advanced functionality for visualization (using the Google Maps API) that facilitates manual exploration [43]. The Atlas of Cancer Signalling Networks [44] uses a similar approach, where several types of biological entities, like phenotype, ion and drug, can be represented on a zoomable map in SBGN format. Clicking on an entity brings up a box with its description and relevant annotations.

Ideally, a computational framework for contextualizing disease-implicated genes should allow annotation by members of the research community, and some selected frameworks offer this important functionality. WikiPathways [45] contains a set of open, evolving, updateable disease-associated pathways. PathWhiz [46] is a Web-based tool that allows users to draw pathways so as to include contextual information such as cells, tissues and organs. The pathway maps are designed to be visually appealing to facilitate manual exploration. PathWhiz includes metabolic and signalling pathways, as well as a number of disease-associated pathways. Recently, the PathWay Collage tool [47] has been developed, which allows the visualization of fragments of pathways (defined by the user so as to reduce the complexity of full pathway representations) onto which omics data can be overlaid.

[Figure 2 image. Panel B: ‘Distances between members of gene signatures’ (y-axis: diffusion state distance; groups: Random, DisGeNET signature, Severe Asthma signature, Severe Asthma vs DisGeNET). Panel C: ‘Ten closest genes to severe asthma gene expression signature’ (ILK, FLT3, PTEN, YWHAB, YWHAQ, YWHAE, MATK, YWHAZ, SFN, YWHAG).]

Figure 2. Analysis of the severe asthma differential expression gene signature in the context of the STRING protein association network. (A) Fifty genes in the first shell of the 22-gene signature: (a–c) high-degree module interconnectivity genes, (1) calmodulins and HIF-1 signalling pathway, (2) TGF-beta signalling pathway, (3) cytokine–cytokine receptor interaction pathway and (4) circadian rhythm-related genes. (B) Diffusion state distance of gene sets; this measure is derived from the similarity of random walk profiles for each pair of nodes. Random pairs of nodes are shown in grey. Distances between members of asthma signatures to each other are shown in blue (the severe asthma versus DisGeNET category shows distances between members of the two sets). Finally, the 10 closest genes to the differential expression signature set are shown in green. (C) Ten closest genes to the severe asthma signature, visualized in the STRING database viewer.

At present, there are two leading exchange standards adopted for representing biochemical pathway networks. The first one is the Systems Biology Markup Language (SBML) [48], which was primarily intended to support the development and sharing of quantitative biochemical models. Therefore, SBML has been designed to be generic, with the main building blocks being species (quantifiable physical entities), processes (which define how entities are manipulated) and compartments (which provide low-level context for processes and entities). As the SBML format aims to be generic, further subcategories of these concepts are not defined as part of the standard, and instead particular emphasis is placed on enabling mathematical definition of the model (e.g. in the form of ordinary differential equations). Some of the necessary biological underpinnings are added via a related project, the SBGN [41] standard, which defines how different SBML entities can be represented graphically. In SBML, fine-grained annotation or specification of complex contexts is possible through special fields set aside for user data. However, as these fields are not bound by the standard, they may not be supported by all of the tools and therefore generally would not be used in evaluation of the model.
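To make these building blocks concrete, the following sketch models species, compartments and a process in plain Python and integrates the resulting ordinary differential equation with a simple Euler scheme. The class names, the first-order mass-action rate law and the numerical values are assumptions for illustration; this is neither SBML syntax nor the libSBML API.

```python
from dataclasses import dataclass


@dataclass
class Species:
    name: str
    compartment: str  # low-level context, e.g. a cellular compartment
    amount: float


@dataclass
class Process:
    name: str
    reactant: str
    product: str
    rate_constant: float  # assumed first-order mass-action kinetics


def simulate(species, processes, dt, steps):
    """Euler integration of d[reactant]/dt = -k * [reactant] for each process."""
    state = {s.name: s.amount for s in species}
    for _ in range(steps):
        change = {name: 0.0 for name in state}
        for p in processes:
            flux = p.rate_constant * state[p.reactant] * dt
            change[p.reactant] -= flux
            change[p.product] += flux
        for name in state:
            state[name] += change[name]
    return state


# Hypothetical model: A is converted to B in the cytosol with k = 0.1.
species = [Species("A", "cytosol", 100.0), Species("B", "cytosol", 0.0)]
processes = [Process("conversion", "A", "B", 0.1)]
final = simulate(species, processes, dt=0.01, steps=1000)
```

After ten time units the amount of A is close to the analytical value 100·exp(−1), and the total amount of A and B is conserved. Note that the `compartment` field plays no role in the simulation here, which loosely mirrors the point that generic fields and annotations do not necessarily affect model evaluation.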

The SBGN PD language [41] was designed to give an unambiguous representation of the mechanism of action within a biological pathway that can be interpreted visually to facilitate manual exploration. The PD language can represent all the steps in, for example, a reaction, showing how a biological entity changes in a temporal manner. However, fully exploiting the PD representation for applications like numerical modelling would require additional data associated with component steps (such as rate constants), which are often not readily available.

Another prominent data exchange standard is BioPAX [49], which was primarily designed for unambiguous sharing of biochemical pathway data and therefore offers a much wider selection of domain-specific terms to characterize and categorize both interactors and interactions. As the Web Ontology Language (OWL) forms the core of this standard, all of the terms are organized hierarchically and the relationships between them are formally defined. The BioPAX standard aims to facilitate data sharing and integration via Semantic Web technologies and the Resource Description Framework (RDF). If an RDF representation is used, the format can be easily extended to incorporate complex context information; however, BioPAX currently does not aim to offer the means of defining experimental or biomedical context for pathways, as, by design, other standards/ontologies can be brought in to model them.

Both SBML and BioPAX are predicated on the developments of the past decade, when several key technologies and design principles were introduced that greatly influenced how biomedical data-sharing standards are implemented. One of these developments was the increasing adoption of formal ontologies, which allowed the development of well-structured controlled vocabularies for different domains and the definition of cross-mappings between them. For example, the SBML standard allows annotation of models with Systems Biology Ontology (SBO) terms, which are also used in SBGN and can be mapped on to BioPAX via SBPAX [50]. Therefore, in principle, it is possible to integrate models across all of these standards on a qualitative level. On a syntactic level, different formalisms can be integrated by introducing support for a low-level common language like RDF, which is representation-agnostic and allows all types of data to be linked via common identifiers. Both of these solutions enable modern standards to be designed in a modular way, where different specialized representations can be developed by different communities of domain experts and then combined together, or even extended for the needs of a particular applied project.

In recent years, community efforts have been increasingly devoted to improving inter-compatibility across all main biomedical standards. These efforts include the development of converters, controlled vocabulary mappings and better support of different standards both by data providers and analytical tool developers. Greater compatibility also means that smaller models (e.g. from individual studies) can now be meaningfully shared and combined. One prominent resource that aims to facilitate this process is NDEx [51]. NDEx allows researchers to share their models in a variety of different formats, which can then be used directly, by using the NDEx website like a data warehouse, or integrated into a common representation by using the NDEx Cytoscape [52] plug-in.

Pathway-based representations enable relevant methods to incorporate the directionality and interaction type (e.g. up- or down-regulation), and in this way they can capture causal associations underlying different disease mechanisms. To illustrate this application, we have used the ‘Tied Diffusion through Interacting Events’ method [53], which can identify significantly implicated pathways connecting two sets of genes (in this case, the DisGeNET asthma signature and our severe asthma signature). The diffusion model used by this method can take into account the magnitude of effects and the direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome found to be significant for these two sets was visualized in Cytoscape (Figure 3). It is possible to see that two interleukin genes (both involved in airway inflammation [54]) were found to be particularly critical, with the lower panel showing the key associations involved.
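The intuition of linking two gene sets by diffusion can be sketched as follows. This is a deliberately simplified stand-in, not the published method: it runs a random walk with restart from each set on a small undirected toy graph and scores every node by the smaller of the two resulting values, so that only nodes reachable from both sets score well. The graph, the node names and the restart probability of 0.5 are assumptions, and edge direction, interaction type and effect magnitude, which the method described above can use, are ignored here.

```python
def random_walk_with_restart(graph, seeds, restart=0.5, iterations=100):
    """Iterate p = (1 - restart) * W p + restart * p0 on an undirected graph."""
    start = {node: (1.0 / len(seeds) if node in seeds else 0.0) for node in graph}
    score = dict(start)
    for _ in range(iterations):
        updated = {}
        for node in graph:
            # Each neighbour passes on its score split evenly over its edges.
            incoming = sum(score[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - restart) * incoming + restart * start[node]
        score = updated
    return score


def linker_scores(graph, set_one, set_two):
    """Score nodes by the weaker of the two diffusion signals."""
    from_one = random_walk_with_restart(graph, set_one)
    from_two = random_walk_with_restart(graph, set_two)
    return {node: min(from_one[node], from_two[node]) for node in graph}


# Hypothetical graph: S and T are connected through a and b; x hangs off S.
graph = {
    "S": {"a", "x"},
    "a": {"S", "b"},
    "b": {"a", "T"},
    "T": {"b"},
    "x": {"S"},
}
scores = linker_scores(graph, {"S"}, {"T"})
```

In this toy case the intermediate nodes `a` and `b` receive higher linker scores than the dead-end node `x`, because `x` is close to only one of the two sets.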

Contextualization by knowledge graph representations

There are many specialist sources of disease-related information, such as databases of mutations and their consequences. In addition, there is vast information in unstructured formats in the scientific literature, and much of it is contextual, for example, statements that describe correlations between gene products, or between small molecules and gene products, that are observed in particular cell types. Information in the literature may also describe weak associations only seen in certain conditions, or for which there may be limited evidence, that could nevertheless still be useful for hypothesis generation.

The Biological Expression Language (BEL) [55] represents statements in biology, typically correlative and causative statements, as well as the context in which these statements apply. A biological statement is represented as a triple (subject, predicate and object), where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object. A BEL term can be the abundance of a biological entity, or it can be a biological process such as one described in the Gene Ontology (GO) [56], or even a given disease condition. Each entity is described in an existing namespace (e.g. from databases like ChEBI [57]) or in particular user-defined namespace(s). There is a fixed set of causal, correlative and other relationships that link these entities. The BEL compiler checks that the statements are syntactically correct and carries out the integration and alignment of the data by establishing equivalences between identical concepts.

The knowledge captured by BEL statements can be represented as a graph. The object can be a BEL term or a BEL statement. Each statement can be associated with metadata that contextualizes it, for example, qualifying it to be true only in specific tissues. The BEL knowledge network can facilitate data interpretation using a reverse causal reasoning approach [58]. For example, gene expression data can be mapped to smaller subnetworks that represent cause and effect models. The extent to which the models can explain the measurements of differentially expressed genes can then be assessed and potentially yield insight into the mechanism.
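A minimal sketch of these ideas is given below: statements are stored as subject, relation and object triples with a context annotation, and a hypothesised upstream regulator is scored by counting how many observed expression changes agree with its predicted effects. The statements, the `EX` namespace and the simple count of correct versus incorrect predictions are assumptions for illustration; the published reverse causal reasoning approach uses more elaborate statistics, and this is not the output of a BEL compiler.

```python
from collections import namedtuple

Statement = namedtuple("Statement", ["subject", "relation", "obj", "context"])

# Hypothetical BEL-style statements; "EX" is a made-up namespace.
statements = [
    Statement("p(EX:regulatorX)", "increases", "r(EX:gene1)", {"tissue": "lung"}),
    Statement("p(EX:regulatorX)", "increases", "r(EX:gene2)", {"tissue": "lung"}),
    Statement("p(EX:regulatorX)", "decreases", "r(EX:gene3)", {"tissue": "lung"}),
    Statement("p(EX:regulatorY)", "increases", "r(EX:gene3)", {"tissue": "liver"}),
]


def concordance(upstream, statements, observed, tissue):
    """Count predictions of `upstream` that agree or disagree with the data."""
    correct = incorrect = 0
    for s in statements:
        # Only use statements about this regulator that hold in this tissue.
        if s.subject != upstream or s.context.get("tissue") != tissue:
            continue
        if s.obj not in observed:
            continue
        predicted = 1 if s.relation == "increases" else -1
        if predicted == observed[s.obj]:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect


# Observed direction of change: +1 up-regulated, -1 down-regulated.
observed = {"r(EX:gene1)": 1, "r(EX:gene2)": -1, "r(EX:gene3)": -1}
result = concordance("p(EX:regulatorX)", statements, observed, tissue="lung")
```

Here two of the three predictions for the hypothetical regulator agree with the observed changes, while the statement annotated for a different tissue is ignored, which shows how context metadata can restrict the part of the knowledge graph used for interpretation.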

Hofmann-Apitius [5] posits that an understanding of disease mechanisms involves integrating information at multiple biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer’s disease [59–62]. They used the BEL language to represent information from the literature together with relevant experimental data sets to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge ‘grey zone’, which includes both established and emerging connections between entities. This approach can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, text mining (TM) will be increasingly important for supporting this process [63]. Already existing examples include the BEL Information Extraction workFlow (BELIEF) system and BelSmile [64]. These frameworks work by using automated TM to detect BEL concepts and relationships and then allowing the curators to explore and annotate their biological contexts to create final high-quality networks.

Figure 3. Significant pathways identified using the Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and the severe asthma differential expression signature were considered in the context of the Reactome regulatory pathway network. In both panels, colour intensity indicates the weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of ‘component’ type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible forms: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called the ‘COPD knowledge base’ [67–70]. The authors of the COPD knowledge base acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The MalaCards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and MalaCards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries can be challenging to realize in relational databases because of the cost of computing join statements across different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in cases where links only become apparent once multiple sources have been integrated, but may not be immediately obvious because of the sheer amount of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is the Neo4j DBMS. Neo4j is a Java-based solution that also offers its own graph query language (Cypher), which is conceptually similar to SQL.

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where the subject and object are entities with a unique identifier, and the predicate defines the type of relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity and that can, in turn, link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
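The subject, predicate and object structure described above can be illustrated with a minimal sketch in plain Python. This is not an RDF library and not the API of any resource discussed here; the `example.org` URLs, the entity names and the `match` helper are all hypothetical and chosen only to show how statements about URL-identified entities can be stored and queried by pattern.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace, used purely for illustration.
EX = "http://example.org/"

triples = [
    Triple(EX + "gene/GENE_A", EX + "relation/associatedWith", EX + "disease/DISEASE_X"),
    Triple(EX + "gene/GENE_B", EX + "relation/associatedWith", EX + "disease/DISEASE_X"),
    Triple(EX + "drug/DRUG_Y", EX + "relation/targets", EX + "gene/GENE_A"),
]


def match(store, subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which entities are stated to be associated with DISEASE_X?
genes_for_disease = [
    t.subject
    for t in match(triples,
                   predicate=EX + "relation/associatedWith",
                   obj=EX + "disease/DISEASE_X")
]
```

Because every entity is a globally unique identifier, statements from independent sources that mention the same URL can be merged into one store without any renaming step.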

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, usually an additional set of standards will be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include the EBI RDF platform [83] (which covers such important databases as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (which aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (which uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include Experimental Factor Ontology (EFO) [91], SBO [92], Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used the Neo4j-based Hetionet v1.0 resource [95], which integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, the severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of the query-specific degree of drug nodes (i.e. the number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
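The ranking by query-specific degree can be sketched on a toy graph as follows. This is an illustration under stated assumptions only: the dictionaries stand in for typed edges of a knowledge network, all identifiers (`GENE_A`, `DRUG_P`, `DISEASE_X` and so on) are hypothetical, and the code reproduces neither Hetionet content nor the actual query used for Figure 4.

```python
from collections import Counter

# Toy typed edges with hypothetical identifiers (not Hetionet data).
gene_in_pathway = {
    "GENE_A": {"PATHWAY_1"},
    "GENE_B": {"PATHWAY_1", "PATHWAY_2"},
    "GENE_C": {"PATHWAY_3"},
}
disease_pathways = {"DISEASE_X": {"PATHWAY_1", "PATHWAY_2"}}
drug_targets = {
    "DRUG_P": {"GENE_A", "GENE_B"},
    "DRUG_Q": {"GENE_B"},
    "DRUG_R": {"GENE_C"},
}
signature = {"GENE_A", "GENE_B", "GENE_C"}


def rank_drugs(signature_genes, disease):
    """Rank drugs by the number of links to signature genes that sit in
    pathways associated with the disease (the query-specific degree)."""
    relevant = disease_pathways[disease]
    # Step 1: keep signature genes that share a pathway with the disease.
    hits = {g for g in signature_genes
            if gene_in_pathway.get(g, set()) & relevant}
    # Step 2: count, per drug, how many of those genes it targets.
    degree = Counter()
    for drug, targets in drug_targets.items():
        n_links = len(targets & hits)
        if n_links:
            degree[drug] = n_links
    return degree.most_common()


ranking = rank_drugs(signature, "DISEASE_X")
```

In a graph database, the same two-step traversal would typically be expressed as a single path pattern, which is what makes this class of query convenient there.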

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data would typically require both different levels of preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are construction of co-expression networks [98, 99] from transcriptomic data and sequence homology graphs [100, 101]. However, currently, manual interpretation and publishing of inferred models as free text in scientific papers remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.
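As an example of a representation that can be derived directly from raw data, a co-expression network can be built by thresholding pairwise correlations between expression profiles. The sketch below assumes Pearson correlation and a fixed cut-off of 0.9; both are illustrative choices rather than the method of any resource cited above, and the expression values and gene names are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical expression profiles: one value per sample.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.1, 6.2, 7.9],
    "GENE_C": [4.0, 1.0, 3.5, 2.0],
}


def coexpression_edges(expr, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


edges = coexpression_edges(expression)
```

Here only the pair GENE_A and GENE_B passes the cut-off, so the resulting network has a single edge.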

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using the Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases; (B) query to explore the connections between the niclosamide drug, asthma and the differential expression signature, linked via relevant proteins and pathways.

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In the simplest case, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiled data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
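A co-occurrence network of this kind can be sketched in a few lines. The example assumes that concept mentions have already been recognized in each paper and uses a raw count threshold of two papers; real resources typically apply more careful statistics, and the paper identifiers and concept names here are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept annotations, one set per paper.
papers = {
    "paper_1": {"GENE_A", "DISEASE_X"},
    "paper_2": {"GENE_A", "DISEASE_X", "GENE_B"},
    "paper_3": {"GENE_B", "DISEASE_Y"},
}


def cooccurrence_counts(docs):
    """Count, for every concept pair, the number of papers mentioning both."""
    counts = Counter()
    for concepts in docs.values():
        # Sorting gives each unordered pair one canonical key.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(papers)

# Keep only pairs seen together in at least two papers as network edges.
network = {pair: n for pair, n in counts.items() if n >= 2}
```

With these toy annotations, the only edge retained links GENE_A and DISEASE_X, which are mentioned together in two papers.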

Extraction of information from text is contingent on unambiguous identification of entities and relationship types, from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions, and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].
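The dependency between workflow stages can be made concrete with a deliberately simplified two-stage sketch, in which relationship extraction consumes the output of named entity recognition. Everything here is an assumption made for illustration: the lexicon, the identifiers, the trigger words and the rule that a sentence with exactly two entities and one trigger yields one statement. Real systems use far richer models at each stage.

```python
import re

# Hypothetical lexicon mapping surface forms to controlled identifiers.
LEXICON = {"gene_a": "GENE:0001", "drug_y": "DRUG:0002"}
# Hypothetical relationship trigger words.
TRIGGERS = {"inhibits", "activates"}


def tokenize(sentence):
    """Lower-case word tokens; underscores are kept inside tokens."""
    return re.findall(r"\w+", sentence.lower())


def recognize_entities(sentence):
    """Stage 1: dictionary-based named entity recognition."""
    return [(tok, LEXICON[tok]) for tok in tokenize(sentence) if tok in LEXICON]


def extract_relations(sentence):
    """Stage 2: build (subject, relation, object) statements from the
    stage 1 output, in order of mention."""
    entities = recognize_entities(sentence)
    triggers = [tok for tok in tokenize(sentence) if tok in TRIGGERS]
    if len(entities) == 2 and triggers:
        return [(entities[0][1], triggers[0], entities[1][1])]
    return []


relations = extract_relations("Drug_Y inhibits Gene_A in airway cells.")
```

Because the second stage only sees identifiers produced by the first, any error or naming mismatch in entity recognition propagates downstream, which is why shared identifiers and interchange formats matter for such workflows.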

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge for the first time included a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and achieved the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
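For readers less familiar with the metric, the F-measure combines precision and recall into one score. The sketch below assumes the balanced F1 variant; the counts are invented to give a score of 0.20 and are not the actual counts from the challenge.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 20 correct statements extracted, 60 incorrect
# statements extracted and 100 reference statements missed.
score = f_measure(true_pos=20, false_pos=60, false_neg=100)
```

With these counts, precision is 0.25 and recall is about 0.17, so a system can score around 0.20 even when a quarter of its extracted statements are correct, because most reference statements are still missed.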

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question, therefore, becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). It is obvious that, although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former would go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly more complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. The future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on the one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and ultimately clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.


17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.


59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C. (Online) http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C. (Online) https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C. (Online) http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting, Boston, USA. Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.


99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.


the research community, and some selected frameworks offer this important functionality. WikiPathways [45] contains a set of open, evolving, updateable, disease-associated pathways. PathWhiz [46] is a Web-based tool that allows users to draw pathways so as to include contextual information such as cells, tissues and organs. The pathway maps are designed to be visually appealing to facilitate manual exploration. PathWhiz includes metabolic, signalling as well as a number of disease-associated pathways. Recently, the PathWay Collage tool [47] has been developed that allows the visualization of fragments of pathways (defined by the user, so as to reduce the complexity of full pathway representations) onto which omics data can be overlaid.

Figure 2. Analysis of severe asthma differential expression gene signature in the context of STRING protein association network. (A) Fifty genes in the first shell of the 22-gene signature: (a–c) high-degree module interconnectivity genes, (1) calmodulins and HIF-1 signalling pathway, (2) TGF-beta signalling pathway, (3) cytokine–cytokine receptor interaction pathway and (4) circadian rhythm-related genes. (B) Diffusion state distance of gene sets; this measure is derived from similarity of random walk profiles for each pair of nodes. Random pairs of nodes are shown in grey. Distances between members of asthma signatures to each other are shown in blue (the severe asthma versus DisGeNET category shows distances between members of the two sets). Finally, the 10 closest genes to the differential expression signature set are shown in green. (C) Ten closest genes to the severe asthma signature, visualized in the STRING database viewer.

[Figure 2 image not reproduced. Panel B: y-axis, diffusion state distance; groups: Random, DisGeNET signature, Severe Asthma signature, Severe Asthma vs DisGeNET, and ten closest genes to the severe asthma gene expression signature. Panel C nodes: YWHAQ, FLT3, YWHAE, ILK, YWHAB, YWHAG, YWHAZ, SFN, PTEN, MATK.]

At present, there are two leading exchange standards adopted for representing biochemical pathway networks. The first one is Systems Biology Markup Language (SBML) [48], which was primarily intended to support development and sharing of quantitative biochemical models. Therefore, SBML has been designed to be generic, with the main building blocks being species (quantifiable physical entities), processes (which define how entities are manipulated) and compartments (which provide low-level context for processes and entities). As the SBML format aims to be generic, further subcategories of these concepts are not defined as part of the standard, and instead, particular emphasis is placed on enabling mathematical definition of the model (e.g. in the form of ordinary differential equations). Some of the necessary biological underpinnings are added via a related project, the SBGN [41] standard, which defines how different SBML entities can be represented graphically. In SBML, fine-grained annotation or specification of complex contexts is possible through special fields set aside for user data. However, as these fields are not bound by the standard, they may not be supported by all of the tools, and therefore generally would not be used in evaluation of the model.
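The three generic building blocks and the emphasis on a mathematical model definition can be illustrated with a minimal sketch. The class and function names below are hypothetical and do not correspond to the SBML schema or to any SBML library; mass-action kinetics and a forward Euler step are assumed purely for illustration, and the numerical values are toy values.

```python
from dataclasses import dataclass


@dataclass
class Compartment:
    id: str
    size: float  # volume in arbitrary units


@dataclass
class Species:
    id: str
    compartment: str  # id of the containing compartment
    amount: float


@dataclass
class Process:
    id: str
    reactants: dict  # species id -> stoichiometry
    products: dict   # species id -> stoichiometry
    rate_constant: float  # assumed mass-action constant


def mass_action_rate(process, species):
    """Rate = k * product of reactant amounts raised to their stoichiometry."""
    rate = process.rate_constant
    for sid, n in process.reactants.items():
        rate *= species[sid].amount ** n
    return rate


def euler_step(species, processes, dt):
    """Advance all species amounts by one forward Euler step of size dt."""
    deltas = {sid: 0.0 for sid in species}
    for p in processes:
        v = mass_action_rate(p, species)  # evaluated at the current state
        for sid, n in p.reactants.items():
            deltas[sid] -= n * v * dt
        for sid, n in p.products.items():
            deltas[sid] += n * v * dt
    for sid, d in deltas.items():
        species[sid].amount += d


# Toy model: A -> B inside a single compartment (illustrative values only).
cytosol = Compartment("cytosol", size=1.0)
species = {
    "A": Species("A", cytosol.id, amount=10.0),
    "B": Species("B", cytosol.id, amount=0.0),
}
conversion = Process("A_to_B", reactants={"A": 1}, products={"B": 1},
                     rate_constant=0.5)
euler_step(species, [conversion], dt=0.1)
```

Nothing in this sketch says what kind of molecule `A` is or in which tissue the conversion occurs, which mirrors the point that such context sits outside the generic core of the model.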

The SBGN PD language [41] was designed to give an unambiguous representation of mechanism of action within a biological pathway that can be interpreted visually to facilitate manual exploration. The PD language can represent all the steps in, for example, a reaction, showing how a biological entity changes in a temporal manner. However, fully exploiting the PD representation for applications like numerical modelling would require additional data associated with component steps (such as rate constants), which often are not readily available.

Another prominent data exchange standard is BioPAX [49], which was primarily designed for unambiguous sharing of biochemical pathway data and therefore offers a much wider selection of domain-specific terms to characterize and categorize both interactors and interactions. As the Web Ontology Language (OWL) forms the core of this standard, all of the terms are organized hierarchically, and relationships between them are formally defined. The BioPAX standard aims to facilitate data sharing and integration via Semantic Web technologies and the Resource Description Framework (RDF). If an RDF representation is used, the format can be easily extended to incorporate complex context information; however, BioPAX currently does not aim to offer the means of defining experimental or biomedical context for pathways, as by design, other standards/ontologies can be brought in to model them.

Both SBML and BioPAX are predicated on the developments of the past decade, when several key technologies and design principles were introduced that greatly influenced how biomedical data-sharing standards are implemented. One of these developments was increasing adoption of formal ontologies that allowed development of well-structured controlled vocabularies for different domains and definition of cross-mappings between them. For example, the SBML standard allows annotation of models with Systems Biology Ontology (SBO) terms, which are also used in SBGN and can be mapped on to BioPAX via SBPAX [50]. Therefore, in principle, it is possible to integrate models across all of these standards on a qualitative level. On a syntactic level, different formalisms can be integrated by introducing support for a low-level common language like RDF, which is representation-agnostic and allows all types of data to be linked via common identifiers. Both of these solutions enable modern standards to be designed in a modular way, where different specialized representations can be developed by different communities of domain experts and then combined together, or even extended for the needs of a particular applied project.

In recent years, community efforts have been increasingly devoted to improving inter-compatibility across all main biomedical standards. These efforts include development of converters, controlled vocabulary mappings and better support of different standards both by data providers and analytical tool developers. Greater compatibility also means that smaller models (e.g. from individual studies) can now be meaningfully shared and combined. One prominent resource that aims to facilitate this process is NDEx [51]. NDEx allows researchers to share their models in a variety of different formats, which can then be used directly by using the NDEx website like a data warehouse or integrated into a common representation by using the NDEx Cytoscape [52] plug-in.

Pathway-based representations enable relevant methods to incorporate the directionality and interaction type (e.g. up- or down-regulation), and in this way, they can capture causal associations underlying different disease mechanisms. To illustrate this application, we have used the ‘Tied Diffusion through Interacting Events’ method [53], which can identify significantly implicated pathways connecting two sets of genes (in this case, the DisGeNET asthma signature and our severe asthma signature). The diffusion model used by this method can take into account magnitude of effects, direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome found to be significant for these two sets was visualized in Cytoscape (Figure 3). It is possible to see that two interleukin genes (both involved in airway inflammation [54]) were found to be particularly critical, with the lower panel showing the key associations involved.

Contextualization by knowledge graph representations

There are many specialist sources of disease-related information, such as databases of mutations and their consequences. In addition, there is vast information in unstructured formats in the scientific literature, and much of it is contextual, for example, statements that describe correlations between gene products, or between small molecules and gene products, that are observed in particular cell types. Information in the literature may also describe weak associations only seen in certain conditions, or for which there may be limited evidence, that could nevertheless still be useful for hypothesis generation.

The Biological Expression Language (BEL) [55] represents statements in biology, typically correlative and causative statements, as well as the context in which these statements apply. A biological statement is represented as a triple (subject, predicate and object), where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object. A BEL term can be the abundance of a biological entity, or it can be a biological process such as one described in the Gene Ontology (GO) [56] or even a given disease condition. Each entity is described in an existing namespace (e.g. from databases like ChEBI [57]) or in particular user-defined namespace(s). There is a fixed set of causal, correlative and other relationships that link these entities. The BEL compiler checks that the statements are syntactically correct and carries out the integration and alignment of the data by establishing equivalences between identical concepts.
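The triple structure and the kind of checks described above can be sketched as follows. This is an illustration under stated assumptions, not the BEL compiler: the relationship list is a small subset chosen for the example, the set of accepted namespaces is hypothetical, and the example statement is a toy assertion rather than curated knowledge.

```python
from dataclasses import dataclass

# Reduced, illustrative vocabularies (assumptions made for this sketch only).
RELATIONSHIPS = {"increases", "decreases", "positiveCorrelation",
                 "negativeCorrelation", "association"}
KNOWN_NAMESPACES = {"HGNC", "CHEBI", "GO"}


@dataclass(frozen=True)
class Term:
    function: str   # e.g. "p" (protein abundance), "bp" (biological process)
    namespace: str  # where the entity name is defined
    name: str

    def render(self):
        # Names containing spaces are quoted in the rendered form.
        label = f'"{self.name}"' if " " in self.name else self.name
        return f"{self.function}({self.namespace}:{label})"


@dataclass(frozen=True)
class Statement:
    subject: Term
    predicate: str
    obj: Term

    def render(self):
        return f"{self.subject.render()} {self.predicate} {self.obj.render()}"


def validate(statement):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if statement.predicate not in RELATIONSHIPS:
        problems.append(f"unknown relationship: {statement.predicate}")
    for term in (statement.subject, statement.obj):
        if term.namespace not in KNOWN_NAMESPACES:
            problems.append(f"unknown namespace: {term.namespace}")
    return problems


# Toy statements for illustration only.
good = Statement(Term("p", "HGNC", "IL1B"), "increases",
                 Term("bp", "GO", "inflammatory response"))
bad = Statement(Term("p", "HGNC", "IL1B"), "bindsTo",
                Term("a", "FOO", "compound X"))
```

Here `validate(good)` returns an empty list, whereas `validate(bad)` reports both the unsupported relationship and the unrecognized namespace.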

The knowledge captured by BEL statements can be represented as a graph. The object can be a BEL term or a BEL statement. Each statement can be associated with metadata that contextualizes it, for example, qualifying it to be true only in specific tissues. The BEL knowledge network can facilitate data interpretation using a reverse causal reasoning approach [58]. For example, gene expression data can be mapped to smaller subnetworks that represent cause and effect models. The extent to which the models can explain the measurements of differentially expressed genes can then be assessed and potentially yield insight into the mechanism.
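The idea of assessing how well a cause and effect model explains observed expression changes can be illustrated with a simple concordance count. This is a deliberately reduced sketch and not the published reverse causal reasoning method [58]: any assessment of statistical significance is omitted, and the gene names and directions are placeholders.

```python
def concordance(predicted, observed):
    """Compare predicted and observed directions of change.

    predicted: gene -> +1 or -1, the change expected if the upstream
               cause in the model is active.
    observed:  gene -> +1 or -1 for differentially expressed genes.
    Returns (correct, incorrect, net score).
    """
    correct = sum(1 for g, d in predicted.items() if observed.get(g) == d)
    incorrect = sum(1 for g, d in predicted.items() if observed.get(g) == -d)
    return correct, incorrect, correct - incorrect


# Placeholder model and measurements (not real data).
model_predictions = {"GENE_A": +1, "GENE_B": -1, "GENE_C": +1, "GENE_D": +1}
measured_changes = {"GENE_A": +1, "GENE_B": -1, "GENE_C": -1, "GENE_E": +1}

result = concordance(model_predictions, measured_changes)
```

In this toy case two predictions agree with the measurements, one contradicts them and one gene was not differentially expressed, giving a net score of 1.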

Hofmann-Apitius [5] posits that an understanding of disease mechanisms involves integrating information at multiple biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer’s disease [59–62]. They used the BEL language to represent information from the literature together with relevant experimental data sets to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge ‘grey zone’, which includes both established and emerging connections between entities. This approach can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, TM will be increasingly important for supporting this process [63]. Already existing examples include the BEL Information Extraction workFlow (BELIEF) system and BELSmile [64]. These frameworks work by using automated TM to detect BEL concepts and relationships and then allowing the curators to explore and annotate their biological contexts to create final high-quality networks.

Figure 3. Significant pathways identified using the Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and the severe differential expression signature were considered in the context of the Reactome regulatory pathway network. In both panels, colour intensity indicates weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of ‘component’ type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible forms: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called ‘COPD knowledge base’ [67–70]. The authors of COPD knowledge base do acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The Malacards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and Malacards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries could be challenging to realize in relational databases because of the cost of computing statements across different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in the cases where links only become apparent once multiple sources have been integrated, but may not be immediately obvious because of the sheer amounts of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is the Neo4J DBMS. Neo4J is a Java-based solution, which also offers its own graph query language (Cypher), which is conceptually similar to SQL.
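A traversal-type query across entities of different types can be sketched with a small in-memory graph. This is plain Python rather than Cypher or any graph database API, and all node identifiers and edge types are hypothetical placeholders; the point is only to show a multi-step typed traversal that, in a relational schema, would correspond to a chain of joins across several tables.

```python
from collections import defaultdict

# Toy heterogeneous graph as (source, edge_type, target) triples.
EDGES = [
    ("gene:G1", "PARTICIPATES_IN", "pathway:P1"),
    ("gene:G2", "PARTICIPATES_IN", "pathway:P1"),
    ("gene:G2", "PARTICIPATES_IN", "pathway:P2"),
    ("gene:G3", "PARTICIPATES_IN", "pathway:P2"),
    ("drug:D1", "TARGETS", "gene:G3"),
    ("drug:D2", "TARGETS", "gene:G1"),
]


def build_index(edges):
    """Index every edge in both directions for constant-time neighbour lookup."""
    index = defaultdict(set)
    for source, edge_type, target in edges:
        index[(source, edge_type, "out")].add(target)
        index[(target, edge_type, "in")].add(source)
    return index


def traverse(index, start, steps):
    """Follow a sequence of (edge_type, direction) steps from a start node."""
    frontier = {start}
    for edge_type, direction in steps:
        frontier = {
            neighbour
            for node in frontier
            for neighbour in index.get((node, edge_type, direction), ())
        }
    return frontier


index = build_index(EDGES)
# Drugs that target any gene sharing a pathway with gene G2.
drugs = traverse(index, "gene:G2", [
    ("PARTICIPATES_IN", "out"),  # gene -> its pathways
    ("PARTICIPATES_IN", "in"),   # pathways -> all member genes
    ("TARGETS", "in"),           # genes -> drugs targeting them
])
```

Neither drug targets `gene:G2` directly; the link only appears after following the shared pathways, which is the kind of indirect relationship traversal queries are meant to surface.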

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where the subject and object are entities with a unique identifier, and the predicate defines the type of relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity and that can, in turn, link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
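The subject, predicate and object structure with URL identifiers can be illustrated without any RDF library. In the sketch below, the `http://example.org/` namespace and every identifier under it are placeholders invented for the example; only the `rdf:type` URI is the standard one. The `match` helper is a hypothetical stand-in for the triple-pattern matching that RDF query languages provide.

```python
# Placeholder namespace for this sketch; not a real vocabulary.
EX = "http://example.org/"
# Standard RDF predicate used to state the class of an entity.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

TRIPLES = {
    (EX + "gene/G1", RDF_TYPE, EX + "Gene"),
    (EX + "gene/G2", RDF_TYPE, EX + "Gene"),
    (EX + "disease/D1", RDF_TYPE, EX + "Disease"),
    (EX + "gene/G1", EX + "associatedWith", EX + "disease/D1"),
    (EX + "gene/G2", EX + "associatedWith", EX + "disease/D1"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which entities are stated to be associated with disease D1?
associated = [s for s, _, _ in match(TRIPLES, p=EX + "associatedWith",
                                     o=EX + "disease/D1")]
```

Because every entity is named by a globally unique identifier, statements from two independent sources about `gene/G1` can simply be pooled into one set, which is the syntactic integration role RDF plays.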

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, usually an additional set of standards will be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include the EBI RDF platform [83] (which covers such important databases as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (which aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (which uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include Experimental Factor Ontology (EFO) [91], SBO [92], Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used a Neo4j-based Hetionet v1.0 resource [95] that integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, the severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of query-specific degree of drug nodes (i.e. number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
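The query-specific degree used for ranking drugs can be sketched as a simple count over the links returned by a query. The drug and protein identifiers below are placeholders, not Hetionet results, and the helper name is hypothetical.

```python
from collections import Counter

# Placeholder (drug, protein) links as they might be returned by a query.
returned_links = [
    ("drug:X", "protein:P1"),
    ("drug:X", "protein:P2"),
    ("drug:X", "protein:P3"),
    ("drug:Y", "protein:P2"),
    ("drug:Y", "protein:P3"),
    ("drug:Z", "protein:P1"),
    ("drug:X", "protein:P1"),  # duplicate row, counted once
]


def rank_by_query_degree(links, top_n):
    """Rank drugs by their number of distinct links to returned proteins."""
    degree = Counter(drug for drug, _ in set(links))
    # Sort by descending degree, breaking ties alphabetically.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:top_n]


top_drugs = rank_by_query_degree(returned_links, top_n=2)
```

Note that this degree is relative to the subgraph returned by the query, so the same drug can rank differently under a different signature or pathway selection.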

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data would typically require different levels of both preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are construction of co-expression networks [98, 99] from transcriptomic data and sequence homology graphs [100, 101]. However, currently, manual interpretation and publishing of inferred models as free text in scientific papers remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using the Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases; (B) query to explore the connections between the niclosamide drug, asthma and the differential expression signature, linked via relevant proteins and pathways.

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In the simplest case, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiled data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
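A bare-bones version of the co-occurrence idea is sketched below. It assumes that concept recognition has already been done, so each document is reduced to the set of concepts it mentions; the documents and concept names are placeholders, and a raw count threshold stands in for the statistical scoring that real resources apply.

```python
from collections import Counter
from itertools import combinations

# Placeholder corpus: document id -> concepts recognized in that document.
documents = {
    "doc1": {"miR-A", "cancer-X"},
    "doc2": {"miR-A", "cancer-X", "gene-G"},
    "doc3": {"miR-B", "cancer-X"},
    "doc4": {"miR-A", "gene-G"},
}


def cooccurrence_network(docs, min_count=2):
    """Count concept pairs per document and keep pairs seen at least min_count times."""
    counts = Counter()
    for concepts in docs.values():
        # Sorting gives each unordered pair one canonical representation.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


network = cooccurrence_network(documents)
```

The resulting edges carry no direction or relationship type, which is one reason such networks are limited in accuracy compared with curated statements.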

Extraction of information from text is contingent on unambiguous identification of entities and relationship types, from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].
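The use of a controlled vocabulary to resolve alternative names to common identifiers can be sketched as a dictionary lookup. The identifiers and synonyms below are invented placeholders rather than entries from any real ontology, and exact case-insensitive matching is assumed, which is far simpler than what production entity-normalization tools do.

```python
# Placeholder vocabulary: identifier -> accepted names and synonyms.
VOCABULARY = {
    "EX:0001": {"gene alpha", "GA1", "alpha-1"},
    "EX:0002": {"gene beta", "GB"},
}


def build_lookup(vocabulary):
    """Invert the vocabulary into a case-insensitive name -> identifier map."""
    lookup = {}
    for identifier, names in vocabulary.items():
        for name in names:
            lookup[name.lower()] = identifier
    return lookup


def normalize(mentions, lookup):
    """Pair each text mention with its identifier, or None if unrecognized."""
    return [(mention, lookup.get(mention.lower())) for mention in mentions]


lookup = build_lookup(VOCABULARY)
resolved = normalize(["GA1", "Alpha-1", "gene gamma"], lookup)
```

Two different surface forms resolve to the same identifier here, so statements extracted from different papers can be attached to a single node in the resulting network.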

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge for the first time included a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and got the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, the designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
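For readers less familiar with the metric, the F-measure combines precision and recall of the extracted statements. The sketch below uses the standard definition; the counts are hypothetical and chosen only so that the arithmetic is easy to follow, they are not the challenge results, and the assumption that the balanced variant (beta = 1) applies is ours.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """F-measure from counts; beta=1 weights precision and recall equally."""
    predicted = true_pos + false_pos
    expected = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 statements extracted, 2 of them correct,
# out of 12 statements in the reference annotation.
score = f_measure(true_pos=2, false_pos=6, false_neg=10)
```

With these counts, precision is 0.25 and recall is about 0.17, so the score works out to 0.2, i.e. 20% when expressed as a percentage.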

It is clear that at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text, in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question therefore becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). Although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly more complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. Future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections, like pathway maps or global interactome networks, that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at the tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on the one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such a form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and ultimately clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking, eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.

17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Breborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO, Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting, Boston, USA. Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.

99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.

to be generic, further subcategories of these concepts are not defined as part of the standard, and instead, particular emphasis is placed on enabling mathematical definition of the model (e.g. in the form of ordinary differential equations). Some of the necessary biological underpinnings are added via a related project, the SBGN [41] standard, which defines how different SBML entities can be represented graphically. In SBML, fine-grained annotation or specification of complex contexts is possible through special fields set aside for user data. However, as these fields are not bound by the standard, they may not be supported by all of the tools and therefore generally would not be used in evaluation of the model.

The SBGN PD language [41] was designed to give an unambiguous representation of the mechanism of action within a biological pathway that can be interpreted visually to facilitate manual exploration. The PD language can represent all the steps in, for example, a reaction, showing how a biological entity changes in a temporal manner. However, fully exploiting the PD representation for applications like numerical modelling would require additional data associated with component steps (such as rate constants), which are often not readily available.
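
To make concrete why rate constants matter for numerical modelling, the following minimal sketch integrates a single hypothetical conversion step (A is converted to B) as a pair of ordinary differential equations. It does not use SBML or any SBML library; the species, the rate constant `k` and the forward-Euler scheme are assumptions chosen purely for illustration.

```python
def simulate_conversion(a0: float, k: float, dt: float, steps: int):
    """Forward-Euler integration of dA/dt = -k*A, dB/dt = +k*A (A is converted to B)."""
    a, b = a0, 0.0
    trajectory = [(0.0, a, b)]
    for i in range(1, steps + 1):
        flux = k * a * dt  # amount of A converted during this time step
        a -= flux
        b += flux
        trajectory.append((i * dt, a, b))
    return trajectory


# Hypothetical rate constant and initial amount, for illustration only.
traj = simulate_conversion(a0=1.0, k=0.5, dt=0.01, steps=1000)
t_end, a_end, b_end = traj[-1]
print(f"t={t_end:.1f}  A={a_end:.4f}  B={b_end:.4f}")
```

The diagram of the step (A gives B) fixes only the structure of the equations. Without a value for `k`, the model cannot be evaluated, which is the gap between a purely graphical process description and a simulation-ready model.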

Another prominent data exchange standard is BioPAX [49], which was primarily designed for unambiguous sharing of biochemical pathway data and therefore offers a much wider selection of domain-specific terms to characterize and categorize both interactors and interactions. As the Web Ontology Language (OWL) forms the core of this standard, all of the terms are organized hierarchically, and relationships between them are formally defined. The BioPAX standard aims to facilitate data sharing and integration via Semantic Web technologies and the Resource Description Framework (RDF). If an RDF representation is used, the format can be easily extended to incorporate complex context information; however, BioPAX currently does not aim to offer the means of defining experimental or biomedical context for pathways, as, by design, other standards/ontologies can be brought in to model them.

Both SBML and BioPAX are predicated on the developments of the past decade, when several key technologies and design principles were introduced that greatly influenced how biomedical data-sharing standards are implemented. One of these developments was increasing adoption of formal ontologies, which allowed development of well-structured controlled vocabularies for different domains and definition of cross-mappings between them. For example, the SBML standard allows annotation of models with Systems Biology Ontology (SBO) terms, which are also used in SBGN and can be mapped on to BioPAX via SBPAX [50]. Therefore, in principle, it is possible to integrate models across all of these standards on a qualitative level. On a syntactic level, different formalisms can be integrated by introducing support for a low-level common language like RDF, which is representation-agnostic and allows all types of data to be linked via common identifiers. Both of these solutions enable modern standards to be designed in a modular way, where different specialized representations can be developed by different communities of domain experts and then combined together, or even extended for the needs of a particular applied project.
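
The idea of linking data via common identifiers can be sketched with plain subject-predicate-object triples. The example below is only an illustration of the principle: it uses Python sets rather than an RDF library, and every identifier and predicate (the `ex:` names) is a hypothetical placeholder, not a term from any real vocabulary.

```python
# Each source contributes (subject, predicate, object) triples.
# All identifiers below are hypothetical placeholders.
pathway_source = {
    ("ex:protein1", "ex:participatesIn", "ex:pathwayA"),
    ("ex:protein2", "ex:participatesIn", "ex:pathwayA"),
}
disease_source = {
    ("ex:protein1", "ex:associatedWith", "ex:diseaseX"),
}

# Because both sources use the same identifier for protein1, integration is a set union.
merged = pathway_source | disease_source


def describe(subject: str, triples: set) -> dict:
    """Collect every predicate/object pair known for one subject."""
    facts: dict = {}
    for s, p, o in triples:
        if s == subject:
            facts.setdefault(p, set()).add(o)
    return facts


print(describe("ex:protein1", merged))
```

If the two sources had used different identifiers for the same protein, the union would silently keep them apart, which is why shared vocabularies and cross-mappings are a precondition for this kind of syntactic integration.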

In recent years, community efforts have been increasingly devoted to improving inter-compatibility across all main biomedical standards. These efforts include development of converters, controlled vocabulary mappings and better support of different standards both by data providers and analytical tool developers. Greater compatibility also means that smaller models (e.g. from individual studies) can now be meaningfully shared and combined. One prominent resource that aims to facilitate this process is NDEx [51]. NDEx allows researchers to share their models in a variety of different formats, which can then be used directly, by using the NDEx website like a data warehouse, or integrated into a common representation by using the NDEx Cytoscape [52] plug-in.

Pathway-based representations enable relevant methods to incorporate the directionality and interaction type (e.g. up- or down-regulation), and in this way, they can capture causal associations underlying different disease mechanisms. To illustrate this application, we have used the ‘Tied Diffusion through Interacting Events’ method [53], which can identify significantly implicated pathways connecting two sets of genes (in this case, the DisGeNET asthma signature and our severe asthma signature). The diffusion model used by this method can take into account magnitude of effects, direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome found to be significant for these two sets was visualized in Cytoscape (Figure 3). It is possible to see that two interleukin genes (both involved in airway inflammation [54]) were found to be particularly critical, with the lower panel showing key associations involved.
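
The general intuition of connecting two gene sets by network diffusion can be sketched as follows. This is a deliberately simplified stand-in and not the TieDIE algorithm [53]: it uses a random walk with restart on an undirected, unsigned toy graph, whereas the published method also accounts for effect magnitude, direction and interaction type. The graph, node names and parameter values are hypothetical.

```python
def diffuse(graph: dict, seeds: set, alpha: float = 0.5, iterations: int = 50) -> dict:
    """Random walk with restart: spread score from seed nodes over an undirected graph."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    score = dict(restart)
    for _ in range(iterations):
        score = {
            n: (1 - alpha) * restart[n]
            + alpha * sum(score[nb] / len(graph[nb]) for nb in graph[n])
            for n in graph
        }
    return score


# Toy path network A - B - C - D - E with hypothetical node names.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
from_set1 = diffuse(graph, seeds={"A"})
from_set2 = diffuse(graph, seeds={"E"})

# A node links the two sets only if it is reached from both, so keep the smaller score.
linker = {n: min(from_set1[n], from_set2[n]) for n in graph}
best = max(linker, key=linker.get)
print(best, round(linker[best], 4))
```

On this toy path, the middle node C receives the highest linker score because it is the only node that is reasonably close to both seed sets. In a real analysis, the significance of such scores would need to be assessed against a suitable background, which this sketch omits.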

Contextualization by knowledge graph representations

There are many specialist sources of disease-related information, such as databases of mutations and their consequences. In addition, there is vast information in unstructured formats in the scientific literature, and much of it is contextual, for example, statements that describe correlations between gene products, or between small molecules and gene products, that are observed in particular cell types. Information in the literature may also describe weak associations only seen in certain conditions, or for which there may be limited evidence, that could nevertheless still be useful for hypothesis generation.

The Biological Expression Language (BEL) [55] represents statements in biology, typically correlative and causative statements, as well as the context in which these statements apply. A biological statement is represented as a triple (subject, predicate and object), where the subject is a BEL term and the predicate is a biological relationship that connects the subject with the object. A BEL term can be the abundance of a biological entity, or it can be a biological process such as one described in the Gene Ontology (GO) [56], or even a given disease condition. Each entity is described in an existing namespace (e.g. from databases like ChEBI [57]) or in particular user-defined namespace(s). There is a fixed set of causal, correlative and other relationships that link these entities. The BEL compiler checks that the statements are syntactically correct and carries out the integration and alignment of the data by establishing equivalences between identical concepts.

The knowledge captured by BEL statements can be represented as a graph. The object can be a BEL term or a BEL statement. Each statement can be associated with metadata that contextualizes it, for example, qualifying it to be true only in specific tissues. The BEL knowledge network can facilitate data interpretation using a reverse causal reasoning approach [58]. For example, gene expression data can be mapped to smaller subnetworks that represent cause and effect models. The extent to which the models can explain the measurements of differentially expressed genes can then be assessed, potentially yielding insight into the mechanism.
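A minimal sketch of this idea is given below, assuming that statements are stored as subject-relation-object triples with optional context, and that a hypothesis is scored simply by counting the measurements it predicts correctly or incorrectly. The term strings only imitate BEL notation, the EX namespace and gene names are hypothetical, and the published reverse causal reasoning method [58] uses more elaborate statistics than these raw counts.

```python
from dataclasses import dataclass

# Sign of the effect that each causal relationship has on its object.
SIGN = {"increases": 1, "decreases": -1}


@dataclass(frozen=True)
class Statement:
    subject: str
    relation: str          # "increases" or "decreases"
    obj: str
    context: tuple = ()    # e.g. (("tissue", "lung"),)


def score_hypothesis(statements, upstream, direction, observed):
    """Count downstream measurements that agree or disagree with the hypothesis.

    direction: +1 if the upstream node is assumed increased, -1 if decreased.
    observed:  mapping from term to +1 (up-regulated) or -1 (down-regulated).
    """
    concordant = contradictory = 0
    for s in statements:
        if s.subject != upstream or s.obj not in observed:
            continue
        predicted = direction * SIGN[s.relation]
        if predicted == observed[s.obj]:
            concordant += 1
        else:
            contradictory += 1
    return concordant, contradictory


# Hypothetical cause-and-effect model and differential expression calls.
model = [
    Statement("p(EX:REGULATOR)", "increases", "r(EX:GENE1)", (("tissue", "lung"),)),
    Statement("p(EX:REGULATOR)", "decreases", "r(EX:GENE2)"),
    Statement("p(EX:REGULATOR)", "increases", "r(EX:GENE3)"),
]
observed_changes = {"r(EX:GENE1)": 1, "r(EX:GENE2)": -1, "r(EX:GENE3)": -1}

result = score_hypothesis(model, "p(EX:REGULATOR)", 1, observed_changes)
print(result)
```

In this toy case, the hypothesis 'REGULATOR is increased' explains two of the three measurements and contradicts one; the opposite hypothesis would explain only one.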

Hofmann-Apitius [5] posits that an understanding of disease mechanisms involves integrating information at multiple biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer's disease [59–62]. They used BEL to represent information from the literature, together with relevant experimental data sets, to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge 'grey zone', which includes both established and emerging connections between entities. It can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, TM will be increasingly important for supporting this process [63]. Existing examples include the BEL Information Extraction workFlow (BELIEF) system and BELSmile [64]. These frameworks use automated TM to detect BEL concepts and relationships, and then allow curators to explore and annotate their biological contexts to create final high-quality networks.

Figure 3. Significant pathways identified using the Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and the severe asthma differential expression signature were considered in the context of the Reactome regulatory pathway network. In both panels, colour intensity indicates the weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of 'component' type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible forms: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.
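The structure described above can be sketched as a small data class that keeps an assertion together with its provenance and publication metadata. This is only an illustration in plain Python, not an RDF serialization of the standard: the class and field names are our own, and every identifier under example.org is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass
class Nanopublication:
    uri: str
    assertion: tuple        # (subject, predicate, object)
    provenance: dict        # how the assertion was obtained
    publication_info: dict  # metadata about the nanopublication itself

    def to_quads(self):
        """Flatten into (graph, subject, predicate, object) tuples."""
        assertion_graph = self.uri + "#assertion"
        quads = [(assertion_graph,) + self.assertion]
        for predicate, value in self.provenance.items():
            # Provenance statements are about the assertion as a whole.
            quads.append((self.uri + "#provenance", assertion_graph, predicate, value))
        for predicate, value in self.publication_info.items():
            # Publication metadata is about the nanopublication itself.
            quads.append((self.uri + "#pubinfo", self.uri, predicate, value))
        return quads


EX = "http://example.org/"  # hypothetical namespace
nanopub = Nanopublication(
    uri=EX + "np/1",
    assertion=(EX + "gene/GENE1", EX + "rel/associatedWith", EX + "disease/asthma"),
    provenance={EX + "rel/derivedFrom": EX + "study/S1"},
    publication_info={EX + "rel/createdBy": EX + "person/curator1"},
)
quads = nanopub.to_quads()
for quad in quads:
    print(quad)
```

Keeping provenance attached to each individual assertion, rather than to a whole data set, is what allows provenance to be tracked at the finest level of granularity.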

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called the 'COPD knowledge base' [67–70]. The authors of the COPD knowledge base acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The Malacards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) with respect to organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and Malacards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries can be challenging to realize in relational databases because of the cost of computing statements that span different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in cases where links only become apparent once multiple sources have been integrated but are not immediately obvious because of the sheer amount of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, such as an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is the Neo4j DBMS. Neo4j is a Java-based solution that also offers its own graph query language (Cypher), which is conceptually similar to SQL.
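To make the notion of a traversal-type query concrete, the sketch below walks a small typed graph held in memory and returns every path from a starting gene to any node of a requested type. It is a plain Python stand-in for what a graph database would do natively; the function name, the edge list, the node names and the hop limit are all assumptions made for illustration.

```python
from collections import deque


def find_paths(edges, node_type, start, target_type, max_hops=3):
    """Breadth-first traversal returning simple paths from start to nodes of target_type."""
    adjacency = {}
    for source, _relation, dest in edges:
        # Edges are followed in both directions for this sketch.
        adjacency.setdefault(source, []).append(dest)
        adjacency.setdefault(dest, []).append(source)

    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and node_type[node] == target_type:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbour in adjacency.get(node, []):
            if neighbour not in path:  # avoid revisiting nodes
                queue.append(path + [neighbour])
    return paths


# Hypothetical heterogeneous graph: genes, a pathway and a drug.
node_type = {"GENE1": "gene", "GENE2": "gene", "PATHWAY1": "pathway", "DRUG1": "drug"}
edges = [
    ("GENE1", "participates_in", "PATHWAY1"),
    ("GENE2", "participates_in", "PATHWAY1"),
    ("DRUG1", "targets", "GENE2"),
]
drug_paths = find_paths(edges, node_type, "GENE1", "drug")
print(drug_paths)
```

The link between GENE1 and DRUG1 exists only through the shared pathway, which is the kind of indirect relationship that becomes visible once several sources are integrated. In a relational schema, each hop would typically correspond to a join between tables.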

Semantic Web [79], RDF [80] and Linked Open Data [81] are important emerging standards for sharing biomedical data. At its core, the Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where the subject and object are entities with a unique identifier and the predicate defines the type of relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL. This RDF is generally in the form of statements that provide meaningful information about that entity and that can, in turn, link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
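The three-part statement structure can be illustrated with plain tuples and a pattern-matching helper in which None acts as a wildcard. This is a sketch of the data model only: it does not use an RDF library or a query language, and the example.org namespace, the predicate names and the helper functions are hypothetical.

```python
EX = "http://example.org/"  # hypothetical namespace

# Each statement is a (subject, predicate, object) triple of identifiers.
triples = [
    (EX + "gene/GENE1", EX + "rel/encodes", EX + "protein/PROT1"),
    (EX + "protein/PROT1", EX + "rel/participatesIn", EX + "pathway/PATH1"),
    (EX + "gene/GENE2", EX + "rel/encodes", EX + "protein/PROT2"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None matches anything."""
    return [t for t in store
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]


def follow(store, start, predicates):
    """Follow a chain of predicates from start via shared identifiers."""
    frontier = {start}
    for predicate in predicates:
        frontier = {t[2] for node in frontier for t in match(store, node, predicate)}
    return frontier


encoded = match(triples, predicate=EX + "rel/encodes")
pathways = follow(triples, EX + "gene/GENE1",
                  [EX + "rel/encodes", EX + "rel/participatesIn"])
print(pathways)
```

Because the protein identifier is the same string in both statements, the two triples connect without any further mapping step, which is the practical benefit of globally unique identifiers.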

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard, useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, an additional set of standards will usually be necessary for meaningful interpretation of the data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include the EBI RDF platform [83] (which covers important databases such as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (which aims to mirror most of the prominent biological databases in RDF) [89] and OpenPhacts (which uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include the Experimental Factor Ontology (EFO) [91], SBO [92], the Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used the Neo4j-based Hetionet v1.0 resource [95], which integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, the severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of the query-specific degree of drug nodes (i.e. the number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
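The ranking step can be sketched as a simple count, assuming the query has already returned a list of drug-protein links. The drug and protein names below are placeholders and not results from Hetionet, and the function name is our own.

```python
from collections import Counter


def rank_by_query_degree(drug_protein_links, returned_proteins, top_k=3):
    """Rank drugs by the number of distinct returned proteins they link to."""
    unique_links = set(drug_protein_links)  # ignore duplicated rows
    degree = Counter(drug for drug, protein in unique_links
                     if protein in returned_proteins)
    return degree.most_common(top_k)


# Placeholder query output: (drug, protein) pairs.
links = [
    ("DRUG_A", "P1"), ("DRUG_A", "P2"), ("DRUG_A", "P2"),
    ("DRUG_B", "P1"),
    ("DRUG_C", "P9"),  # P9 is not among the returned proteins
]
ranking = rank_by_query_degree(links, {"P1", "P2"})
print(ranking)
```

The degree is query specific because only links to proteins returned by the query are counted, so a drug with many targets elsewhere in the network does not rank highly by default.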

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data would typically require different levels of both preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are the construction of co-expression networks [98, 99] from transcriptomic data and of sequence homology graphs [100, 101]. However, manual interpretation and publishing of inferred models as free text in scientific papers currently remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from the source literature.

Figure 4. Query expressed in natural language and the resulting output produced for the differential expression gene signature using the Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases. (B) Query to explore the connections between the drug niclosamide, asthma and the differential expression signature, linked via relevant proteins and pathways.

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences; recognition of concepts and relationships between them; or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark of the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In the simplest case, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiles data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
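A co-occurrence network of this kind can be sketched by counting how often two concepts are mentioned in the same document and keeping the pairs seen at least a minimum number of times. The example assumes that concept mentions have already been recognized; the concept names and the threshold are hypothetical, and real resources typically apply statistical scoring rather than a raw count cut-off.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(documents, min_count=2):
    """Return {(concept_a, concept_b): count} for pairs co-mentioned often enough."""
    counts = Counter()
    for concepts in documents:
        # Each unordered pair is counted once per document.
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical concept mentions, one list per paper.
papers = [
    ["miR-X", "cancer_A"],
    ["miR-X", "cancer_A", "gene_Y"],
    ["gene_Y", "cancer_B"],
]
network = cooccurrence_network(papers)
print(network)
```

Only the miR-X and cancer_A pair passes the threshold here, because every other pair appears in a single paper.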

Extraction of information from text is contingent on unambiguous identification of entities and relationship types, from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions, and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from the literature by fully automated approaches are offered by STRING [13, 117].
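The role of a controlled vocabulary in resolving alternative names can be sketched as a dictionary lookup from synonyms to a common identifier. The identifiers in the EX namespace are invented for this example and do not come from any real ontology, and practical systems use far more robust matching than lower-casing.

```python
def build_lookup(vocabulary):
    """Map every lower-cased label or synonym to its concept identifier."""
    lookup = {}
    for identifier, names in vocabulary.items():
        for name in names:
            lookup[name.lower()] = identifier
    return lookup


def normalize(mentions, lookup):
    """Pair each text mention with its identifier, or None if unrecognized."""
    return [(mention, lookup.get(mention.lower())) for mention in mentions]


# Hypothetical controlled vocabulary: identifier -> accepted names.
vocabulary = {
    "EX:0001": ["interleukin 1 beta", "IL-1b", "IL1B"],
    "EX:0002": ["asthma", "bronchial asthma"],
}
lookup = build_lookup(vocabulary)
normalized = normalize(["IL1B", "Bronchial asthma", "unknown term"], lookup)
print(normalized)
```

Once mentions are mapped to shared identifiers, statements recovered from different papers can be merged even when the authors used different names for the same entity.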

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows; e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions about the interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange have been developed, for example, the XML-based BioC format [119].

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge for the first time included a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and obtained the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to the associated subjects/objects relative to that verb. Notably, the designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
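For readers unfamiliar with the metric, the F-measure combines precision and recall into a single score. The sketch below computes it for extracted statements under the assumption that a prediction counts as correct only if it exactly matches a gold-standard statement; the challenge evaluation itself may have used different matching criteria, and the statements shown are invented.

```python
def f_measure(predicted, gold, beta=1.0):
    """Return (precision, recall, F) for predicted versus gold-standard statements."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    beta_sq = beta ** 2
    f_score = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_score


# Invented statements as (subject, relation, object) triples.
gold_statements = {
    ("A", "increases", "B"), ("D", "increases", "E"),
    ("H", "decreases", "I"), ("J", "increases", "K"),
}
extracted = {
    ("A", "increases", "B"), ("D", "increases", "E"),
    ("A", "decreases", "C"), ("F", "increases", "G"),
}
precision, recall, f_score = f_measure(extracted, gold_statements)
print(precision, recall, f_score)
```

With beta equal to 1, the score is the harmonic mean of precision and recall, so a system must both find statements and find them correctly to score well.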

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is 'trapped' in scientific text, in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question, therefore, becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen [6], modern biomedical data modelling formalisms can be categorized into 'curation' and 'mechanistic' types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). Although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to deal effectively with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable for assisting with interpretation, management and mining of experimental data, as well as for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA's Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. Future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on the one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such a form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and, ultimately, clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies' in kind contributions. The work was also partially supported by a Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and by Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA's big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.


17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson's disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer's disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson's disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O'Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.


Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bby025/4978142 by guest on 08 January 2019

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology: a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting, Boston, USA. Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.


99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.


biological scales. To some extent, BEL offers a multiscale representation that other languages do not. Using BEL, Hofmann-Apitius and co-workers have explored mechanistic aspects of Alzheimer’s disease [59–62]. They used the BEL language to represent information from the literature together with relevant experimental data sets to build integrated evidence networks. This approach helps to mine what they refer to as the knowledge ‘grey zone’, which includes both established and emerging connections between entities. It can be powerful for linking mechanisms associated with biomarkers [60].

Constructing such information-rich representations can involve considerable manual curation efforts. For this reason, TM will be increasingly important for supporting this process [63]. Existing examples include the BEL Information Extraction workFlow (BELIEF) system and BELSmile [64]. These frameworks use automated TM to detect BEL concepts and relationships, and then allow curators to explore and annotate their biological contexts to create final high-quality networks.

Another example of a specialized curation data modelling format is the Nanopublication standard [65]. Nanopublication defines rules for publishing data in its simplest possible forms: individual assertions in the form of statements composed of a subject, predicate and object, plus supporting provenance information and relevant metadata. Nanopublication is

Figure 3. Significant pathways identified using the Tied Diffusion through Interacting Events method, an approach that can extract the most likely set of interconnectors between two sets of nodes. In this example, core asthma genes from DisGeNET and the severe differential expression signature were considered in the context of the Reactome regulatory pathway network. In both panels, colour intensity indicates the weight magnitude assigned by the algorithm. (A) Complete set of all significant genes. Dashed lines indicate undirected edges of ‘component’ type, whereas solid edges show directed interactions. (B) Subset of the network relating to the IL1RN gene. Arrow and bar-terminated edge styles show activation and inhibition, respectively.


designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.
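As a rough illustration of the structure described above, a nanopublication can be sketched as a single assertion triple bundled with its provenance and publication metadata. The class names, field names and `example.org` identifiers below are hypothetical placeholders rather than part of the Nanopublication specification; actual nanopublications are serialized as RDF named graphs.

```python
from dataclasses import dataclass, field

# Hypothetical minimal model: one assertion plus provenance and metadata.
@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

@dataclass
class Nanopublication:
    assertion: Triple
    provenance: dict = field(default_factory=dict)  # where the assertion came from
    pubinfo: dict = field(default_factory=dict)     # who published it

    def to_statements(self):
        """Flatten into (graph, subject, predicate, object) tuples."""
        a = self.assertion
        rows = [("assertion", a.subject, a.predicate, a.obj)]
        rows += [("provenance", "assertion", k, v)
                 for k, v in sorted(self.provenance.items())]
        rows += [("pubinfo", "nanopub", k, v)
                 for k, v in sorted(self.pubinfo.items())]
        return rows

# All identifiers are invented for illustration.
np1 = Nanopublication(
    assertion=Triple("http://example.org/gene/G1",
                     "http://example.org/vocab/associatedWith",
                     "http://example.org/disease/D1"),
    provenance={"derivedFrom": "http://example.org/paper/P1"},
    pubinfo={"creator": "http://example.org/curator/C1"},
)
for row in np1.to_statements():
    print(row)
```

Keeping provenance attached to each individual assertion, rather than to a whole data set, is what allows tracking at the finest level of granularity.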

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called the ‘COPD knowledge base’ [67–70]. The authors of the COPD knowledge base acknowledge that much disease-related information remains hidden in the literature, and also recognize the difficulties of capturing quality and context specificity. The Malacards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases, and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and Malacards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries could be challenging to realize in relational databases because of the cost of computing statements across different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in cases where links only become apparent once multiple sources have been integrated, but may not be immediately obvious because of the sheer amounts of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is the Neo4J DBMS. Neo4J is a Java-based solution, which also offers its own graph query language (Cypher), which is conceptually similar to SQL.
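To make the notion of a traversal-type query more concrete, the following minimal sketch walks a small in-memory graph of mixed entity types and reports nodes of a requested type within a given number of hops. The node names, types and edges are invented for illustration, and the breadth-first search is only a stand-in for what a graph database engine would do internally.

```python
from collections import deque

# Toy heterogeneous graph; node names and types are invented for illustration.
NODE_TYPE = {
    "geneA": "gene", "geneB": "gene",
    "pathway1": "pathway",
    "drugX": "drug",
    "diseaseY": "disease",
}
EDGES = [
    ("geneA", "pathway1"),
    ("geneB", "pathway1"),
    ("drugX", "geneB"),
    ("geneB", "diseaseY"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def traverse(adj, start, target_type, max_hops):
    """Breadth-first traversal returning {node: hops} for nodes of target_type."""
    seen = {start: 0}
    queue = deque([start])
    hits = {}
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(adj.get(node, ())):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
                if NODE_TYPE[nxt] == target_type:
                    hits[nxt] = seen[nxt]
    return hits

adj = build_adjacency(EDGES)
print(traverse(adj, "geneA", "drug", max_hops=3))
```

In a relational layout, each hop in this walk would typically correspond to a join across tables, which is why such queries become costly as the path length grows.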

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where the subject and object are entities with a unique identifier and the predicate defines the type of relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity and can in turn link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
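The subject, predicate and object structure can be sketched with plain tuples, as below. This is a minimal illustration that assumes URI-only statements; the `example.org` identifiers and the helper names `to_ntriples` and `match` are hypothetical, and a real application would more likely rely on a dedicated RDF library and resolvable URLs.

```python
# Hypothetical identifiers under example.org; real data would use resolvable URLs.
EX = "http://example.org/"

triples = [
    (EX + "protein/P1", EX + "vocab/participatesIn", EX + "pathway/W1"),
    (EX + "protein/P1", EX + "vocab/encodedBy", EX + "gene/G1"),
    (EX + "gene/G1", EX + "vocab/associatedWith", EX + "disease/D1"),
]

def to_ntriples(statements):
    """Serialize URI-only triples in N-Triples syntax: <s> <p> <o> ."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in statements)

def match(statements, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in statements
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(to_ntriples(triples))

# Following links between entities: protein -> gene -> disease.
for _, _, gene in match(triples, s=EX + "protein/P1", p=EX + "vocab/encodedBy"):
    for _, _, disease in match(triples, s=gene, p=EX + "vocab/associatedWith"):
        print(disease)
```

Because every entity is named by a shared identifier, statements from independent sources can be merged simply by concatenating their triple lists.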

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard, useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, usually an additional set of standards will be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include the EBI RDF platform [83] (which covers such important databases as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (which aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (which uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include Experimental Factor Ontology (EFO) [91], SBO [92], Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used a Neo4j-based Hetionet v1.0 resource [95] that integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, the severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of the query-specific degree of drug nodes (i.e. the number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
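The ranking by query-specific degree mentioned above can be sketched as follows. The drug and protein labels are invented placeholders standing in for rows returned by a graph query; they are not the results reported in this review, and `rank_by_query_degree` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented (drug, protein) links standing in for rows returned by a graph query.
returned_links = [
    ("drug_A", "P1"), ("drug_A", "P2"), ("drug_A", "P3"),
    ("drug_B", "P1"), ("drug_B", "P2"),
    ("drug_C", "P3"),
    ("drug_A", "P1"),  # duplicate row; counted once
]

def rank_by_query_degree(links, top_n=3):
    """Rank drugs by the number of distinct returned proteins they link to."""
    targets = defaultdict(set)
    for drug, protein in links:
        targets[drug].add(protein)
    # Sort by descending degree, breaking ties alphabetically for stability.
    ranked = sorted(targets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(drug, len(proteins)) for drug, proteins in ranked[:top_n]]

print(rank_by_query_degree(returned_links))
```

Note that this degree is specific to the query result: a drug with many targets overall may still rank low if few of them fall within the returned set of proteins.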

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data would typically require both different levels of preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are construction of co-expression networks [98, 99] from transcriptomic data and sequence homology graphs [100, 101]. However, manual interpretation and publishing of inferred models as free text in scientific papers currently remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.
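As one example of a representation that can be derived directly from raw data, a co-expression network can be sketched by linking genes whose expression profiles are strongly correlated. The expression values, gene labels and the 0.9 threshold below are invented for illustration; the resources cited above use considerably more elaborate procedures and quality assessment.

```python
from itertools import combinations
from math import sqrt

# Invented expression values (keys: genes, list positions: samples).
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 1.0, 3.0],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(expr, threshold=0.9):
    """Link gene pairs whose absolute Pearson correlation meets the threshold."""
    edges = []
    for a, b in combinations(sorted(expr), 2):
        r = pearson(expr[a], expr[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges

for edge in coexpression_edges(expression):
    print(edge)
```

Using the absolute value of the correlation keeps strongly anti-correlated pairs as edges; whether that is appropriate depends on the biological question being asked.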

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are


Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using the Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases. (B) Query to explore the connections between the niclosamide drug, asthma and the differential expression signature, linked via relevant proteins and pathways.


becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them, or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In their simplest form, statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiled data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
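A co-occurrence approach of this kind can be sketched in a few lines. The mini-corpus, the term list and the `min_count` cut-off below are invented for illustration, and the naive substring matching is an assumption made for brevity; production systems rely on curated dictionaries, proper tokenization and statistical scoring of the counts.

```python
from collections import Counter
from itertools import combinations

# Invented mini-corpus and term list, for illustration only.
abstracts = [
    "geneA expression is altered in disease X samples",
    "disease X progression was studied alongside geneA and geneB",
    "geneB was profiled in healthy tissue",
]
terms = ["geneA", "geneB", "disease X"]

def cooccurrence_counts(docs, vocabulary):
    """Count documents in which each pair of vocabulary terms appears together."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        # Naive case-insensitive substring match; real systems tokenize first.
        present = sorted(t for t in vocabulary if t.lower() in text)
        counts.update(combinations(present, 2))
    return counts

def network_edges(counts, min_count=2):
    """Keep only pairs that co-occur in at least min_count documents."""
    return sorted(pair for pair, n in counts.items() if n >= min_count)

counts = cooccurrence_counts(abstracts, terms)
print(network_edges(counts))
```

Raw counts favour frequently mentioned terms, so practical methods usually normalize them against the individual frequency of each term before drawing an edge.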

Extraction of information from text is contingent on unambiguous identification of entities and relationship types from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions, and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].
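The role of a controlled vocabulary in resolving alternative names can be sketched as a lookup from surface forms to a shared identifier. The `EX:` identifiers, the synonym table and the relation label below are hypothetical and do not correspond to entries in any real ontology.

```python
# Invented synonym table mapping surface forms to hypothetical identifiers.
SYNONYMS = {
    "il1rn": "EX:0001",
    "il-1ra": "EX:0001",
    "interleukin-1 receptor antagonist": "EX:0001",
    "asthma": "EX:0002",
    "bronchial asthma": "EX:0002",
}

def normalize(mention):
    """Map a text mention to a shared identifier, or None if unknown."""
    return SYNONYMS.get(mention.strip().lower())

def normalize_statements(statements):
    """Normalize (subject, relation, object) mentions and drop duplicates."""
    unique = set()
    for subj, relation, obj in statements:
        s_id, o_id = normalize(subj), normalize(obj)
        if s_id is not None and o_id is not None:
            unique.add((s_id, relation, o_id))
    return sorted(unique)

# Two differently worded mentions of the same illustrative statement.
extracted = [
    ("IL-1ra", "associated_with", "bronchial asthma"),
    ("IL1RN", "associated_with", "Asthma"),
    ("unknown protein", "associated_with", "asthma"),  # not in vocabulary
]
print(normalize_statements(extracted))
```

Once mentions are reduced to identifiers, statements recovered from different papers can be compared and merged regardless of the naming convention each paper used.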

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge included, for the first time, a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and obtained the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, the designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
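For readers less familiar with the metric, the F-measure combines precision and recall into a single score. The sketch below uses the generic set-based definition with invented statement labels; it assumes exact matching of whole statements and does not reproduce the specific evaluation criteria used in the challenge.

```python
def f_measure(predicted, gold, beta=1.0):
    """F-measure of predicted items against a gold standard (exact matching)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Invented statement labels: one of five predictions matches the gold standard.
gold_statements = ["s1", "s2", "s3", "s4", "s5"]
predicted_statements = ["s1", "x1", "x2", "x3", "x4"]

score = f_measure(predicted_statements, gold_statements)
print(f"F-measure: {score:.2f}")
```

With the default `beta=1.0`, precision and recall are weighted equally, so a system must both find many gold statements and avoid spurious ones to score well.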

It is clear that at present a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question therefore becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). Although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent


[124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. Future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on the one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and ultimately clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies' in kind contributions. The work was also partially supported by a Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA's big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.

17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson's disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer's disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson's disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O'Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer's disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer's disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO, Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting, Boston, USA. Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.

99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer's disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.

designed to be used in combination with Semantic Web technologies and uses RDF, a format that allows seamless incorporation of ontology terms and cross-references through globally unique shared identifiers. The advantages of this standard include decentralized (federated) publishing, machine-readable representation of data and an ability to track provenance at the highest possible level of granularity.

Knowledge network representations of biological data are increasingly being leveraged in integrative analysis platforms, which aim to combine information resources with common analysis and visualization tasks. One example of such platforms is Ingenuity Pathway Analysis (IPA®). The IPA® suite aims to facilitate interpretation of experimental data from high-throughput omics experiments and includes methods for identification of causal networks and visual exploration of biological graphs [66]. The BioMax BioXM™ platform uses an object relational database architecture to store, query and envision disease information, including molecular, genetic, physiological and clinical data, as well as biological models. This platform has been used to create a resource for chronic obstructive pulmonary disease (COPD) called 'COPD knowledge base' [67–70]. The authors of COPD knowledge base acknowledge that much disease-related information remains hidden in the literature and also recognize the difficulties of capturing quality and context specificity. The Malacards resource [71] uses a relational database for the semantic integration of multiple sources of information about diseases and also provides network visualizations of disease–disease relationships. GeneAnalytics [72] enables contextualization of functional signatures (gene sets) in the context of organs, tissues and compartments, as well as other annotation sources, building on the GeneCards [73] and Malacards technologies.

Recent advances in graph-based databases have opened up additional possibilities for organizing heterogeneous and interlinked data. Graph databases have several features that make them particularly attractive for management and exploratory analysis of complex biological data sets. A particularly noteworthy feature is the excellent performance they offer for traversal-type queries across entities of diverse types. Such queries could be challenging to realize in relational databases because of the cost of computing statements across different tables. Traversal queries can be an important way of identifying new relationships in integrated data, particularly in cases where links only become apparent once multiple sources have been integrated, but may not be immediately obvious because of the sheer amount of data involved. Modern graph databases offer sophisticated query languages and often also other types of framework-specific functionality to facilitate applied analysis, like an ability to extend the framework with user-defined functions [74] and integration with ontologies and reasoning systems. One example of a graph database solution that has already been used in the biomedical domain [75–78] is the Neo4J DBMS. Neo4J is a Java-based solution, which also offers its own graph query language (Cypher), which is conceptually similar to SQL.
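As an illustration of what a traversal-type query does, the following minimal Python sketch performs a breadth-first search over a small in-memory graph of typed nodes. It is not how Neo4J or any other graph database is implemented; the node names, the edge list and the helper functions `build_adjacency` and `shortest_path` are hypothetical and chosen only for illustration.

```python
from collections import deque

# Toy heterogeneous graph. Every node is a (type, name) pair; all names are
# hypothetical placeholders, not entries from any real database.
EDGES = [
    (("Gene", "GENE_A"), ("Pathway", "PATHWAY_1")),
    (("Gene", "GENE_B"), ("Pathway", "PATHWAY_1")),
    (("Gene", "GENE_B"), ("Disease", "DISEASE_X")),
    (("Drug", "DRUG_1"), ("Gene", "GENE_A")),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first traversal returning one shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorting makes the traversal order deterministic.
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


adjacency = build_adjacency(EDGES)
path = shortest_path(adjacency, ("Drug", "DRUG_1"), ("Disease", "DISEASE_X"))
print(" -> ".join(f"{kind}:{name}" for kind, name in path))
```

In this toy graph the drug and the disease are connected only through two genes that share a pathway, which is the kind of indirect link that becomes visible after several sources are integrated. In a graph database, such a search would normally be expressed in the query language rather than written by hand.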

Semantic Web [79], RDF [80] and Linked Open Data [81] are emerging important standards for sharing biomedical data. At its core, Semantic Web is characterized by its use of globally unique, unambiguous identifiers for entities. The identifier used for this purpose is commonly a web link (URL). The RDF format is structured as a set of statements with three parts (subject, predicate and object), where the subject and object are entities with a unique identifier and the predicate defines the type of relationship between them. The Linked Open Data standard established a set of rules for providing meaningful RDF at the resolvable identifier URL, which generally is in the form of statements that provide meaningful information about that entity and can, in turn, link it to other URL-identified entities. Finally, the entities and statements can be linked to specialized ontologies in OWL [82] that provide additional information about how to interpret them within a particular context.
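The subject, predicate and object structure described above can be sketched in a few lines of Python. This is only a toy model of RDF-style statements, not an RDF library; every URL below is a made-up identifier under example.org, and the `match` helper is a hypothetical name introduced for illustration.

```python
# Each statement is a (subject, predicate, object) triple. All URLs are
# hypothetical identifiers under example.org, not real resource identifiers.
EX = "http://example.org/"

TRIPLES = [
    (EX + "gene/GENE_A", EX + "vocab/encodes", EX + "protein/PROTEIN_A"),
    (EX + "protein/PROTEIN_A", EX + "vocab/participatesIn", EX + "pathway/PATHWAY_1"),
    (EX + "gene/GENE_A", EX + "vocab/associatedWith", EX + "disease/DISEASE_X"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# All statements made about one entity, printed in an N-Triples-like form.
about_gene = match(TRIPLES, subject=EX + "gene/GENE_A")
for s, p, o in about_gene:
    print(f"<{s}> <{p}> <{o}> .")
```

Because the object of one statement can be the subject of another, following shared identifiers from triple to triple is what lets a set of such statements be read as a network.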

From the perspective of biomedical data integration, RDF primarily plays the role of a low-level data exchange standard useful for addressing the syntactic heterogeneity issue and facilitating data integration. Although RDF representations can be interpreted as integrated knowledge networks, an additional set of standards will usually be necessary for meaningful interpretation of data from a biological perspective. Prominent resources that offer biological data on the Semantic Web include EBI RDF platform [83] (covers such important databases as UniProt [84], Reactome [85], Ensembl [86], ChEMBL [87] and Expression Atlas [88]), Bio2RDF (aims to mirror in RDF most of the prominent biological databases) [89] and OpenPhacts (uses Semantic Web technologies to represent the chemogenomics space) [90]. Some noteworthy ontologies used in combination with these resources include Experimental Factor Ontology (EFO) [91], SBO [92], Translational Medicine Ontology (TMO) [93], SNOMED (medical terms and concepts) [94] and GO (functional annotation of genes and proteins) [56].

To illustrate the application of integrated knowledge networks with our example 22-gene severe asthma signature, we have used the Neo4j-based Hetionet v1.0 resource [95] that integrates a wide variety of biomedical information. First, a query was prepared to visualize all of the diseases linked to the signature genes (Figure 4). According to these results, only one gene has already been linked to asthma, though three other genes have been linked to other lung diseases (COPD and idiopathic pulmonary fibrosis). Next, the connections between known asthma-related pathways, the severe asthma differential expression signature and drugs targeting those common pathways were retrieved. In terms of the query-specific degree of drug nodes (i.e. the number of links to returned proteins), the top three drugs were dexamethasone, betamethasone and niclosamide. The former two are corticosteroids already used for treating asthma [96], and the last one was recently proposed as a highly promising treatment because of its strong TMEM16A antagonism, found to improve lung function in human and mouse models [97].
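The ranking by query-specific degree can be sketched as follows. This is a minimal illustration under stated assumptions, not the query used to produce Figure 4: the drug and protein names are placeholders, the link list stands in for rows returned by a graph query, and `rank_drugs_by_degree` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (drug, protein) links standing in for the rows returned by a
# graph query; none of these names or counts come from Hetionet.
RETURNED_LINKS = [
    ("DRUG_1", "PROTEIN_A"),
    ("DRUG_1", "PROTEIN_B"),
    ("DRUG_1", "PROTEIN_C"),
    ("DRUG_2", "PROTEIN_A"),
    ("DRUG_2", "PROTEIN_B"),
    ("DRUG_3", "PROTEIN_C"),
    ("DRUG_1", "PROTEIN_A"),  # duplicate row, counted only once
]


def rank_drugs_by_degree(links, top_n=3):
    """Rank drugs by the number of distinct returned proteins they link to."""
    unique_links = set(links)
    degree = Counter(drug for drug, _protein in unique_links)
    # Sort by descending degree, then by name so that ties are deterministic.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:top_n]


ranking = rank_drugs_by_degree(RETURNED_LINKS)
for drug, degree in ranking:
    print(drug, degree)
```

The degree here is specific to the query: it counts only links to proteins that the query returned, not every link the drug has in the full network.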

Approaches for extraction of contextual knowledge

Ultimately, all types of models in the biomedical domain originate from data collected during relevant experiments, and such data typically require different levels of both preprocessing and interpretation to yield such models. At present, relatively few types of raw experimental data can be straightforwardly converted into these representations without requiring human interpretation. Some possible examples are the construction of co-expression networks [98, 99] from transcriptomic data and of sequence homology graphs [100, 101]. However, manual interpretation and publishing of inferred models as free text in scientific papers currently remains the primary means of disseminating this knowledge, and all major biomedical databases discussed in this review expend considerable curation efforts to collect, verify and consolidate it in standardized formats from source literature.

Given the volume and ever-increasing rate of scientific publishing, automated means of facilitating this process are

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using the Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases; (B) query to explore the connections between the niclosamide drug, asthma and the differential expression signature, linked via relevant proteins and pathways.

becoming increasingly essential. This is achieved through applications of information retrieval (IR) and TM methods, which can handle tasks like identification of relevant documents, sections and sentences, recognition of concepts and relationships between them or even generation of novel hypotheses [102]. Introduction of computational text analysis methods into the curation process has been reported to improve its efficiency by between 2- and 10-fold [103]. Especially in the biomedical community, development of TM approaches has greatly benefited from competitive evaluation challenges [104]. Some prominent examples of such initiatives include BioCreative, TREC Genomics/Medical/CDS, BioNLP-ST, i2b2 and ShARe/CLEF eHealth. These regular competitions provide a valuable benchmark for the current state of the art for different tasks and serve to highlight the most important problems, focus community efforts and establish necessary annotated textual resources.

As well as assisting with curation, both TM and IR can be applied in a high-throughput and automated manner; however, the utility of these applications is still limited by accuracy and generalizability [103]. In the simplest case, various statistical co-occurrence methods can be applied to build networks linking biological concepts frequently found together in different papers. Some examples of resources offering such literature-derived data are miRCancer, which compiled data about co-occurrence of microRNA mentions with different cancer types [105], and the DISEASES resource, which provides similar information for gene mentions with particular diseases [106].
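A co-occurrence network of the simplest kind can be sketched as below. This is a bare counting scheme, not the method used by miRCancer or DISEASES; it assumes that concepts have already been recognized in each document, and the concept names, the `min_count` threshold and the function name `cooccurrence_network` are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of concepts already recognized in each abstract.
DOCUMENT_CONCEPTS = [
    {"GENE_A", "DISEASE_X"},
    {"GENE_A", "DISEASE_X", "GENE_B"},
    {"GENE_B", "DISEASE_Y"},
]


def cooccurrence_network(documents, min_count=2):
    """Count how often two concepts share a document; keep frequent pairs."""
    counts = Counter()
    for concepts in documents:
        # Sorting gives every pair a single canonical orientation.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


network = cooccurrence_network(DOCUMENT_CONCEPTS)
print(network)
```

Raw counts favour concepts that are mentioned often, so practical systems usually replace the fixed threshold with a statistical score that compares observed and expected co-occurrence.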

Extraction of information from text is contingent on unambiguous identification of entities and relationship types, from which model statements can be constructed. Therefore, TM technologies have a natural synergy with ontology-based data modelling and, to a lesser extent, Semantic Web formats like RDF [102, 107]. The ontologies are used both as controlled vocabularies to resolve alternative naming conventions and as sources of common identifiers for recovered concepts and relationships. Some examples of specialized relationship types that can be recovered include protein–protein [108], drug–protein [109] and drug–drug [110, 111] interactions and associations of genes with mutations [112, 113] and phenotypes [114–116]. Data from IR and TM are increasingly incorporated alongside manually reviewed information by major databases; for example, PPIs extracted from literature by fully automated approaches are offered by STRING [13, 117].
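The use of an ontology as a controlled vocabulary can be sketched as a synonym lookup. This is a simplified illustration that assumes exact, case-insensitive matching; the identifiers with the `EX:` prefix are invented for the example and do not belong to any real ontology, and `build_lookup` and `normalise` are hypothetical helper names.

```python
# Hypothetical controlled vocabulary: identifier -> preferred label and
# synonyms. The "EX:" identifiers are invented for this example.
VOCABULARY = {
    "EX:0001": {
        "label": "interleukin-1 beta",
        "synonyms": ["IL-1b", "IL1B", "IL-1 beta"],
    },
    "EX:0002": {"label": "asthma", "synonyms": ["bronchial asthma"]},
}


def build_lookup(vocabulary):
    """Map every label and synonym (case-folded) to its shared identifier."""
    lookup = {}
    for identifier, entry in vocabulary.items():
        for name in [entry["label"], *entry["synonyms"]]:
            lookup[name.casefold()] = identifier
    return lookup


def normalise(mention, lookup):
    """Return the identifier for a text mention, or None if it is unknown."""
    return lookup.get(mention.strip().casefold())


lookup = build_lookup(VOCABULARY)
print(normalise("IL1B", lookup), normalise("Bronchial Asthma", lookup))
```

Once alternative names resolve to one identifier, statements extracted from different papers can be merged on that identifier instead of on the surface text.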

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way this has been addressed is through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].
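The dependency chain described above (named entity recognition, then relationship identification, then hypothesis generation) can be sketched as three small functions whose outputs feed one another. Everything here is an assumption made for illustration: the dictionary matcher stands in for a real recogniser, the single trigger verb stands in for a real relation extractor, and the entity names are invented placeholders.

```python
import re


def recognise_entities(sentence, dictionary):
    """Stage 1 (stand-in for NER): whole-word dictionary matching."""
    return [
        (name, entity_type)
        for name, entity_type in dictionary.items()
        if re.search(r"\b" + re.escape(name) + r"\b", sentence)
    ]


def extract_relations(sentence, entities, trigger="inhibits"):
    """Stage 2: emit (subject, trigger, object) when the trigger verb
    appears between two recognised entities."""
    relations = []
    for subj, _ in entities:
        for obj, _ in entities:
            if subj == obj:
                continue
            pattern = (r"\b" + re.escape(subj) + r"\b.*\b" + re.escape(trigger)
                       + r"\b.*\b" + re.escape(obj) + r"\b")
            if re.search(pattern, sentence):
                relations.append((subj, trigger, obj))
    return relations


def propose_hypotheses(relations):
    """Stage 3: chain A->B and B->C into a candidate indirect link (A, C)."""
    return [
        (a, d)
        for a, _, b in relations
        for c, _, d in relations
        if b == c and a != d
    ]


# Invented placeholder sentences and entities.
dictionary = {"drugX": "drug", "proteinY": "protein", "processZ": "process"}
sentences = ["drugX inhibits proteinY in vitro.", "proteinY inhibits processZ."]

relations = []
for s in sentences:
    relations += extract_relations(s, recognise_entities(s, dictionary))
hypotheses = propose_hypotheses(relations)
```

Because each stage consumes only the previous stage's output, errors propagate downstream, which is one reason common interchange formats between tools matter.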

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge for the first time included a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and achieved the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
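For readers unfamiliar with the metric, the F-measure is the harmonic mean of precision and recall. The sketch below computes it over sets of extracted statements, assuming exact string matching against a gold standard; the official challenge scoring may credit partial matches differently, and the statements shown are invented placeholders.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for exact-match statement extraction."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented placeholder statements in a BEL-like notation.
gold = {
    "p(GENE_A) increases p(GENE_B)",
    "p(GENE_B) decreases p(GENE_C)",
    "a(CHEM_X) decreases p(GENE_A)",
    "p(GENE_C) increases bp(PROCESS_Z)",
}
predicted = {
    "p(GENE_A) increases p(GENE_B)",   # correct
    "p(GENE_A) decreases p(GENE_C)",   # not in the gold standard
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this toy case, one of two predictions is correct and one of four gold statements is recovered, so precision is 0.5, recall is 0.25 and the F-measure is one third. The example shows how a low recall pulls the combined score down even when half of the predictions are right.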

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question, therefore, becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). It is obvious that, although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former would go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly more complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. The future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on the one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and, ultimately, clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking, eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.

17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B, and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting. Boston, USA: Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.

99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.

Figure 4. Query expressed in natural language and resulting output produced for the differential expression gene signature using Hetionet graph database. The following entities are shown: proteins (blue), pathways (yellow), diseases (brown) and drugs (red). (A) Query for associated diseases; (B) query to explore the connections between niclosamide drug, asthma and differential expression signature linked via relevant proteins and pathways.

10 | Saqi et al

Dow

nloaded from httpsacadem

icoupcombibadvance-article-abstractdoi101093bibbby0254978142 by guest on 08 January 2019

becoming increasingly essential This is achieved through appli-cations of information retrieval (IR) and TM methods which canhandle tasks like identification of relevant documents sectionsand sentences recognition of concepts and relationships be-tween them or even generation of novel hypotheses [102]Introduction of computational text analysis methods into thecuration process has been reported to improve its efficiency bybetween 2- and 10-fold [103] Especially in the biomedical com-munity development of TM approaches has greatly benefitedfrom competitive evaluation challenges [104] Some prominentexamples of such initiatives include BioCreative TRECGenomicsMedicalCDS BioNLP-ST i2b2 and ShAReCLEFeHealth These regular competitions provide a valuable bench-mark for current state of the art for different tasks and serve tohighlight the most important problems focus community ef-forts and establish necessary annotated textual resources

As well as assisting with curation both TM and IR can beapplied in a high-throughput and automated manner howeverthe utility of these applications is still limited by accuracy andgeneralizability [103] In its simplest form various statistical co-occurrence methods can be applied to build networks linkingbiological concepts frequently found together in differentpapers Some examples of resources offering such literature-derived data are miRCancer that compiled data about co-occurrence of microRNA mentions with different cancer types[105] and DISEASES resource which provides similar informa-tion for gene mentions with particular diseases [106]

Extraction of information from text is contingent on unam-biguous identification of entities and relationship types fromwhich model statements can be constructed Therefore TMtechnologies have a natural synergy with ontology-based datamodelling and to a lesser extent Semantic Web formats likeRDF [102 107] The ontologies are used both as controlled vocab-ularies to resolve alternative naming conventions and as sour-ces of common identifiers for recovered concepts andrelationships Some examples of specialized relationship typesthat can be recovered include proteinndashprotein [108] drugndashpro-tein [109] and drugndashdrug [110 111] interactions and associ-ations of genes with mutations [112 113] and phenotypes [114ndash116] Data from IR and TM are increasingly incorporated along-side manually reviewed information by major databases for ex-ample PPIs extracted from literature by fully automatedapproaches are offered by STRING [13 117]

To handle more advanced tasks, it is usually necessary to combine specialized tools into TM workflows, e.g. a hypothesis generation tool might rely on the output of a relationship identification tool, which in turn relies on named entity recognition. Inevitably, this raises questions of interoperability of different analysis methods. One way to address this has been through increasing use of Web services to link up different tools [118]. Additionally, specialized standards for biomedical textual data interchange were developed, for example, the XML-based BioC format [119].
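The dependency chain described above (hypothesis generation relying on relationship identification, which relies on named entity recognition) can be sketched as three toy stages passed through one workflow function. Every stage here is a deliberately naive stand-in written for illustration: the dictionary lookup, the single trigger word `activates` and the chaining rule are assumptions, not the behaviour of any published tool.

```python
import re


def recognise_entities(sentence, lexicon):
    """Stage 1 (toy NER): return lexicon terms found in the sentence, in order of appearance."""
    found = []
    for term in lexicon:
        match = re.search(r"\b" + re.escape(term) + r"\b", sentence)
        if match:
            found.append((match.start(), term))
    return [term for _, term in sorted(found)]


def extract_relations(sentence, entities, trigger="activates"):
    """Stage 2 (toy relation extraction): link the entity before the trigger to the one after it."""
    position = sentence.find(trigger)
    if position < 0:
        return []
    before = [e for e in entities if sentence.find(e) < position]
    after = [e for e in entities if sentence.find(e) > position]
    if before and after:
        return [(before[-1], trigger, after[0])]
    return []


def propose_hypotheses(relations):
    """Stage 3 (toy hypothesis generation): chain A->B and B->C into a candidate A->C link."""
    return [(a, "may indirectly affect", d)
            for a, _, b in relations
            for c, _, d in relations
            if b == c and a != d]


def run_workflow(sentences, lexicon):
    """Feed each stage's output into the next, as in a TM workflow."""
    relations = []
    for sentence in sentences:
        entities = recognise_entities(sentence, lexicon)
        relations.extend(extract_relations(sentence, entities))
    return relations, propose_hypotheses(relations)


# Invented sentences and gene names, used only to exercise the pipeline.
sentences = ["GeneA activates GeneB in vitro.", "GeneB activates GeneC."]
relations, hypotheses = run_workflow(sentences, ["GeneA", "GeneB", "GeneC"])
```

The point of the sketch is the interface between stages: each one consumes plain data structures produced by the previous one, which is the kind of contract that Web services and interchange formats are meant to standardize across independently developed tools.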

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015, the BioCreative V challenge for the first time included a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and achieved the highest F-measure, 20%, for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
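For readers unfamiliar with the metric quoted above, the F-measure combines precision P (the fraction of extracted statements that are correct) and recall R (the fraction of gold-standard statements that were recovered) as F = (1 + β²)PR / (β²P + R), with β = 1 giving the usual harmonic mean. The sketch below computes it under the simplifying assumption that a predicted statement counts as correct only if it matches a gold statement exactly; the challenge's own scoring scheme is not reproduced here, and the statement strings are invented placeholders.

```python
def f_measure(predicted, gold, beta=1.0):
    """F-measure over exact-match statements; returns 0.0 when nothing is correct."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented statements in a BEL-like style, used only as opaque strings.
gold = {
    "p(EX:A) increases p(EX:B)",
    "p(EX:B) decreases p(EX:C)",
    "p(EX:C) increases bp(EX:D)",
    "p(EX:A) decreases bp(EX:D)",
}
predicted = {
    "p(EX:A) increases p(EX:B)",   # correct
    "p(EX:B) increases p(EX:C)",   # wrong relation type
}
score = f_measure(predicted, gold)
```

Here precision is 1/2 and recall is 1/4, so the score is 1/3. The example also shows why full statement extraction scores low: a statement with the right entities but the wrong relation type earns no credit under exact matching.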

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question, therefore, becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.

As outlined by Cohen in [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). Although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to effectively deal with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly more complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. Future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections like pathway maps or global interactome networks that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and, ultimately, clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.
2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.
3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.
4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.
5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.
6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.
7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.
8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.
9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.
10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.
11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.
12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.
13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.
14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.
15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.
16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.


17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.
18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.
19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.
20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.
21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.
22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.
23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.
24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.
25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.
26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.
27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.
28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.
29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.
30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.
31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.
32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.
33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.
34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.
35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.
36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.
37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.
38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.
39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.
40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.
41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.
42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.
43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.
44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.
45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.
46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.
47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.
48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.
49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.
50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.
51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.
52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.
53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.
54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.
55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.
56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.
57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.
58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.


59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.
60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.
61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.
62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.
63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.
64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.
65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.
66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.
67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.
68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology: a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.
69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.
70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.
71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.
72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.
73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.
74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.
75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.
76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.
77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.
78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.
79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.
80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C. (Online) http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.
81. Berners-Lee T. Linked data. W3C. (Online) https://www.w3.org/DesignIssues/LinkedData.html, 2006.
82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C. (Online) http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.
83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.
84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.
85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.
86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.
87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.
88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.
89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.
90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.
91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.
92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.
93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting. Boston, USA: Bio-ontologies, 2010, 120–3.
94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.
95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.
96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.
97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.
98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.


99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.
100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.
101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.
102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.
103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.
104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.
105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.
106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.
107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.
108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.
109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.
110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.
111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.
112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.
113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.
114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.
115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.
116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.
117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.
118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.
119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.
120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.
121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.
122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.
123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.
124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.
125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.
126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.
127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.
128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.
129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.


Page 11: Navigating the disease landscape: knowledge ...persagen.com/files/misc/saqi2018navigating.pdf · context of molecular signatures, in particular three main approaches, namely, pathway

becoming increasingly essential This is achieved through appli-cations of information retrieval (IR) and TM methods which canhandle tasks like identification of relevant documents sectionsand sentences recognition of concepts and relationships be-tween them or even generation of novel hypotheses [102]Introduction of computational text analysis methods into thecuration process has been reported to improve its efficiency bybetween 2- and 10-fold [103] Especially in the biomedical com-munity development of TM approaches has greatly benefitedfrom competitive evaluation challenges [104] Some prominentexamples of such initiatives include BioCreative TRECGenomicsMedicalCDS BioNLP-ST i2b2 and ShAReCLEFeHealth These regular competitions provide a valuable bench-mark for current state of the art for different tasks and serve tohighlight the most important problems focus community ef-forts and establish necessary annotated textual resources

As well as assisting with curation both TM and IR can beapplied in a high-throughput and automated manner howeverthe utility of these applications is still limited by accuracy andgeneralizability [103] In its simplest form various statistical co-occurrence methods can be applied to build networks linkingbiological concepts frequently found together in differentpapers Some examples of resources offering such literature-derived data are miRCancer that compiled data about co-occurrence of microRNA mentions with different cancer types[105] and DISEASES resource which provides similar informa-tion for gene mentions with particular diseases [106]

Extraction of information from text is contingent on unam-biguous identification of entities and relationship types fromwhich model statements can be constructed Therefore TMtechnologies have a natural synergy with ontology-based datamodelling and to a lesser extent Semantic Web formats likeRDF [102 107] The ontologies are used both as controlled vocab-ularies to resolve alternative naming conventions and as sour-ces of common identifiers for recovered concepts andrelationships Some examples of specialized relationship typesthat can be recovered include proteinndashprotein [108] drugndashpro-tein [109] and drugndashdrug [110 111] interactions and associ-ations of genes with mutations [112 113] and phenotypes [114ndash116] Data from IR and TM are increasingly incorporated along-side manually reviewed information by major databases for ex-ample PPIs extracted from literature by fully automatedapproaches are offered by STRING [13 117]

To handle more advanced tasks it is usually necessary tocombine specialized tools into TM workflows eg a hypothesisgeneration tool might rely on output of a relationships identifi-cation tool which in turn relies on named entity recognitionInevitably this raises questions of interoperability of differentanalysis methods One way to address this is has been throughincreasing use of Web services to link up different tools [118]Additionally specialized standards for biomedical textual datainterchange were developed for example XML-based BioCformat [119]

Recent efforts have been exploring the possibility of extracting more complex statements from text [120], and in 2015 the BioCreative V challenge included, for the first time, a task of automated extraction of OpenBEL statements from text [121]. During the competition, the best result was achieved by the BELMiner framework [122], which used a rule-based approach for extraction of BEL statements from text and obtained the highest F-measure of 20% for this task. After the challenge, the BELSmile system was developed, which reached an even better score of 28% on the same dataset [64]. BELSmile implemented a semantic-role-labelling approach, which relies on recognition of verbs (predicates) and assignment of roles to associated subjects/objects relative to that verb. Notably, designs of both frameworks integrated multiple specialized IR and TM tools to handle specific low-level subtasks.
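For readers unfamiliar with the metric, the F-measure combines precision and recall into a single score. The sketch below computes it over sets of extracted statements; exact string matching and the BEL-like example strings are assumptions made for illustration, and the challenge's actual matching criteria are not described here.

```python
def f_measure(predicted, gold, beta=1.0):
    """Return the F-measure of predicted statements against a gold standard.

    Statements are compared by exact equality (an assumption of this sketch).
    beta=1.0 gives the balanced F1 score.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical BEL-like statements, used only to exercise the function.
gold = {
    "p(HGNC:A) increases p(HGNC:B)",
    "p(HGNC:B) decreases p(HGNC:C)",
    "p(HGNC:C) increases p(HGNC:D)",
    "p(HGNC:D) decreases p(HGNC:A)",
}
predicted = {
    "p(HGNC:A) increases p(HGNC:B)",
    "p(HGNC:A) increases p(HGNC:C)",
}
score = f_measure(predicted, gold)  # precision 0.5, recall 0.25
```

A low score can therefore reflect missed statements, spurious statements or both, which is why full-statement extraction is a much harder target than entity recognition alone.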

It is clear that, at present, a substantial proportion of valuable biomedical knowledge is ‘trapped’ in scientific text, in a representation not readily suitable for automated computer-driven interpretation. Appropriate data formats and ontologies are an essential prerequisite for enabling sophisticated TM methods to begin addressing this challenge. The most recent developments indicate that current TM approaches are becoming powerful enough to leverage such rich representations and are already useful to curators working on complex models of human diseases. To explore further developments in this important area, we direct the reader to the following recent reviews [25, 102, 123].

Outlook

Many translational and systems medicine projects involve collection of molecular and clinical data with a view to the identification of disease subtypes. The data are typically warehoused and analysed, for example, to extract molecular fingerprints that characterize the subtypes, and this data interpretation step of the translational informatics pipeline remains highly challenging. Moving from diagnostic molecular patterns to being able to suggest individualized healthcare pathways is complex. Data integration links the molecular patterns to background knowledge, which can then be reviewed, explored and analysed to develop a more detailed understanding of the underlying biology that distinguishes disease subtypes.

To support this process, several standards have emerged to capture, integrate and facilitate analysis of these data. This proliferation of standards suggests that different formalisms will continue to be necessary to model different aspects of biomedical data and that we are unlikely to converge on one ultimate modelling solution in the foreseeable future. The open question therefore becomes how to effectively manage the interoperability between these different representations. In this respect, several common trends and best practices are beginning to emerge. In particular, it is now evident that development of common controlled vocabularies and specialized domain ontologies is essential for effective management of increasingly complex biomedical data. Use of ontologies facilitates identification of equivalent entities and concepts across different modelling formats and facilitates conversion between them, allowing efficient data integration.
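How a controlled vocabulary supports this kind of integration can be shown with a small sketch. The synonym table and the `GENE:` identifiers below are invented placeholders, not entries from any real ontology, and the two edge lists stand in for data exported from two different modelling formats.

```python
# Hypothetical synonym table; in practice this mapping would be derived
# from an ontology or controlled vocabulary rather than hard-coded.
SYNONYM_TO_ID = {
    "il13": "GENE:0001",
    "il-13": "GENE:0001",
    "interleukin 13": "GENE:0001",
    "postn": "GENE:0002",
    "periostin": "GENE:0002",
}


def normalise(label):
    """Map a free-text label to a common identifier, or None if unknown."""
    key = " ".join(label.lower().split())
    return SYNONYM_TO_ID.get(key)


def merge_edges(*edge_lists):
    """Merge edges from several sources after identifier normalisation."""
    merged = set()
    for edges in edge_lists:
        for source, target in edges:
            s, t = normalise(source), normalise(target)
            # Edges with an unresolved endpoint are skipped, not guessed.
            if s is not None and t is not None:
                merged.add((s, t))
    return merged


pathway_edges = [("IL-13", "Periostin")]
network_edges = [("Interleukin 13", "POSTN"), ("IL13", "unknown protein")]
integrated = merge_edges(pathway_edges, network_edges)
```

Once both sources are expressed with the same identifiers, the two differently labelled edges collapse into a single statement, while the unresolved one is left out rather than merged incorrectly.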

As outlined by Cohen [6], modern biomedical data modelling formalisms can be categorized into ‘curation’ and ‘mechanistic’ types. The aim of the former is primarily to describe experimental results (e.g. Nanopublications, BioPAX), including observed associations between molecular signatures and diseases. The latter specify evaluable models used to simulate living systems and generate predictions based on a set of inputs (e.g. SBML, Kappa). Although experimental observations are ultimately used to construct such models, the processing steps required to correctly generate the latter from the former go beyond a simple format change. We believe that automation of this process is a highly promising emergent strategy for development of fully mechanistic models of disease. Use of automated methods to support biomedical innovation is necessitated both by the need to deal effectively with ever-growing volumes of data and by the increasingly complex nature of biomedical research itself. The apparent diminishing returns in terms of novel drugs successfully taken to market relative to the money spent [124, 125], with particular challenges for cardiovascular [126] and neurodegenerative diseases [127, 128], mean that increasingly complex strategies for managing disease are needed to make progress [129]. Consequently, high-throughput computational approaches are becoming indispensable both for assisting with interpretation, management and mining of experimental data and for constructing and evaluating relevant clinical models. At present, cutting-edge efforts in this area involve creation of increasingly sophisticated automated methods for analysis of vast quantities of biomedical data, exemplified by initiatives like DARPA’s Big Mechanism, IBM Watson Health and Garuda biomedical discovery platforms. Future modelling formats are therefore likely to be influenced by the requirements of such projects and will increasingly incorporate features to support fully computer-driven information analysis.

In our view, accurate handling of the context of biological observations is critical for construction of correct mechanistic models of disease. Traditionally, information has been compiled into largely homogenized collections, like pathway maps or global interactome networks, that aim to show all possible processes of a certain type for a given organism. However, in practice, only a small subset of all these processes will be occurring in a real cell at a given time. Furthermore, such events may be transient (enzymatic reactions), have a stable outcome (protein complex formation) or be mutually exclusive (alternative protein-binding partners). An additional level of complexity is introduced by the imperfection of experimental techniques. For example, transcriptomics profiling is often done at a tissue level, which may mask important differences at the level of individual cells. As such, mechanistic models built from these data may be subject to considerable uncertainty and abstraction. The next generation of formats and integration solutions will therefore need to be more context aware. Future progress, on the one hand, requires better means of expressing this highly granular context of biological processes and, on the other, necessitates solutions to model vast volumes of data and knowledge in such a form. Once efficient and accurate approaches for these tasks are established, cutting-edge mathematical and machine learning methods could be leveraged to their full potential to give new insight into disease mechanisms and suggest improved avenues for therapeutic intervention.

Key Points

• Precision medicine relies on identification of disease subtypes at a molecular level and linking them to diagnostic models to determine optimal treatment strategies.

• In practice, establishing a contextual link between molecular signatures and disease processes requires integration of qualitatively diverse types of relevant data and knowledge.

• At present, several alternative philosophies guide the process of transformation of experimental data into mechanistically informative and, ultimately, clinically actionable insights.

• The spectrum of possible approaches ranges from purely associative knowledge link discovery to fully quantitative mathematical models.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking, eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union’s Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies’ in kind contributions. The work was also partially supported by a Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA’s big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.

17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson’s disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer’s disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson’s disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O’Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer’s disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer’s disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology: a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting. Boston, USA: Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.

99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer’s disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.

Navigating the disease landscape | 15

Dow

nloaded from httpsacadem

icoupcombibadvance-article-abstractdoi101093bibbby0254978142 by guest on 08 January 2019

Page 12: Navigating the disease landscape: knowledge ...persagen.com/files/misc/saqi2018navigating.pdf · context of molecular signatures, in particular three main approaches, namely, pathway

[124 125] with particular challenges for cardiovascular [126] andneurodegenerative diseases [127 128] mean that increasinglymore complex strategies for managing disease are needed tomake progress [129] Consequently high-throughput computa-tional approaches are becoming indispensable both for assistingwith interpretation management and mining of experimentaldata as well as for constructing and evaluating relevant clinicalmodels At present cutting-edge efforts in this area involve cre-ation of increasingly sophisticated automated methods for ana-lysis of vast quantities biomedical data exemplified by initiativeslike DARPArsquos Big Mechanism IBM Watson Health and Garudabiomedical discovery platforms The future modelling formatsare therefore likely to be influenced by the requirements of suchprojects and will increasingly incorporate features to supportfully computer-driven information analysis

In our view accurate handling of the context of biological ob-servations is critical for construction of correct mechanistic mod-els of disease Traditionally information has been compiled intolargely homogenized collections like pathway maps or globalinteractome networks that aim to show all possible processes of acertain type for a given organism However in practice only asmall subset of all these processes will be occurring in a real cellat a given time Furthermore such events may be transient (en-zymatic reactions) have stable outcome (protein complex forma-tion) or be mutually exclusive (alternative protein-bindingpartners) An additional level of complexity is introduced fromthe imperfection of experimental techniques For example tran-scriptomics profiling is often done at a tissue level which maymask important differences at a level of individual cells As suchmechanistic models built from these data may be subject to con-siderable uncertainty and abstraction The next generation of for-mats and integration solutions will therefore need to be morecontext aware Future progress on one hand requires bettermeans of expressing this highly granular context of biologicalprocesses and on the other necessitates solutions to model vastvolumes of data and knowledge in such form Once efficient andaccurate approaches for these tasks are established cutting-edgemathematical and machine learning methods could be leveragedto their full potential to give new insight into disease mechan-isms and suggest improved avenues for therapeutic intervention

Key Points

bull Precision medicine relies on identification of disease sub-types at a molecular level and linking them to diagnosticmodels to determine optimal treatment strategies

bull In practice establishing a contextual link between molecularsignatures and disease processes requires integration ofqualitatively diverse types of relevant data and knowledge

bull At present several alternative philosophies guide theprocess of transformation of experimental data intomechanistically informative and ultimately clinicallyactionable insights

bull The spectrum of possible approaches ranges frompurely associative knowledge link discovery to fullyquantitative mathematical models

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Funding

The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking eTRIKS Project (Grant Number IMI 115446), resources of which are composed of a financial contribution from the European Union's Seventh Framework Programme (FP7/2007-2013) and European Federation of Pharmaceutical Industries and Associations (EFPIA) companies' in kind contributions. The work was also partially supported by Core Research for Evolutional Science and Technology (CREST) Grant from the Japan Science and Technology Agency (Grant Number JPMJCR1412) and Japan Society for the Promotion of Science KAKENHI (Grant Numbers 17H06307 and 17H06299).

References

1. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat 2012;33(5):777–80.

2. Glaab E. Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification. Brief Bioinform 2016;17(3):440–52.

3. Woodruff PG, Boushey HA, Dolganov GM, et al. Genome-wide profiling identifies epithelial cell genes associated with asthma and with treatment response to corticosteroids. Proc Natl Acad Sci USA 2007;104(40):15858–63.

4. Kaneko Y, Yatagai Y, Yamada H, et al. The search for common pathways underlying asthma and COPD. Int J Chron Obstruct Pulmon Dis 2013;8:65–78.

5. Hofmann-Apitius M, Ball G, Gebel S, et al. Bioinformatics mining and modeling methods for the identification of disease mechanisms in neurodegenerative disorders. Int J Mol Sci 2015;16(12):29179–206.

6. Cohen PR. DARPA's big mechanism program. Phys Biol 2015;12(4):045008.

7. Barabasi AL. Network Science. Cambridge, United Kingdom: Cambridge University Press, 2016.

8. Le Novere N. Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 2015;16(3):146–58.

9. Boccaletti S, Bianconi G, Criado R, et al. The structure and dynamics of multilayer networks. Phys Rep 2014;544(1):1–122.

10. Vidal M, Cusick ME, Barabasi AL. Interactome networks and human disease. Cell 2011;144(6):986–98.

11. Piñero J, Queralt-Rosinach N, Bravo A, et al. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015;2015:bav028.

12. Voraphani N, Gladwin M, Contreras A, et al. An airway epithelial iNOS–DUOX2–thyroid peroxidase metabolome drives Th1/Th2 nitrative stress in human severe asthma. Mucosal Immunol 2014;7(5):1175–85.

13. Franceschini A, Szklarczyk D, Frankild S, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 2012;41(D1):D808–15.

14. Montojo J, Zuberi K, Rodriguez H, et al. GeneMANIA: fast gene network construction and function prediction for Cytoscape. F1000Res 2014;3:153.

15. Goh KI, Cusick ME, Valle D, et al. The human disease network. Proc Natl Acad Sci USA 2007;104(21):8685–90.

16. Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome. Science 2015;347(6224):1257601.

17. Gustafsson M, Nestor CE, Zhang H, et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med 2014;6(10):82.

18. Guo L, Du Y, Wang J. Network analysis reveals a stress-affected common gene module among seven stress-related diseases/systems which provides potential targets for mechanism research. Sci Rep 2015;5(1):12939.

19. Novershtern N, Itzhaki Z, Manor O, et al. A functional and regulatory map of asthma. Am J Respir Cell Mol Biol 2008;38(3):324–36.

20. Sharma A, Menche J, Huang CC, et al. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum Mol Genet 2015;24:3005–20.

21. Derous D, Kelder T, Schothorst EM, et al. Network-based integration of molecular and physiological data elucidates regulatory mechanisms underlying adaptation to high-fat diet. Genes Nutr 2015;10(4):470.

22. Kelder T, Summer G, Caspers M, et al. White adipose tissue reference network: a knowledge resource for exploring health-relevant relations. Genes Nutr 2015;10(1):439.

23. Zitnik M, Janjic V, Larminie C, et al. Discovering disease-disease associations by fusing systems-level molecular data. Sci Rep 2013;3:3202.

24. Gligorijevic V, Przulj N. Methods for biological data integration: perspectives and challenges. J R Soc Interface 2015;12(112):20150571.

25. Huan T, Meng Q, Saleh MA, et al. Integrative network analysis reveals molecular mechanisms of blood pressure regulation. Mol Syst Biol 2015;11(4):799.

26. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9(1):559.

27. Ko Y, Cho M, Lee JS, et al. Identification of disease comorbidity through hidden molecular mechanisms. Sci Rep 2016;6(1):39433.

28. Sun K, Buchan N, Larminie C, et al. The integrated disease network. Integr Biol 2014;6(11):1069–79.

29. Park SJ, Lee KS, Kim SR, et al. AMPK activation reduces vascular permeability and airway inflammation by regulating HIF/VEGFA pathway in a murine model of toluene diisocyanate-induced asthma. Inflamm Res 2012;61(10):1069–83.

30. Ohno I, Nitta Y, Yamauchi K, et al. Transforming growth factor beta 1 (TGF beta 1) gene expression by eosinophils in asthmatic airway inflammation. Am J Respir Cell Mol Biol 1996;15(3):404–9.

31. Barnes PJ. The cytokine network in asthma and chronic obstructive pulmonary disease. J Clin Invest 2008;118(11):3546–56.

32. Martin RJ. Nocturnal asthma: circadian rhythms and therapeutic interventions. Am Rev Respir Dis 1993;147(6 Pt 2):S25–8.

33. Szczepankiewicz A, Bręborowicz A, Skibinska M, et al. Association analysis of tyrosine kinase FYN gene polymorphisms in asthmatic children. Int Arch Allergy Immunol 2008;145(1):43–7.

34. Padron-Morales J, Sanz C, Davila I, et al. Polymorphisms of the IL12B, IL1B and TNFA genes and susceptibility to asthma. J Investig Allergol Clin Immunol 2013;23(7):487–94.

35. Xie S, Sukkar MB, Issa R, et al. Mechanisms of induction of airway smooth muscle hyperplasia by transforming growth factor-β. Am J Physiol Lung Cell Mol Physiol 2007;293(1):L245–53.

36. Cao M, Zhang H, Park J, et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS One 2013;8(10):e76339.

37. Kwak YG, Song CH, Ho KY, et al. Involvement of PTEN in airway hyperresponsiveness and inflammation in bronchial asthma. J Clin Invest 2003;111(7):1083–92.

38. Wang J, Li F, Yang M, et al. FIZZ1 promotes airway remodeling through the PI3K/Akt signaling pathway in asthma. Exp Ther Med 2014;7(5):1265–70.

39. Antony P, Diederich NJ, Kruger R, et al. The hallmarks of Parkinson's disease. FEBS J 2013;280(23):5981–93.

40. Mizuno S, Iijima R, Ogishima S, et al. AlzPathway: a comprehensive map of signaling pathways of Alzheimer's disease. BMC Syst Biol 2012;6(1):52.

41. Le Novere N, Hucka M, Mi H, et al. The systems biology graphical notation. Nat Biotechnol 2009;27:735–41.

42. Fujita KA, Ostaszewski M, Matsuoka Y, et al. Integrating pathways of Parkinson's disease in a molecular interaction map. Mol Neurobiol 2014;49(1):88–102.

43. Satagopam V, Gu W, Eifes S, et al. Integration and visualization of translational medicine data for better understanding of human diseases. Big Data 2016;4(2):97–108.

44. Kuperstein I, Bonnet E, Nguyen H, et al. Atlas of cancer signalling network: a systems biology resource for integrative analysis of cancer data with Google Maps. Oncogenesis 2015;4(7):e160.

45. Kutmon M, Riutta A, Nunes N, et al. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res 2015;44:D488–94.

46. Pon A, Jewison T, Su Y, et al. Pathways with PathWhiz. Nucleic Acids Res 2015;43(W1):W552–9.

47. Paley S, O'Maille PE, Weaver D, et al. Pathway collages: personalized multi-pathway diagrams. BMC Bioinformatics 2016;17(1):529.

48. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19:524–31.

49. Demir E, Cary MP, Paley S, et al. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010;28:935–42.

50. Ruebenacker O. Systems biology pathway exchange (SBPAX). In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013, 2064–8.

51. Pratt D, Chen J, Welker D, et al. NDEx, the network data exchange. Cell Syst 2015;1(4):302–5.

52. Smoot ME, Ono K, Ruscheinski J, et al. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 2010;27:431–2.

53. Paull EO, Carlin DE, Niepel M, et al. Discovering causal pathways linking genomic events to transcriptional states using Tied Diffusion Through Interacting Events (TieDIE). Bioinformatics 2013;29(21):2757–64.

54. Mao XQ, Kawai M, Yamashita T, et al. Imbalance production between interleukin-1β (IL-1β) and IL-1 receptor antagonist (IL-1ra) in bronchial asthma. Biochem Biophys Res Commun 2000;276(2):607–12.

55. Slater T, Song D. Saved by the BEL: ringing in a common language for the life sciences. Drug Discov World Fall 2012;80:75–80.

56. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25(1):25–9.

57. Hastings J, de Matos P, Dekker A, et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res 2012;41:D456–63.

58. Catlett NL, Bargnesi AJ, Ungerer S, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics 2013;14(1):340.

59. Kodamullil AT, Younesi E, Naz M, et al. Computable cause-and-effect models of healthy and Alzheimer's disease states and their mechanistic differential analysis. Alzheimers Dement 2015;11(11):1329–39.

60. Malhotra A, Younesi E, Bagewadi S, et al. Linking hypothetical knowledge patterns to disease molecular signatures for biomarker discovery in Alzheimer's disease. Genome Med 2014;6(11):97.

61. Naz M, Kodamullil AT, Hofmann-Apitius M. Reasoning over genetic variance information in cause-and-effect models of neurodegenerative diseases. Brief Bioinform 2016;17(3):505–16.

62. Younesi E, Hofmann-Apitius M. From integrative disease modeling to predictive, preventive, personalized and participatory (P4) medicine. EPMA J 2013;4(1):23.

63. Li C, Liakata M, Rebholz-Schuhmann D. Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 2014;15(5):856–77.

64. Lai PT, Lo YY, Huang MS, et al. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text. Database 2016;2016:baw064.

65. Groth P, Gibson A, Velterop J. The anatomy of a nanopublication. Inf Serv Use 2010;30(1–2):51–6.

66. Kramer A, Green J, Pollard J Jr, et al. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 2014;30(4):523–30.

67. Cano I, Lluch-Ariet M, Gomez-Cabrero D, et al. Biomedical research in a digital health framework. J Transl Med 2014;12(Suppl 2):S10.

68. Maier D, Kalus W, Wolff M, et al. Knowledge management for systems biology a general and visually driven framework applied to translational medicine. BMC Syst Biol 2011;5(1):38.

69. Gomez-Cabrero D, Abugessaisa I, Maier D, et al. Data integration in the era of omics: current and future challenges. BMC Syst Biol 2014;8(Suppl 2):I1.

70. Cano I, Tenyi A, Schueller C, et al. The COPD knowledge base: enabling data analysis and computational simulation in translational COPD research. J Transl Med 2014;12(Suppl 2):S6.

71. Rappaport N, Nativ N, Stelzer G, et al. MalaCards: an integrated compendium for diseases and their annotation. Database 2013;2013(0):bat018.

72. Ben-Ari Fuchs S, Lieder I, Stelzer G, et al. GeneAnalytics: an integrative gene set analysis tool for next generation sequencing, RNAseq and microarray data. OMICS 2016;20(3):139–51.

73. Rebhan M, Chalifa-Caspi V, Prilusky J, et al. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 1998;14(8):656–64.

74. Partner J, Vukotic A, Watt N. Neo4j in Action. Valencia, Spain: Manning, 2015.

75. Lysenko A, Roznovat IA, Saqi M, et al. Representing and querying disease networks using graph databases. BioData Min 2016;9(1):23.

76. Pareja-Tobes P, Pareja-Tobes E, Manrique M, et al. Bio4J: an open source biological data integration platform. In: Proceedings of the IWBBIO. Copicentro Granada, Granada, 2013, 281.

77. Balaur I, Mazein A, Saqi M, et al. Recon2Neo4j: applying graph database technologies for managing comprehensive genome-scale networks. Bioinformatics 2016;33:1096–8.

78. Hoksza D, Jelínek J. Using Neo4j for mining protein graphs: a case study. In: Proceedings of the 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015. IEEE, Valencia, Spain, 2015, 230–4.

79. Berners-Lee T, Hendler J, Lassila O. The semantic web. Sci Am 2001;28(5):34–43.

80. Lassila O, Swick RR. Resource description framework (RDF) model and syntax specification. W3C (Online). http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, 1999.

81. Berners-Lee T. Linked data. W3C (Online). https://www.w3.org/DesignIssues/LinkedData.html, 2006.

82. McGuinness DL, Van Harmelen F. OWL web ontology language overview. W3C Recommendation. W3C (Online). http://www.w3.org/TR/2004/REC-owl-features-20040210, 2004.

83. Jupp S, Malone J, Bolleman J, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics 2014;30(9):1338–9.

84. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res 2008;36:D190–5.

85. Joshi-Tope G, Gillespie M, Vastrik I, et al. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005;33(Database issue):D428–32.

86. Yates A, Akanni W, Amode MR, et al. Ensembl 2016. Nucleic Acids Res 2016;44(D1):D710–16.

87. Gaulton A, Bellis LJ, Bento AP, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40(D1):D1100–7.

88. Kapushesky M, Emam I, Holloway E, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Suppl 1):D690–8.

89. Callahan A, Cruz-Toledo J, Ansell P, et al. Bio2RDF release 2: improved coverage, interoperability and provenance of life science linked data. In: Proceedings of the Extended Semantic Web Conference. Springer, Montpellier, France, 2013, 200–12.

90. Williams AJ, Harland L, Groth P, et al. Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today 2012;17(21–2):1188–98.

91. Malone J, Holloway E, Adamusiak T, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 2010;26(8):1112–18.

92. Juty N, le Novere N. Systems biology ontology. In: Encyclopedia of Systems Biology. New York: Springer-Verlag, 2013.

93. Dumontier M, Andersson B, Batchelor C, et al. The Translational Medicine Ontology: Driving personalized medicine by bridging the gap from bedside to bench. In: Proceedings of the 13th Annual Bio-Ontologies Meeting, Boston, USA. Bio-ontologies, 2010, 120–3.

94. Cote R, Rothwell DJ, Palotay JL, et al. The Systemised Nomenclature of Medicine: SNOMED International. Northfield, IL: College of American Pathologists, 1993.

95. Himmelstein DS, Lizee A, Hessler C, et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 2017;6:e26726.

96. Fiel SB, Vincken W. Systemic corticosteroid therapy for acute asthma exacerbations. J Asthma 2006;43(5):321–31.

97. Mohn D, Elliott R, Powers D, et al. The anthelminthic niclosamide and related compounds represent potent Tmem16a antagonists that fully relax mouse and human airway rings. Am J Respir Crit Care Med 2017;195:A7652.

98. Okamura Y, Aoki Y, Obayashi T, et al. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 2015;43(D1):D82–6.

99. Wang P, Qi H, Song S, et al. ImmuCo: a database of gene co-expression in immune cells. Nucleic Acids Res 2015;43(D1):D1133–9.

100. Altenhoff AM, Skunca N, Glover N, et al. The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res 2015;43(D1):D240–9.

101. Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016;44(D1):D286–93.

102. Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012;13(12):829–39.

103. Singhal A, Leaman R, Catlett N, et al. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges. Database 2016;2016:baw161.

104. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2016;17(1):132–44.

105. Xie B, Ding Q, Han H, et al. miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 2013;29(5):638–44.

106. Pletscher-Frankild S, Palleja A, Tsafou K, et al. DISEASES: text mining and data integration of disease–gene associations. Methods 2015;74:83–9.

107. Fluck J, Hofmann-Apitius M. Text mining for systems biology. Drug Discov Today 2014;19(2):140–4.

108. Ono T, Hishigaki H, Tanigami A, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics 2001;17(2):155–61.

109. Li J, Zhu X, Chen JY. Building disease-specific drug-protein connectivity maps from molecular interaction networks and PubMed abstracts. PLoS Comput Biol 2009;5(7):e1000450.

110. Percha B, Garten Y, Altman RB. Discovery and explanation of drug-drug interactions via text mining. In: Pacific Symposium on Biocomputing, Big Island, Hawaii. NIH Public Access, 2012, 410.

111. Tari L, Anwar S, Liang S, et al. Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism. Bioinformatics 2010;26(18):i547–53.

112. Caporaso JG, Baumgartner WA Jr, Randolph DA, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 2007;23(14):1862–5.

113. Wei CH, Harris BR, Kao HY, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013;29(11):1433–9.

114. Adamic LA, Wilkinson D, Huberman BA, et al. A literature based method for identifying gene-disease connections. In: Proceedings of the IEEE Computer Society on Bioinformatics Conference, 2002. IEEE, Stanford, USA, 2002, 109–17.

115. Van Driel MA, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome. Eur J Hum Genet 2006;14(5):535–42.

116. Korbel JO, Doerks T, Jensen LJ, et al. Systematic association of genes to phenotypes by genome and literature mining. PLoS Biol 2005;3(5):e134.

117. Szklarczyk D, Franceschini A, Wyder S, et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res 2015;43(D1):D447–52.

118. Wiegers TC, Davis AP, Mattingly CJ. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification. Database 2014;2014(0):bau050.

119. Comeau DC, Islamaj Dogan R, Ciccarese P, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013;2013:bat064.

120. Madan S, Hodapp S, Senger P, et al. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track. Database 2016;2016:baw136.

121. Fluck J, Madan S, Ansari S, et al. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). Database 2016;2016:baw113.

122. Ravikumar K, Rastegar-Mojarad M, Liu H. BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences. Database 2017;2017(1):baw156.

123. Fleuren WW, Alkema W. Application of text mining in the biomedical domain. Methods 2015;74:97–106.

124. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. J Health Econ 2016;47:20–33.

125. Hay M, Thomas DW, Craighead JL, et al. Clinical development success rates for investigational drugs. Nat Biotechnol 2014;32(1):40–51.

126. Fordyce CB, Roe MT, Ahmad T, et al. Cardiovascular drug development: is it dead or just hibernating? J Am Coll Cardiol 2015;65(15):1567–82.

127. Cummings JL, Morstorf T, Zhong K. Alzheimer's disease drug-development pipeline: few candidates, frequent failures. Alzheimers Res Ther 2014;6(4):37.

128. Kaitin KI. Deconstructing the drug development process: the new face of innovation. Clin Pharmacol Ther 2010;87(3):356–61.

129. Halappanavar S, Vogel U, Wallin H, et al. Promise and peril in nanomedicine: the challenges and needs for integrated systems biology approaches to define health risk. Wiley Interdiscip Rev Nanomed Nanobiotechnol 2018;10(1):e1465.

Navigating the disease landscape | 15

Dow

nloaded from httpsacadem

icoupcombibadvance-article-abstractdoi101093bibbby0254978142 by guest on 08 January 2019

Page 13: Navigating the disease landscape: knowledge ...persagen.com/files/misc/saqi2018navigating.pdf · context of molecular signatures, in particular three main approaches, namely, pathway

17 Gustafsson M Nestor CE Zhang H et al Modules networksand systems medicine for understanding disease and aidingdiagnosis Genome Med 20146(10)82

18 Guo L Du Y Wang J Network analysis reveals a stress-affected common gene module among seven stress-relateddiseasessystems which provides potential targets formechanism research Sci Rep 20155(1)12939

19 Novershtern N Itzhaki Z Manor O et al A functional andregulatory map of asthma Am J Respir Cell Mol Biol 200838(3)324ndash36

20 Sharma A Menche J Huang CC et al A disease module inthe interactome explains disease heterogeneity drug re-sponse and captures novel pathways and genes in asthmaHum Mol Genet 2015243005ndash20

21 Derous D Kelder T Schothorst EM et al Network-based in-tegration of molecular and physiological data elucidatesregulatory mechanisms underlying adaptation to high-fatdiet Genes Nutr 201510(4)470

22 Kelder T Summer G Caspers M et al White adipose tissuereference network a knowledge resource for exploringhealth-relevant relations Genes Nutr 201510(1)439

23 Zitnik M Janjic V Larminie C et al Discovering disease-disease associations by fusing systems-level moleculardata Sci Rep 201333202

24 Gligorijevic V Przulj N Methods for biological data integra-tion perspectives and challenges J R Soc Interface 201512(112)20150571

25 Huan T Meng Q Saleh MA et al Integrative network ana-lysis reveals molecular mechanisms of blood pressure regu-lation Mol Syst Biol 201511(4)799

26 Langfelder P Horvath S WGCNA an R package for weightedcorrelation network analysis BMC Bioinformatics 20089(1)559

27 Ko Y Cho M Lee JS et al Identification of disease comorbid-ity through hidden molecular mechanisms Sci Rep 20166(1)39433

28 Sun K Buchan N Larminie C et al The integrated diseasenetwork Integr Biol 20146(11)1069ndash79

29 Park SJ Lee KS Kim SR et al AMPK activation reduces vascu-lar permeability and airway inflammation by regulating HIFVEGFA pathway in a murine model of toluene diisocyanate-induced asthma Inflamm Res 201261(10)1069ndash83

30 Ohno I Nitta Y Yamauchi K et al Transforming growth fac-tor beta 1 (TGF beta 1) gene expression by eosinophils inasthmatic airway inflammation Am J Respir Cell Mol Biol199615(3)404ndash9

31 Barnes PJ The cytokine network in asthma and chronic ob-structive pulmonary disease J Clin Invest 2008118(11)3546ndash56

32 Martin RJ Nocturnal asthma circadian rhythms and thera-peutic interventions Am Rev Respir Dis 1993147(6 Pt 2)S25ndash8

33 Szczepankiewicz A BreRborowicz A Skibinska M et alAssociation analysis of tyrosine kinase FYN gene poly-morphisms in asthmatic children Int Arch Allergy Immunol2008145(1)43ndash7

34 Padron-Morales J Sanz C Davila I et al Polymorphisms ofthe IL12B IL1B and TNFA genes and susceptibility toasthma J Investig Allergol Clin Immunol 201323(7)487ndash94

35 Xie S Sukkar MB Issa R et al Mechanisms of induction ofairway smooth muscle hyperplasia by transforming growthfactor-b Am J Physiol Lung Cell Mol Physiol 2007293(1)L245ndash53

36 Cao M Zhang H Park J et al Going the distance for proteinfunction prediction a new distance metric for protein inter-action networks PLoS One 20138(10)e76339

37 Kwak YG Song CH Ho KY et al Involvement of PTEN in air-way hyperresponsiveness and inflammation in bronchialasthma J Clin Invest 2003111(7)1083ndash92

38 Wang J Li F Yang M et al FIZZ1 promotes airway remodel-ing through the PI3KAkt signaling pathway in asthma ExpTher Med 20147(5)1265ndash70

39 Antony P Diederich NJ Kruger R et al The hallmarks ofParkinsonrsquos disease FEBS J 2013280(23)5981ndash93

40 Mizuno S Iijima R Ogishima S et al AlzPathway a compre-hensive map of signaling pathways of Alzheimerrsquos diseaseBMC Syst Biol 20126(1)52

41 Le Novere N Hucka M Mi H et al The systems biologygraphical notation Nat Biotechnol 2009 27735ndash41

42 Fujita KA Ostaszewski M Matsuoka Y et al Integratingpathways of Parkinsonrsquos disease in a molecular interactionmap Mol Neurobiol 201449(1)88ndash102

43 Satagopam V Gu W Eifes S et al Integration and visualiza-tion of translational medicine data for better understandingof human diseases Big Data 20164(2)97ndash108

44 Kuperstein I Bonnet E Nguyen H et al Atlas of cancer signallingnetwork a systems biology resource for integrative analysis ofcancer data with Google Maps Oncogenesis 20154(7)e160

45 Kutmon M Riutta A Nunes N et al WikiPathways capturingthe full diversity of pathway knowledge Nucleic Acids Res2015 44D488ndash94

46 Pon A Jewison T Su Y et al Pathways with PathWhizNucleic Acids Res 201543(W1)W552ndash9

47 Paley S OrsquoMaille PE Weaver D et al Pathway collages per-sonalized multi-pathway diagrams BMC Bioinformatics 201617(1)529

48 Hucka M Finney A Sauro HM et al The systems biologymarkup language (SBML) a medium for representation andexchange of biochemical network models Bioinformatics200319524ndash31

49 Demir E Cary MP Paley S et al The BioPAX community stand-ard for pathway data sharing Nat Biotechnol 201028935ndash42

50 Ruebenacker O Systems biology pathway exchange(SBPAX) In Encyclopedia of Systems Biology New YorkSpringer-Verlag 2013 2064ndash8

51 Pratt D Chen J Welker D et al NDEx the network data ex-change Cell Syst 20151(4)302ndash5

52 Smoot ME Ono K Ruscheinski J et al Cytoscape 28 newfeatures for data integration and network visualizationBioinformatics 201027431ndash2

53 Paull EO Carlin DE Niepel M et al Discovering causal path-ways linking genomic events to transcriptional states usingTied Diffusion Through Interacting Events (TieDIE)Bioinformatics 201329(21)2757ndash64

54 Mao XQ Kawai M Yamashita T et al Imbalance productionbetween interleukin-1b (IL-1b) and IL-1 receptor antagonist(IL-1ra) in bronchial asthma Biochem Biophys Res Commun2000276(2)607ndash12

55 Slater T Song D Saved by the BEL ringing in a common lan-guage for the life sciences Drug Discov World Fall 20128075ndash80

56 Ashburner M Ball CA Blake JA et al Gene Ontology tool forthe unification of biology Nat Genet 200025(1)25ndash9

57 Hastings J de Matos P Dekker A et al The ChEBI referencedatabase and ontology for biologically relevant chemistryenhancements for 2013 Nucleic Acids Res 201241D456ndash63

58 Catlett NL Bargnesi AJ Ungerer S et al Reverse causal rea-soning applying qualitative causal knowledge to the inter-pretation of high-throughput data BMC Bioinformatics 201314(1)340

Navigating the disease landscape | 13

Dow

nloaded from httpsacadem

icoupcombibadvance-article-abstractdoi101093bibbby0254978142 by guest on 08 January 2019

59 Kodamullil AT Younesi E Naz M et al Computable cause-and-effect models of healthy and Alzheimerrsquos disease statesand their mechanistic differential analysis AlzheimersDement 201511(11)1329ndash39

60 Malhotra A Younesi E Bagewadi S et al Linking hypothet-ical knowledge patterns to disease molecular signatures forbiomarker discovery in Alzheimerrsquos disease Genome Med20146(11)97

61 Naz M Kodamullil AT Hofmann-Apitius M Reasoning overgenetic variance information in cause-and-effect models ofneurodegenerative diseases Brief Bioinform 201617(3)505ndash16

62 Younesi E Hofmann-Apitius M From integrative diseasemodeling to predictive preventive personalized and par-ticipatory (P4) medicine EPMA J 20134(1)23

63 Li C Liakata M Rebholz-Schuhmann D Biological networkextraction from scientific literature state of the art and chal-lenges Brief Bioinform 201415(5)856ndash77

64 Lai PT Lo YY Huang MS et al BelSmile a biomedical se-mantic role labeling approach for extracting biological ex-pression language from text Database 20162016baw064

65 Groth P Gibson A Velterop J The anatomy of a nanopublica-tion Inf Serv Use 201030(1ndash2)51ndash6

66 Kramer A Green J Pollard J Jr et al Causal analysisapproaches in ingenuity pathway analysis Bioinformatics201430(4)523ndash30

67 Cano I Lluch-Ariet M Gomez-Cabrero D et al Biomedical re-search in a digital health framework J Transl Med 201412(Suppl 2)S10

68 Maier D Kalus W Wolff M et al Knowledge managementfor systems biology a general and visually driven frame-work applied to translational medicine BMC Syst Biol 20115(1)38

69 Gomez-Cabrero D Abugessaisa I Maier D et al Data integra-tion in the era of omics current and future challenges BMCSyst Biol 20148(Suppl 2)I1

70 Cano I Tenyi A Schueller C et al The COPD knowledge baseenabling data analysis and computational simulation intranslational COPD research J Transl Med 201412(Suppl 2)S6

71 Rappaport N Nativ N Stelzer G et al MalaCards an inte-grated compendium for diseases and their annotationDatabase 20132013(0)bat018

72 Ben-Ari Fuchs S Lieder I Stelzer G et al GeneAnalytics anintegrative gene set analysis tool for next generationsequencing RNAseq and microarray data OMICS 201620(3)139ndash51





