Structuring what we know and use that to better understand your data
@Chris_Evelo: Department of Bioinformatics – BiGCaT,
WikiPathways team, ELIXIR Interoperability team, Open PHACTS
So many…
ELIXIR, EXCELERATE, CORBEL, GA4GH, EGA, dbNP, ENPADASI, DISH, Open PHACTS, BBMRI, DRE, EuroCAT, DTL, EATRIS, DiXa, UniProt, PDB, ChEBI, ChEMBL, HMDB, ISA, FAIR, RDF, VoID, Nanopubs, eNanoMapper, KEGG, Reactome, Entrez, Parelsnoer, ArrayExpress, GEO, ENCODE, Recon2, SBML, SBGN, MIM
And that is just what I discussed yesterday…
The typical question we get about using big data
We can do things like this (diabetic liver)
Pihlajamäki et al. dataset is from Gene Expression Omnibus (accession number GSE15653)
Pihlajamäki et al. J Clin Endocrinol Metab. 2009, 94 (9): 3521-3529. DOI: 10.1210/jc.2009-0212.
Martina Kutmon et al. BMC Genomics 2014, 15:971. DOI: 10.1186/1471-2164-15-971
Data predators
Data: Wang et al. 2011, in Gene Expression Omnibus (GEO, http://ncbi.nlm.nih.gov/geo/), accession number GSE17461.
Published paper: Effects of 1alpha,25 dihydroxyvitamin D3 and testosterone on miRNA and mRNA expression in LNCaP cells. WL Wang et al. Mol Cancer 2011. 10. doi:10.1186/1476-4598-10-58
Or: Vitamin D effects on prostate cancer cells
Integrative network-based analysis of mRNA and microRNA expression in vitamin D3-treated cancer cells
[Diagram: Integrative Systems Biology]
Internal & external data repositories, e.g. dbNP, Sage, Atlas
Knowledge resources & (semantic web) integration, e.g. Open PHACTS, WikiPathways
Study capturing: ISA
Study data processing, statistics, storage, e.g. arrayanalysis.org
Models
Ontologies
Modeling & data integration, network biology (extension), supervised statistics
Curation, simulation, annotation & provenance
Research applications
Mapping: BridgeDb
Extraction, SPARQLing, conversion
http://www.wikipathways.org/instance/WP430
http://www.wikipathways.org/index.php/Pathway:WP430
WikiPathways
• Public resource for biological pathways
• Anyone can contribute and curate
• More up-to-date representation of biological knowledge
WikiPathways: capturing the full diversity of pathway knowledge. M Kutmon et al
Nucleic Acids Res 2015: first published online: Oct 19.
Big data: Wikiomics. Mitch Waldrop. Nature 2008: 455, 22-25
We the curators. Allison Doerr. Nature Methods 2008: 5, 754–755
No rest for the bio-wikis. Ewen Callaway. Nature 2010: 468, 359-360
How to do interoperable data visualization?
Connect to Genome Databases
Backpages link to multiple databases
You could do this for gene lists
Don’t be afraid to reinvent wheels!
BridgeDb: Abstraction Layer
interface IDMapper
class IDMapperRdb: relational database
class IDMapperFile: tab-delimited text
class IDMapperBiomart: web service
The BridgeDb Framework: Standardized Access to Gene, Protein and Metabolite Identifier Mapping Services. Martijn P van Iersel, Alexander R Pico, Thomas Kelder, Jianjiong Gao, Isaac Ho, Kristina Hanspers, Bruce R Conklin, Chris T Evelo. BMC Bioinformatics 2010, 11: 5.
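The abstraction-layer idea on this slide can be sketched in a few lines of Python. This is only an illustration of the pattern (one interface, interchangeable backends), not BridgeDb's actual Java API: the method name `map_id`, the four-column file layout, the in-memory backend `IDMapperMemory`, the helper `translate_gene_list`, and all identifiers are assumptions made up for this example.

```python
from abc import ABC, abstractmethod
import csv
import io


class IDMapper(ABC):
    """Common interface: every backend answers the same mapping question."""

    @abstractmethod
    def map_id(self, source_db, identifier, target_db):
        """Return the set of target_db identifiers linked to the input."""


class IDMapperFile(IDMapper):
    """Backend that reads a tab-delimited table (assumed 4-column layout)."""

    def __init__(self, handle):
        self._links = {}
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 4:
                continue  # skip blank or malformed lines
            src_db, src_id, tgt_db, tgt_id = row
            # Store both directions so lookups are symmetric.
            self._links.setdefault((src_db, src_id, tgt_db), set()).add(tgt_id)
            self._links.setdefault((tgt_db, tgt_id, src_db), set()).add(src_id)

    def map_id(self, source_db, identifier, target_db):
        return set(self._links.get((source_db, identifier, target_db), ()))


class IDMapperMemory(IDMapper):
    """Hypothetical stand-in for a database or web-service backend."""

    def __init__(self, pairs):
        self._pairs = list(pairs)

    def map_id(self, source_db, identifier, target_db):
        hits = set()
        for (db_a, id_a), (db_b, id_b) in self._pairs:
            if (db_a, id_a, db_b) == (source_db, identifier, target_db):
                hits.add(id_b)
            elif (db_b, id_b, db_a) == (source_db, identifier, target_db):
                hits.add(id_a)
        return hits


def translate_gene_list(mapper, source_db, ids, target_db):
    """Client code depends only on the IDMapper interface."""
    return {i: sorted(mapper.map_id(source_db, i, target_db)) for i in ids}


# Toy data: these identifiers are invented for the example.
TOY_TABLE = (
    "Entrez Gene\t1001\tEnsembl\tENSG_TOY_A\n"
    "Entrez Gene\t1002\tEnsembl\tENSG_TOY_B\n"
)
file_mapper = IDMapperFile(io.StringIO(TOY_TABLE))
memory_mapper = IDMapperMemory(
    [(("Entrez Gene", "1001"), ("Ensembl", "ENSG_TOY_A"))]
)

file_result = translate_gene_list(
    file_mapper, "Entrez Gene", ["1001", "1002"], "Ensembl"
)
memory_result = translate_gene_list(
    memory_mapper, "Entrez Gene", ["1001", "1002"], "Ensembl"
)
```

The point of the design is that `translate_gene_list` never needs to know whether the mappings come from a file, a database, or a web service; swapping the backend changes coverage, not the calling code.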
Combine: WikiPathways tissue analyzer
Work done by Jonathan Melius
WikiPathways, a house of webs?
Combine: adding miRNAs clutters
Combine: regulator Interaction in MiPaSt PathVisio plugin
Work done by Christian Oertlin.
Pathways in Cytoscape
Figure 2. The Cardiac Hypertrophic Response pathway loaded as a network.
Kutmon M, Lotia S, Evelo CT and Pico AR 2014 [v1; ref status: indexed, http://f1000r.es/3ij] F1000Research 2014, 3:152 (doi: 10.12688/f1000research.4254.1)
PPS1
Liver
All pathways
Pathways with high z-score grouped together.
Explains why there are relatively few significant genes, but many pathways with high z-score.
Cytoscape visualization used to group
Pathway interactions and what causes them
Thomas Kelder, Lars Eijssen, Robert Kleemann, Marjan van Erk, Teake Kooistra, Chris Evelo
(2011) Exploring pathway interactions in insulin resistant mouse liver.
BMC Systems Biology 5: 127 Aug. http://dx.doi.org/doi:10.1186/1752-0509-5-127
Pathway interactions and detailed network visualization for the interactions with three apoptosis related pathways for the comparison between HF and LF diet at t = 0. A: Subgraph of the pathway interaction network, based on incoming interactions to three stress response and apoptosis pathways with the highest in-degree. Pathway nodes with a thick border are significantly enriched (p < 0.05) with differentially expressed genes. B: The protein interactions that compose the interactions between the three apoptosis related pathways and their neighbors in the subgraph as shown in box A (see inset, included interactions are colored orange). Protein nodes have a thick border when their encoding genes are significantly differentially expressed (q < 0.05).
Regulation resources
human ErbB signaling pathway extended with validated microRNA regulation
If we don’t do the magic
Literature, PubChem, GenBank, Patents, Databases, Downloads
Data analysis, data integration, firewalled databases
How do R&D companies use public data?
How do pharma companies use public data?
Pfizer, AZ, Roche, … n
@gray_alasdair Big Data Integration
Semantic web grammar
[Diagram: platform architecture]
Apps
Linked Data API (RDF/XML, TTL, JSON)
Semantic Workflow Engine
Data Cache (Virtuoso Triple Store)
Core Platform
Domain Specific Services
Identity Resolution Service
Chemistry Registration, Normalisation & Q/C
Identifier Management Service
Indexing
Example inputs: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
Data sources (Nanopub, Db, VoID): Public Content, Commercial, Public Ontologies, User Annotations
Choose a standard
Link one resource to another
Or use both and map
Mapping tools are core tools: need funding and sustainability
Database identifier mapping tools we have:
• A software framework (BridgeDb)
– Application in WikiPathways, PathVisio, Cytoscape, R/Bioconductor
– An installable webservice
– Open source
– Community based
– Database based (small)
• A semantic web implementation (Open PHACTS IMS)
– With installable Docker image
– Linkset based (fast)
– Transitivity (and limits for that)
  • gene -> protein -> has enzyme code
  • protein -> has enzyme code -> other proteins
• Identifiers.org for ID schemas and resolution
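Why transitivity needs limits can be shown with a small sketch. It follows links between identifiers breadth-first and stops after a fixed number of hops. This is an assumed, simplified model, not how the Open PHACTS IMS is implemented; the link list, the function names, and the identifiers are invented for illustration.

```python
from collections import deque

# Toy linksets: each pair says two identifiers are linked (made-up IDs).
LINKS = [
    ("gene:G1", "protein:P1"),
    ("protein:P1", "enzyme:EC_X"),
    ("enzyme:EC_X", "protein:P2"),
    ("enzyme:EC_X", "protein:P3"),
]


def build_index(links):
    """Turn the pair list into an undirected adjacency index."""
    index = {}
    for a, b in links:
        index.setdefault(a, set()).add(b)
        index.setdefault(b, set()).add(a)
    return index


def map_transitively(index, start, max_hops):
    """Breadth-first search; return {identifier: hops} within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_hops:
            continue  # the limit: do not expand any further from here
        for neighbour in index.get(current, ()):
            if neighbour not in seen:
                seen[neighbour] = seen[current] + 1
                queue.append(neighbour)
    del seen[start]
    return seen


index = build_index(LINKS)
two_hops = map_transitively(index, "gene:G1", 2)
three_hops = map_transitively(index, "gene:G1", 3)
```

With two hops the gene reaches its protein and that protein's enzyme code. One hop more and it also "maps" to every other protein sharing the enzyme code, which is rarely what the user meant.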
This is not just Open PHACTS
Federated SPARQL queries:
e.g. find all genes related to disease, then all pathways with these genes…
Used as hackathon (SWAT4LS) examples
Only works sometimes, by chance
Needs integrated ID mapping!
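The "only works sometimes, by chance" problem can be simulated without any endpoint. In this sketch two dictionaries stand in for the results of two federated sources, one listing genes per disease and one listing genes per pathway. The sources, the identifiers, and the functions `expand` and `pathways_for_disease` are assumptions made for illustration; no real SPARQL endpoint or dataset is queried.

```python
# Toy results from two hypothetical sources that use different gene ID systems.
DISEASE_GENES = {
    "disease:D1": {("Entrez Gene", "1001"), ("Entrez Gene", "1002")},
}
PATHWAY_GENES = {
    "pathway:W1": {("Ensembl", "ENSG_TOY_A")},
    "pathway:W2": {("Entrez Gene", "1002")},
    "pathway:W3": {("Ensembl", "ENSG_TOY_C")},
}

# Toy equivalences, as an ID mapping service might supply them.
EQUIVALENT = [
    (("Entrez Gene", "1001"), ("Ensembl", "ENSG_TOY_A")),
    (("Entrez Gene", "1002"), ("Ensembl", "ENSG_TOY_B")),
]


def expand(xrefs, equivalences):
    """Add every identifier directly equivalent to one already in xrefs."""
    expanded = set(xrefs)
    for a, b in equivalences:
        if a in xrefs:
            expanded.add(b)
        if b in xrefs:
            expanded.add(a)
    return expanded


def pathways_for_disease(disease, equivalences=()):
    """Join the two sources on gene identifier, optionally after mapping."""
    genes = expand(DISEASE_GENES[disease], equivalences)
    return sorted(
        pathway
        for pathway, members in PATHWAY_GENES.items()
        if genes & members
    )


naive = pathways_for_disease("disease:D1")
mapped = pathways_for_disease("disease:D1", EQUIVALENT)
```

The naive join finds only the pathway that happens to use the same identifier system as the disease source. With the mapping step in between, the pathway annotated with the other system is found as well.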
Ontology mapping
• Many available, even as services
• Often integrated in data resources
– Make my own, slim, combine, map, extend
– Needs feedback to original!
Metabolite mapping needs
• More mappings! (plant products, drugs, xenobiotics)
• Ontology based mapping (ChEBI)
• Because:
– Palmitic acid is a fatty acid
– R,R,R-tocopherol is a form of Vitamin E
• And these should (sometimes) map
Also applies to biology: scientific lenses
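A minimal sketch of what "should (sometimes) map" could look like in code: a lens parameter decides whether only identical terms match or whether a compound also matches the classes it belongs to. The tiny is-a table below is hand-written for the example and is not taken from ChEBI; the lens names `strict` and `class` and both functions are assumptions.

```python
# Hand-written toy hierarchy (assumed acyclic): child -> parent class.
IS_A = {
    "palmitic acid": "fatty acid",
    "stearic acid": "fatty acid",
    "R,R,R-tocopherol": "vitamin E",
}


def ancestors(term, is_a):
    """Walk up the is-a chain and return all parent classes, nearest first."""
    chain = []
    while term in is_a:
        term = is_a[term]
        chain.append(term)
    return chain


def same_under_lens(a, b, is_a, lens="strict"):
    """Decide whether two terms should map under the chosen lens."""
    if a == b:
        return True
    if lens == "strict":
        return False  # only identical terms map
    if lens == "class":
        # A compound also maps to any class it belongs to (and vice versa).
        return b in ancestors(a, is_a) or a in ancestors(b, is_a)
    raise ValueError(f"unknown lens: {lens}")


strict_match = same_under_lens("palmitic acid", "fatty acid", IS_A, "strict")
class_match = same_under_lens("palmitic acid", "fatty acid", IS_A, "class")
sibling_match = same_under_lens("palmitic acid", "stearic acid", IS_A, "class")
```

Under the class lens a measured palmitic acid value can be placed on a pathway node labelled "fatty acid", while two different fatty acids still do not map to each other.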
Chemistry mapping
• Structure not ID based
• Allow substructure searches
• Open PHACTS open source ???
• We need it, may have to redo
From reproducibility to reusability
Reuse problems
The age distribution in the experimental groups was not significantly different…
Can we reuse that data to find out age effects?
Yes, if that is actually captured
Needs:
• Ontologies (BioPortal)
• Principles/standards (FAIR, ISA)
• Capture tools (dbNP, Molgenis, OpenClinica, eNotebooks)
• Study repositories (BioSamples, BioStudies)
• Data repositories (EGA, GEO, ArrayExpress, MetaboLights, PRIDE)