Designing an IT infrastructure for data-intensive collaborative -omics projects
Stathis [email protected]
European Bioinformatics Institute, Cambridge, UK
ICTA 2011
Outline
• Introduction
• Why design at all?
• Principles of collaborative design
• A software suite for cross-disciplinary collaborative studies
• Results
• Conclusions
INTRODUCTION
The “central dogma” of information flow in molecular biology
DNA → RNA → Protein
• Replication (DNA synthesis): DNA → DNA
• Transcription (RNA synthesis): DNA → RNA
• Translation (protein synthesis): RNA → Protein
Source: http://www.rsc.org/chemistryworld/Issues/2009/November/BiologysNobelMoleculeFactory.asp
The -omics cascade
GENOMICS: What CAN happen
TRANSCRIPTOMICS: What APPEARS to happen
PROTEOME: What MAKES it happen
METABOLOME: What HAS happened
PHENOTYPE
Source: Systems Biology and the Omics Cascade, Karolinska Institutet, June 9-13, 2008
http://xkcd.com/793/
• 407: -omes and -omics terms [1]
• 330: genomes sequenced to date [2]
• 3B: size of the human genome in bases
• $10k: cost to sequence a single human [3]
• 30k: interdisciplinary bachelor's degrees awarded in the USA in 2005 [4]
Sources:
[1] http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics
[2] http://www.ensemblgenomes.org/
[3] http://www.genome.gov/sequencingcosts/
[4] http://en.wikipedia.org/wiki/Interdisciplinarity
[Figure: Trends in publication keywords in the field of bioinformatics, 2006 to 2011. Three panels: "semantic" vs "linked data"; "cloud" vs "server"; "omics" vs "genomics".]
Challenges in -omics research
• Expensive studies
  – Small number of replicates (n): microarrays, subjects...
  – Large number of variables: genes, proteins, etc.
• This results in:
  – Inflated type I error (more false positives)
  – Poor statistical power (fewer true positives detected)
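The inflation of type I error can be made concrete with a little arithmetic. The sketch below assumes m independent tests, each run at significance level alpha, with every null hypothesis true. The function names and the example numbers (alpha = 0.05, up to 20,000 variables) are illustrative assumptions, not figures from the projects discussed in this talk. The Bonferroni correction is shown as one common, conservative remedy.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(alpha: float, m: int) -> float:
    """Expected number of false positives when all m null hypotheses are true."""
    return alpha * m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / m


if __name__ == "__main__":
    alpha = 0.05
    # 20_000 is a hypothetical number of variables (e.g. genes on an array).
    for m in (1, 10, 100, 20_000):
        print(
            f"m={m:>6}  FWER={family_wise_error_rate(alpha, m):.4f}  "
            f"E[false positives]={expected_false_positives(alpha, m):.1f}  "
            f"Bonferroni threshold={bonferroni_threshold(alpha, m):.2e}"
        )
```

The stricter per-test threshold controls false positives but lowers power further, which is why a small number of replicates hurts twice in this setting.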
WHY DESIGN AT ALL?
http://xkcd.com/970/
Volume vs Complexity cost model
Project | Samples | Research subjects | Studies/data types | Assays    | Files/volume  | Users/roles/user groups | Publ-s per year
MolPAGE | 16.5k   | 2.2k              | 300/11             | 26 000/11 | 27 000/0.7 TB | 80/1/1                  | 1
ENGAGE  | >100k   | 100k              | 400/13             | ***       | 400/0.25 TB   | 30/5/13                 | 10
[Figure: volume (V) against complexity (C), where C ~ data types * user roles * scripts. Two regimes are marked: "Growth of complexity is slower than volume" and "Both volume and complexity grow fast".]
Maria Krestyaninova, 2009
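As a rough illustration of the cost model, the sketch below computes the complexity proxy C ~ data types * user roles * scripts for the two projects listed on this slide. The number of scripts per project is not given, so it is left as a parameter that defaults to 1 (an assumption); the resulting values are only meaningful for comparing projects, not as absolute costs. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    data_types: int
    user_roles: int
    volume_tb: float
    scripts: int = 1  # assumption: script counts are not reported on the slide

    def complexity(self) -> int:
        """Complexity proxy: data types * user roles * scripts."""
        return self.data_types * self.user_roles * self.scripts


# Data types, user roles and volume are read from the project table on this slide.
molpage = Project("MolPAGE", data_types=11, user_roles=1, volume_tb=0.7)
engage = Project("ENGAGE", data_types=13, user_roles=5, volume_tb=0.25)

if __name__ == "__main__":
    for p in (molpage, engage):
        print(f"{p.name}: volume={p.volume_tb} TB, complexity proxy={p.complexity()}")
```

Under these assumptions ENGAGE stores less data than MolPAGE yet scores higher on the proxy, because it has more user roles and data types.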
Ome vs Omics
Source: http://omics.org/index.php/File:Ome_versus_omics_graph_by_Jong_Bhak_openfree.gif
[Figure: Ome versus omics, cost over time. Cost axis: $3,000,000,000, $10,000, ~$0. Time axis: 2003 to 2016. Annotation: "Ome and Omics balance point", 2010, $50,000 per person.]
Reporting requirements for publication
Phenotypes/conditions or outcomes considered in a study
Statistical methods/protocols used in a study
HTP data used for association (e.g. GWA) genomics
Raw data
Processed data
Results of analysis
Omics investigation
DataShaper, OBO
ISA-TAB, MAGE-TAB, MIBBI
Bioconductor
Nobody wants a cellphone that makes calls!
Make your application:
1. Contextualized
2. Usable
3. Enjoyable
4. Visible (increases reputation)
5. Sociable
6. Valuable
7. Explorable
8. Flexible
9. In a participatory way
10. …
OPEN-SOURCE COLLABORATIVE DESIGN
Maxims of the post-information era
• “If the news is important, it will find me”
• “Information wants to be free”
• “It's not information overload, it's filter failure”
• “The people formerly known as the audience”
• “The sources go direct”
• and finally…
Source: http://markcoddington.com/2010/01/30/a-quick-guide-to-the-maxims-of-new-media/
“Do what you do best, link the rest”
http://xkcd.com/974/
Agile development
Individuals & interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
• In practice: frequent iterations driven by customer feedback, and trust
Metadesign
[Figure: participation level at each design stage.]
Participation level: none, indirect, consultative, shared control, full control
Stages: Analysis, Concept design, Concept communication, Distribution, End-of-life
Courtesy of Massimo Menichinelli http://www.openp2pdesign.org/
SOFTWARE FOR CROSS-DISCIPLINARY COLLABORATIVE STUDIES
SIMBioMS
The big picture
CENTRAL DATA ARCHIVES
SIMBIOM
SOBIBA
ISA
QURETEC
METABAR
etc.
• dynamic storage
• project hosting
• fast exchange

• permanent deposition
• large volumes
• open access

support for collaborative discovery
knowledge access and sustainability
large consortia
stand-alone researchers
Maria Krestyaninova, 2009
USERS
DATA PROVIDERS
System overview
Biobanks
-omics
Experiment DB
Sample DB
Public Index
submission
submission
controlled access
open access
Maria Krestyaninova, 2009
Current infrastructural volume
• 12 installations in 3 countries
• 100 user-organisations
• >50,000 samples
• >50,000 assays and studies
• 4 large federated R&D projects across Europe and Russia
Krestyaninova et al, Bioinformatics, 2009
Viksna et al, BMC Bioinformatics, 2007
SIMBioMS in collaborative biomedical research initiatives
Project | Goal/Description | Funded by | SIMBioMS team involvement

Strategic research collaborations
BBMRI (www.bbmri.eu) | Build a network of population-based biobanks and experts, and foster collaboration between them. Provide advice to industry. | EC, OECD | Prototyping of data management model, use-case design, discussions.
P3G (www.p3g.org) | | Canadian Gov., memberships | Leading international Informatics Working Group; discussions.
ELIXIR (www.elixir-europe.org/page.php) | Create a sustainable infrastructure for the storage and distribution of information produced by bioscientists. | EC | Prototyping, reports, cooperation with organisation of medical informatics committee on behalf of EBI.
TaraOceans (oceans.taraexpeditions.org) | 3-year-long circumnavigation expedition for marine genomics and climate integrative study. | CNRS, industry, potentially EC | Preliminary design of data management solution; meetings, discussions.

Services for research collaborations
ENGAGE (www.euengage.org) | Genetic and genomic research for clinical application. | EC | Design, development and maintenance of dedicated data exchange services, based on SIMBioMS.
MolPAGE (www.molpage.org) | Biomarkers: discovery and development of novel high-throughput methods. | EC |
MuTHER | Exploration of gene expression in multiple tissues on 1000 twins associated with aging. | Wellcome Trust |
SIROCCO (www.sirocco-project.eu) | Study of small RNAs as regulatory cell mechanism; therapeutic applications. | EC |
CAGEKID | Kidney cancer study. | EC |
SUMMIT | Surrogate markers for vascular Micro- & Macrovascular hard endpoints for Innovative diabetes Tools | EC |
Anton Enright, 2011
CONCLUSIONS
Complex interactions
• Who has a say in knowledge extracted from information?
  – Research subjects
    • Consent to particular research being conducted
  – Scientists
    • Protective of vision about their data
  – Funding sources
    • Expect publications from grantees
Pharma
BioBanks
Research Institutions
big data
industry
academia
state
FDA
Ministry of Health
Ministry of Education
Yulia Tammisto, 2011
Complex software
• TIME is the scarcest resource
• Software adoption due to:
  – Requirements
  – No other way to do things
  – Usefulness
• Use = 1 – Reuse
One goal
Search for the truth
Thank you!
Acknowledgements:
• Maria Krestyaninova
• Ugis Sarkans
• Anton Enright
• Mat Davis
• Yulia Tammisto
• Massimo Menichinelli
• Teemu Perheentupa
• Jani Heikkinen
• Balaji Rajashekar
• Raivo Kolde
• Jaak Vilo
Uniquer
www.simbioms.org