Providing Bioinformatics Services on Cloud
Christophe Blanchet, Clément Gauthey
Infrastructure Distributed for Biology - IDB
CNRS-IBCP FR3302, Lyon, FRANCE
http://idee-b.ibcp.fr
IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and the French National Research Agency's Arpege Programme (ANR-10-SEGI-001)
EGI CF13, Manchester, 9 April 2013
Bioinformatics Today
• Biological data are big data
  • 1512 online databases (NAR Database Issue 2013)
  • Sanger Institute, UK, 5 PB
  • Beijing Genome Institute, China, 4 sites, 10 PB
  ➡ Big data in lots of places
• Analysing such data has become difficult
  • Scale-up of the analyses: from gene/protein to complete genome/proteome, ...
  • Lots of different tools in daily use
  • That need to be combined in workflows
  • Usual interfaces: portals, Web services, federation, ...
  ➡ Datacenters with ease of access/use
• Distributed resources
  • Experimental platforms: NGS, imaging, ...
  • Bioinformatics platforms
  ➡ Federation of datacenters
[Figure: distributed sites labelled ADN, BI, M and CC]
Sequencing Genomes
source: www.politigenomics.com/next-generation-sequencing-informatics
Complete genome sequencing has become a lab commodity with NGS (cheap and efficient)
source: www.genomesonline.org
Infrastructures in Biology
Lots of tools and web services to process and visualize lots of data
The scene
• Bioinformatics service providers
  • Is it easy to deploy lots of (incompatible) tools?
  • To connect them to public databases?
  • To limit the transfer of huge data?
  • To provide users with their own computing resources?
  • With their own isolated storage?
• Scientists
  • Is it easy to access/use these tools?
  • To adapt them to your usage?
  • To get your own or others' tools deployed on a datacenter?
  • To combine them?
  • To get your own computing/storage resources?
IDB’s Cloud
• Cloud workbench for Biology
  • 13 turnkey bioinformatics appliances (as of Apr. 2013)
  • Running since Sept. 2011, open to the biology community
  • Lyon, FRANCE
• Powered by
  • StratusLab
    • Compute nodes, block storage
    • +900 cores, +4 TB RAM, 36 TB vdisks
    • Mainly Intel Sandy Bridge servers with 32 cores and 128 GB RAM
    • Bigmem servers with 64 cores and 768 GB RAM
    • VMs from 1 to 64 cores, 512 MB to 760 GB RAM
  • + OpenStack
    • Object storage (Swift)
    • +200 TB redundant & scalable storage
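As a rough illustration of the sizing figures above, the sketch below checks which server class could host a requested VM. The two node shapes (32 cores with 128 GB, 64 cores with 768 GB) and the VM limits (1 to 64 cores, 512 MB to 760 GB RAM) are taken from the slide; the class names, the function and the eligibility rule are hypothetical, and the placement logic actually used on the platform is likely more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeClass:
    name: str      # illustrative label, not a name used by the platform
    cores: int
    ram_gb: float


# Node shapes quoted on the slide.
NODE_CLASSES = [
    NodeClass("standard", cores=32, ram_gb=128),
    NodeClass("large_memory", cores=64, ram_gb=768),
]

# VM limits quoted on the slide: 1 to 64 cores, 512 MB to 760 GB RAM.
MIN_CORES, MAX_CORES = 1, 64
MIN_RAM_GB, MAX_RAM_GB = 0.5, 760


def eligible_nodes(cores: int, ram_gb: float) -> List[str]:
    """Return the node classes large enough to host the requested VM."""
    if not MIN_CORES <= cores <= MAX_CORES:
        raise ValueError(f"cores must be between {MIN_CORES} and {MAX_CORES}")
    if not MIN_RAM_GB <= ram_gb <= MAX_RAM_GB:
        raise ValueError(f"RAM must be between {MIN_RAM_GB} and {MAX_RAM_GB} GB")
    return [
        node.name
        for node in NODE_CLASSES
        if cores <= node.cores and ram_gb <= node.ram_gb
    ]
```

For example, `eligible_nodes(8, 64)` would accept either class, while a 48-core or 512 GB request only fits the larger one.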
Driven through a simple web interface
Integrate Bioinformatics Tools in Cloud
[Diagram: bioinformatics tools (BLAST, GOR4, FastA, SSearch, Abyss, ClustalW, Ray, BWA, PhyML) combined with Linux virtual machines (RedHat, CentOS, Debian, Ubuntu, Suse) to create a new appliance, published in the Bioinformatics Marketplace (NGS, Structure, Galaxy, ARIA, Sequence, (…))]
• Appliances are virtual machines
  • Small: a few GB, easy to convert to most virtualization formats
• Installed and pre-configured with common bioinformatics tools
  • e.g. BLAST, ClustalW, ARIA, MEME, HMMer, TopHat, BWA, Samtools, etc.
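The slide does not say which tool is used to convert appliances between virtualization formats. One common option is `qemu-img convert`; the sketch below assumes that tool and only builds the command line so it can be inspected before anything is run. The image file names and the default formats are hypothetical.

```python
import shlex
from typing import List


def qemu_img_convert_cmd(
    src: str, dst: str, src_fmt: str = "qcow2", dst_fmt: str = "vmdk"
) -> List[str]:
    """Build a `qemu-img convert` command (assumed tool, not named in the slides)."""
    # -f gives the input format, -O the output format.
    return ["qemu-img", "convert", "-f", src_fmt, "-O", dst_fmt, src, dst]


# Hypothetical appliance image names, for illustration only.
cmd = qemu_img_convert_cmd("biocompute.qcow2", "biocompute.vmdk")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would perform the conversion on a machine where `qemu-img` is installed.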
Bioinformatics Appliances
Select your bioinformatics tools
Run Bioinformatics Cloud Instances
[Diagram labels: Bioinformatics Marketplace (NGS, Structure, Galaxy, ARIA, Sequence, (…)); Launch instances; IBCP's cloud resources; IaaS: master & storage, workers, shared FS, launch jobs (ssh); PaaS: BLAST, Clustal, etc.; VM CNS; VM ARIA; Portal]
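The diagram indicates that jobs are launched over ssh from a master onto worker VMs that share a file system. A minimal sketch of that pattern follows, assuming hypothetical worker hostnames and a simple round-robin assignment; the slides do not describe the actual scheduling. The function only builds the `ssh` command lines and does not execute them.

```python
from itertools import cycle
from typing import List, Sequence, Tuple


def plan_ssh_jobs(
    workers: Sequence[str], commands: Sequence[str]
) -> List[Tuple[str, List[str]]]:
    """Assign each command to a worker in round-robin order.

    Returns (worker, argv) pairs, where argv is an `ssh <worker> <command>` call.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    plan = []
    # cycle() repeats the worker list; zip() stops once all commands are assigned.
    for worker, command in zip(cycle(workers), commands):
        plan.append((worker, ["ssh", worker, command]))
    return plan
```

Because input and output files would live on the shared file system, each command can refer to the same paths regardless of the worker it lands on.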
Manage your Cloud Instances
Biological Data in Cloud
[Diagram labels: public data sources (UNIPROT, PDB, EMBL, PROSITE, Genomes), shared (NFS); user persistent data, pdisk (iSCSI); upload your data and get your results via scp, http/S3; Bioinformatics Cloud: IaaS: master & storage, workers, shared FS, launch jobs (ssh); PaaS: BLAST, Clustal, etc.; VM CNS; VM ARIA; Portal]
Example: ‘biocompute’ Appliance
• Use your own instance(s)
• With pre-installed standard bioinformatics tools
  • BLAST, FastA, SSearch, HMM, ...
  • ClustalW2, Clustal-Omega, Muscle, ...
  • Bowtie(2), BWA, samtools, ...
  • MEME, R, etc.
• Connected to public reference data
  • Uniprot, EMBL, genomes, PDB, etc.
  • Automatically shared with the VMs
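On such an instance, a protein search against the shared reference data might be launched as sketched below. This assumes the installed BLAST is the BLAST+ suite, whose `blastp` accepts the `-query`, `-db`, `-out`, `-outfmt` and `-num_threads` options used here. The database path and file names are hypothetical, since the slides do not say where the shared databases are mounted.

```python
import shutil
import subprocess
from typing import List


def blastp_cmd(query: str, db: str, out: str, threads: int = 4) -> List[str]:
    """Build a BLAST+ blastp command producing tabular output (-outfmt 6)."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",
        "-num_threads", str(threads),
    ]


def run_blastp(query: str, db: str, out: str, threads: int = 4) -> None:
    """Run blastp if it is available on PATH, raising if it is missing or fails."""
    if shutil.which("blastp") is None:
        raise RuntimeError("blastp not found on PATH")
    subprocess.run(blastp_cmd(query, db, out, threads), check=True)
```

A call such as `run_blastp("query.fasta", "/shared/uniprot/uniprot_sprot", "hits.tsv", threads=8)` would then write the hits to `hits.tsv`; the `/shared/...` path is a placeholder for wherever the instance exposes the reference data.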
Example: Galaxy portal for NGS analyses
• Analyse NGS data
• the Galaxy portal is widely used in the community
• connected to large public data: sequences and indexes
• large user data (GBs)
• Preserve workflows and results (persistent storage)
Example: Proteomics
• Motivation
  • Collaboration with a mass spectrometry platform
  • Running out of space on their local resources
• Protein identification
  • Mass experimental data
  • Reference databases: nr, Swiss-Prot
  • Reference screening tools: OMSSA, X!Tandem
• User interface
  • Remote display
    • NX
  • Reference GUIs
    • SearchGUI
    • PeptideShaker
source: PeptideShaker site
Conclusion
• Provide turnkey bioinformatics appliances
  • Standard tools and pipelines
  • Interoperability: ready to run on cloud
  • Easier to transfer appliances than data (GB vs TB)
• Provide a cloud infrastructure tightly connected to existing bioinformatics infrastructure
  • IDB’s public bioinformatics cloud
  • Linked to public biological databases
  • In collaboration with the French Bioinformatics Institute
• Ease the usage by scientists
  • Usual bioinformatics gateways
  • Persistent and large ubiquitous storage
  • Web interface for cloud management
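The "GB vs TB" argument can be made concrete with back-of-the-envelope arithmetic. The appliance size, dataset size and link speed below are illustrative assumptions, not measurements from IDB, and the calculation ignores protocol overhead unless an efficiency factor is given.

```python
def transfer_seconds(
    size_bytes: float, link_bits_per_s: float, efficiency: float = 1.0
) -> float:
    """Ideal transfer time: size in bits divided by the usable link rate."""
    if link_bits_per_s <= 0:
        raise ValueError("link rate must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_s * efficiency)


GB = 10**9
TB = 10**12
LINK = 10**9  # assumed 1 Gbit/s link

appliance_s = transfer_seconds(4 * GB, LINK)   # assumed 4 GB appliance
dataset_s = transfer_seconds(10 * TB, LINK)    # assumed 10 TB dataset

print(f"appliance: {appliance_s:.0f} s")
print(f"dataset:   {dataset_s / 3600:.1f} h")
```

Under these assumptions the appliance moves in about half a minute, while the dataset needs roughly 22 hours even at the ideal line rate, a factor of 2500.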
Perspectives
• Define good practices to provide the academic community and industry with bioinformatics services!
• French Bioinformatics Institute - IFB
  • Goals are to provide core bioinformatics resources to the national and international life science research community in key fields such as genomics, proteomics, systems biology, etc.
  • Aims at building a national academic cloud devoted to bioinformatics, inspired by the model evaluated through IDB’s cloud.
• European ELIXIR infrastructure
  • To build a sustainable European infrastructure for biological information, supporting life science research and its translation
  • IFB will be the French representative in ELIXIR.
Acknowledgment
• StratusLab members
• Co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).
Questions?
http://idee-b.ibcp.fr