Introduction to bioinformatics data analysis with the Galaxy application
Enis Afgan, Ruđer Bošković Institute
30 September 2014



All of us
•  In 30 seconds or less, tell everyone:
   •  Your name
   •  Your department / affiliation
   •  Something about your research
   •  Why you are here / what you hope to learn

Workshop overview

9:30-10:00 Introductory lecture: the Galaxy and CloudMan applications

10:00-10:15 Q&A / break

10:15-10:30 Launching your own CloudMan cluster

10:30-11:30 Galaxy 101

11:30-11:45 Q&A / break

11:45-12:30 Configuring the Galaxy and CloudMan applications

12:30-12:45 Survey and AWS credits: 3x $100

Making sense of this data requires a sophisticated analysis environment with adequate computational infrastructure that is accessible to researchers while it ensures reproducibility of scientific results.

Galaxy: accessible analysis system

What is Galaxy?

A data analysis and integration tool

A (free for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

Open source software that makes integrating your own tools and data and customizing for your own site simple

Need an analysis? There’s a tool for that.

Running a tool
-  Automatically generated web UI from a tool wrapper (any tool can be integrated)
-  Integrated with other tools

Data analysis history

Galaxy Workflows

Reproducibility in Genomics

18 Nat. Genetics experiments in microarray gene expression

<50% reproducible

Problems
•  missing data (38%)
•  missing software, hardware details (50%)
•  missing methods, processing details (66%)

Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression analyses.” Nat Genet 41, 149-155 (2009)

14 re-sequencing experiments in Nat. Genetics, Nature, Science

0% reproducible?

Problems
•  missing primary data (50%)
•  tools unavailable (50%)
•  missing parameter setting, tool versions (100%)

"Devil in the details," Nature, vol. 470, 305-306 (2011).

Metadata = Reproducibility

Automatic metadata

Data provenance

User metadata

Sharing and Publishing

Three ways to use Galaxy

•  Public website

•  Download and Run Locally

•  Run on the Cloud

http://usegalaxy.org (a.k.a. Main)

•  Public web site

•  Anybody can use it

•  Hundreds of tools

•  Persistent

•  +500 users/month

•  ~200TB of user data

•  ~140,000 analysis jobs / month

http://bit.ly/gxystats

Public Galaxy Servers https://wiki.galaxyproject.org/PublicGalaxyServers

Interested in:

ChIP-chip and ChIP-seq?
✓  Cistrome

Statistical Analysis?
✓  Genomic Hyperbrowser

Sequence and tiling arrays?
✓  Oqtans

Text Mining?
✓  DBCLS Galaxy

Reasoning with ontologies?
✓  GO Galaxy

Internally symmetric protein structures?
✓  SymD

getgalaxy.org

Configuration

Local installation

Compute clusters

•  A number of connected computers

•  Typically built from commodity components

•  Used to improve performance: throughput or speed (supercomputers)

ALL GOOD, RIGHT?

Two challenges still exist

Infrastructure Customization

cloudman.irb.hr

AWS

OpenStack

Eucalyptus

Configuration

Cloud Computing
•  Dynamically scalable shared resources accessed over a network
•  Control infrastructure via API
•  Private, public, or hybrid
•  Virtually unlimited resources: storage, computing, services
•  Only pay for what you use

What is CloudMan?

CloudMan allows one to create a compute cluster in the cloud, use pre-configured applications, or add one’s own. And then share it all.

Deploying a CloudMan Platform

1.  An account on the supported cloud

2.  Start a master instance via a launcher app or the cloud web dashboard

3.  Use the CloudMan web interface on the master instance to manage the platform

Manage Your Cluster

Share Your Instance
•  Share entire (Galaxy) CloudMan platform
   •  Even the customized ones (including data and/or tools)
   •  Fully automated solution
•  Publish a self-contained analysis
   •  In progress or otherwise

How much does the Cloud cost?

Amazon Web Services
•  $0.14 per CPU hour (~$100 per CPU month)
•  $0.05 per GB-month (~$50 per TB-month)
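The monthly figures follow from the hourly and per-GB rates by simple multiplication. A minimal sketch of that arithmetic, assuming a 30-day month and 1 TB = 1000 GB (both are assumptions made here, and the function names are illustrative; the rates are the ones quoted on the slide, not current AWS pricing):

```python
# Rates quoted on the slide (2014); not current AWS pricing.
CPU_RATE_PER_HOUR = 0.14     # USD per CPU hour
STORAGE_RATE_PER_GB = 0.05   # USD per GB-month

HOURS_PER_MONTH = 24 * 30    # assumption: 30-day month
GB_PER_TB = 1000             # assumption: decimal terabyte


def cpu_month_cost(cpus: int = 1) -> float:
    """Cost in USD of running `cpus` CPUs non-stop for one month."""
    return cpus * CPU_RATE_PER_HOUR * HOURS_PER_MONTH


def storage_month_cost(terabytes: float = 1.0) -> float:
    """Cost in USD of storing `terabytes` TB for one month."""
    return terabytes * GB_PER_TB * STORAGE_RATE_PER_GB


print(f"1 CPU for a month: ${cpu_month_cost():.2f}")
print(f"1 TB for a month:  ${storage_month_cost():.2f}")
```

Because billing is per use, a cluster that runs only during an analysis costs a fraction of these always-on figures.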


Working with your own CloudMan cluster
•  Launch an instance
•  Demonstrate the following CloudMan features and prepare for the data analysis part:
   •  Manual & Auto-scaling
   •  Using an S3 bucket as a data source
   •  Accessing an instance over ssh
   •  Customizing an instance
   •  Controlling Galaxy
   •  Sharing-an-instance
•  Perform data analysis in Galaxy
   •  Find exons with most SNPs

Interaction flow

YOUR TURN

Launch an instance
1.  Slides @ bit.ly/irb-ws
2.  Load biocloudcentral.org
3.  Enter the access key and secret key provided at http://bit.ly/ws-creds
4.  Provide your email address
5.  Use your initials as the cluster name
6.  Set any password (and remember it)
7.  Use Large instance type
8.  Start your instance
    Wait for the instance to start (~2-3 minutes)
9.  Access Galaxy application

For more details, see http://cloudman.irb.hr


Agenda details (interaction flow)
•  Launch an instance
•  Perform data analysis in Galaxy
   •  Find exons with most SNPs
•  Demonstrate the following CloudMan features and prepare for the data analysis part:
   •  Manual & Auto-scaling
   •  Using an S3 bucket as a data source
   •  Accessing an instance over ssh
   •  Customizing an instance
   •  Controlling Galaxy
   •  Sharing-an-instance

On human chromosome 22, which coding exons have the most SNPs in them?

A Rough Plan

•  Get some data
   •  Coding exons on chromosome 22
   •  SNPs on chromosome 22
•  Mess with it
   •  Identify which exons have SNPs
   •  Count SNPs per exon
•  Visualize our results

[Figure sequence: the workflow built step by step. Inputs: Exons, from UCSC; SNPs, from UCSC. Steps: Overlap pairings, Exon overlap counts, Join on exon name, Rearrange columns w/ cut.]

Data types overview: BED
•  Tab-delimited text file that defines a feature track
•  Zero-based
•  One line per feature
•  Each line contains 3-12 columns
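To make the zero-based convention concrete, here is a minimal sketch of reading one BED line in Python. It handles only the first four columns (chrom, start, end, name); the `BedFeature` type and `parse_bed_line` function are hypothetical names, and the example line is made up.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int                   # zero-based, inclusive
    end: int                     # exclusive
    name: Optional[str] = None   # 4th column, if present

    @property
    def length(self) -> int:
        # With a zero-based start and exclusive end, length is end - start.
        return self.end - self.start


def parse_bed_line(line: str) -> BedFeature:
    """Parse the first columns of one tab-delimited BED line."""
    fields = line.rstrip("\n").split("\t")
    if not 3 <= len(fields) <= 12:
        raise ValueError(f"expected 3-12 columns, got {len(fields)}")
    name = fields[3] if len(fields) > 3 else None
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name)


feature = parse_bed_line("chr22\t1000\t1100\texample_exon\n")
print(feature.chrom, feature.start, feature.end, feature.length)
```

In one-based, inclusive coordinates the same feature covers bases 1001 through 1100, which is why mixing the two conventions produces off-by-one errors.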

Data types overview: Tabular / Interval
•  Tab-delimited text file
•  Interval
   •  Each line represents genomic intervals
   •  Zero-based
   •  One line per interval
   •  Each line contains 3-5 columns

Your turn http://usegalaxy.org/galaxy101

Slides @ http://bit.ly/irb-ws


Scaling computation


Manual scaling
•  Explicitly add 1 worker node to your cluster

•  Node type corresponds to node processing capacity

•  Research use of Spot instances

Auto-scaling
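As a rough illustration of what an auto-scaling rule does, the sketch below grows a cluster by one worker when jobs are waiting and shrinks it by one when a worker sits idle, always staying within configured bounds. This is a generic, hypothetical rule written for this example; it is not CloudMan's actual algorithm, and the function and parameter names are assumptions.

```python
def desired_workers(current: int, queued_jobs: int, idle_workers: int,
                    min_workers: int, max_workers: int) -> int:
    """Return the target worker count under a simple, hypothetical rule."""
    if queued_jobs > 0:
        target = current + 1   # work is waiting: add a node
    elif idle_workers > 0:
        target = current - 1   # capacity is unused: remove a node
    else:
        target = current       # busy but keeping up: no change
    # Never leave the configured [min_workers, max_workers] range.
    return max(min_workers, min(max_workers, target))


print(desired_workers(current=2, queued_jobs=5, idle_workers=0,
                      min_workers=1, max_workers=4))
```

With manual scaling the user makes this decision by hand; auto-scaling applies a rule of this kind periodically so the cluster size follows the workload.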

Public / shared data
•  Take a look at the 1000 Genomes data

•  Take a look at AWS Public Datasets

•  More examples exist

•  How to use this freely available data and make new discoveries?

Using an S3 bucket as a data source

Accessing an instance over ssh

Use the terminal (or install Secure Shell for Chrome)

SSH using user ubuntu and the password you chose when launching an instance:

[local machine]$ ssh ubuntu@<instance IP address>

Once logged in

•  You have full system access to your instance, including sudo; use it as any other system

•  galaxy user exists on the system and should be used when manipulating Galaxy (sudo su galaxy)

•  Can submit any jobs via the standard qsub command

Customizing an instance
•  Edit Galaxy’s configuration

$ sudo su galaxy

$ cd /mnt/galaxy/galaxy-app

$ nano universe_wsgi.ini

allow_library_path_paste = True
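The same change can be scripted rather than made in an editor. The sketch below uses Python's standard `configparser` on an in-memory sample; the sample text and the assumption that the option lives in an `[app:main]` section are illustrative, so check the actual file before relying on it. The function name is hypothetical.

```python
import configparser
import io

# Hypothetical excerpt standing in for universe_wsgi.ini.
SAMPLE_INI = """\
[app:main]
# Illustrative content only; the real file has many more options.
allow_library_path_paste = False
"""


def enable_library_path_paste(ini_text: str) -> str:
    """Return the INI text with allow_library_path_paste set to True."""
    # interpolation=None keeps any '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    parser.set("app:main", "allow_library_path_paste", "True")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


updated = enable_library_path_paste(SAMPLE_INI)
print(updated)
```

Note that `configparser` discards comments when it writes a file back, so for a heavily commented configuration file, editing by hand as shown above is usually the safer route.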

Controlling Galaxy
•  Start/stop Galaxy application

•  Add an admin user

•  Use the email you registered with

S3 bucket as a data library

•  Within Galaxy, create a Data Library, using S3 bucket path as the data source (/mnt/workshop-data)

•  This will import all the datasets into the Data Library

•  Import the datasets into a history

Extending the tool palette
•  Galaxy ToolShed = App Store for Galaxy

•  Need to be an Admin to use

•  Browse the Main ToolShed and install needed tool(s)

Sharing-an-Instance
•  Share the entire CloudMan platform

•  Includes all user data and even the customizations

•  Publish a self-contained analysis

•  Make a note of the share-string and send it to your neighbor


Want more tutorials?

genome.edu.au/wiki/Learn

galaxy-tut.genome.edu.au

•  RNA-seq (basic and advanced)

•  Variant detection (basic and advanced)

•  Genome assembly

•  Quality control for small RNA

•  …

Survey

bit.ly/IRBanketa

AWS Credits 3x $100

Valid only for AWS services

Will it truly be useful for your work? Tell us how in one minute!

A (short!) written report on your experience after ~3 months