the galaxy bioinformatics workflow environment
DESCRIPTION
Introduction to the Galaxy environment for workflows in bioinformaticsTRANSCRIPT
![Page 1: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/1.jpg)
GALAXY BIOINFORMATICS WORKFLOW
ENVIRONMENT
Rutger Vos, 3 April 2012
![Page 2: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/2.jpg)
Informatics in the post-genomic era
Overview
![Page 3: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/3.jpg)
The past (?)
Analyses glued together using scripting languages, directly on the CLI or in GUI
Sanger sequencing
Smaller data volumes
Fewer remote data resources
Hypothesis-driven
![Page 4: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/4.jpg)
The present
Graphical or text-based workflow tools
“Next generation” sequencing
Large data sets
Many remote data resources
“Data-driven”
![Page 5: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/5.jpg)
NGS – Roche 454 pyrosequencing
“Emulsion PCR”
Bead with primer in each droplet
Each bead is placed in a well with luciferase
Plate is analyzed by fiber-optic chip
Pyrosequencing Genome Sequencer FLX
![Page 6: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/6.jpg)
NGS – Illumina/Solexa
DNA attaches to primer on slide and is amplified
4 RT-bases are added Camera detects labeled
nucleotides
Next 1-base cycle
Reversible dye-terminator seq HiSeq2000 (BGI)
![Page 7: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/7.jpg)
NGS – IonTorrent
“sequencing by synthesis”
Not light-based, sensor detects H+ ions during synthesis
Longer reads
Ion semiconductor sequencing Ion Torrent PGM
![Page 8: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/8.jpg)
NGS – SOLiD Sequencing
Beads with DNA fragments
Universal adapter attached to fragments
PCR product attaches to slide
Fluorescent probes ligate to the primer
Oligonucleotide ligation and detection SOLiD 5500 Genetic Analyzer
![Page 9: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/9.jpg)
The future
“Next-next generation” sequencing (MinION?)
Smaller data sets? Semantic web?
Back to hypotheses?
![Page 10: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/10.jpg)
Tools for automating bioinformatics analyses
Workflows
![Page 11: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/11.jpg)
Taverna taverna.org.uk
Galaxy usegalaxy.org
eHive ensembl.org/info/docs/eHive
Mobyle mobyle.pasteur.fr
Cipres phylo.org
Yahoo! Pipes pipes.yahoo.com
“make” gnu.org/software/make
Examples of workflow tools
![Page 12: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/12.jpg)
Galaxy
![Page 13: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/13.jpg)
Provenance, histories
Where do the data come from?
How were they altered?
Galaxy tracks the history of data.
Histories can be converted to workflows
![Page 14: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/14.jpg)
Creating a workflow
![Page 15: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/15.jpg)
Workflow editor
![Page 16: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/16.jpg)
Reproducibility
Good science requires that results be reproducible Some analyses are run many times
![Page 17: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/17.jpg)
Sharing
“Standing on the shoulders of giants”
Not re-inventing the wheel
“Executable papers” (doi:10.1101/gr.094508.109)
![Page 18: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/18.jpg)
What data types, expressed in which formats, does Galaxy operate on? How do I get data into and out of Galaxy?
Data
![Page 19: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/19.jpg)
• Sequences • Alignments • Intervals • Tabular data
Data types
![Page 20: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/20.jpg)
File format conversion
Built-in converters between related file formats are provided
Additional converters can be added
![Page 21: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/21.jpg)
Galaxy sequence formats
FASTA – def line, sequence
FASTQ – FASTA + quality - SOLEXA:
ABI/SCF – binary sequence trace (see below) SFF (454) – binary flowgram format
![Page 22: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/22.jpg)
Galaxy alignment formats
MAF – multiple alignment format (see below)
(S|B)AM – text or binary format for reads and ref
AXT – pairwise alignment, from LAV LAV – BLASTZ output, pairwise alignment
![Page 23: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/23.jpg)
Galaxy interval/feature formats
BED chrom, start, end, …
INTERVAL BED with headers
GFF (GFF3) like BED and
interval, but 1-based inclusive
WIG(GLE) dense, continuous-
valued tracks
![Page 24: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/24.jpg)
Other data formats
In addition to interval/feature tabular data as listed previously, other files with similar properties can be processed by some tools: *.txt tab-separated values (e.g. tabular FASTA)
HTML (for additional prose)
LPED/PBED (to describe SNPs, really two files, one for coordinates, other for alleles)
![Page 25: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/25.jpg)
• Upload via FTP • Fetch by URL • Import from data library • Provided by interoperable web service
Data I/O
![Page 26: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/26.jpg)
Upload data using FTP
![Page 27: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/27.jpg)
Fetch data from a URL
![Page 28: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/28.jpg)
Import from data library
![Page 29: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/29.jpg)
Fetch data from proxy service
User submits proxy request to Galaxy
Galaxy forwards request to remote service
Service returns data Galaxy infers data type and presents results
![Page 30: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/30.jpg)
“Big Data”
NGS has led to massive data sets
Data formats are simple, binary, and/or compressed
Still, people drive around with USB hard disks
![Page 31: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/31.jpg)
Data sharing/publishing
The Galaxy platform allows users to publish and share their data, for example as supplemental materials to a publication*
* example: http://genome.cshlp.org/content/19/11/2144
![Page 32: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/32.jpg)
Which operations can I run on Galaxy?
Tools
![Page 33: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/33.jpg)
Galaxy tools
Send and Get data – Upload, fetch, send, submit Data manipulation – Join, sort, filter Format conversion – FASTA, other format operations Statistics – Regressions, simulations, model tests NGS – BAM, FASTQ, SOLiD, 454 file operations RNA analysis – cufflinks, tophat Evolution – branch lengths, NJ, HyPhy
![Page 34: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/34.jpg)
What data can I view in Galaxy?
Visualization
![Page 35: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/35.jpg)
Galaxy visualization - histograms
![Page 36: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/36.jpg)
Galaxy visualization - scatterplots
![Page 37: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/37.jpg)
Galaxy visualization – XY plot
![Page 38: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/38.jpg)
Galaxy visualization – box plot
![Page 39: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/39.jpg)
Galaxy visualization - trackster
![Page 40: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/40.jpg)
How to deploy your analyses on Galaxy, on public servers, in the cloud or locally
Deployment
![Page 41: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/41.jpg)
• “Main” at UseGalaxy.org • NBIC • Others listed on Galaxy wiki
Public servers
![Page 42: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/42.jpg)
Local server
Data close by
Can add your own tools
Can develop Galaxy further
UNIX-based
Complicated install
Many dependencies
UNIX-based
Pro: Con:
![Page 43: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/43.jpg)
Cloud Galaxy
Galaxy can be installed in the (Amazon EC2 cloud) Private data without the hardware hassle Uploading and storing data can be costly, however
![Page 44: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/44.jpg)
How does it work under the hood?
Implementation
![Page 45: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/45.jpg)
Galaxy under the hood
Issues command
1. Parses HTTP request 2. Identifies which tool to use 3. Reads tool description 4. Queues tool 5. Parses result 6. Returns HTML representation of result
![Page 46: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/46.jpg)
Web server
Simple Galaxy installs use a built-in, python-based HTTP server
More robust installs typically use the Apache httpd server
![Page 47: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/47.jpg)
Code base
Most of the framework and the wrapper code is written in python
Some wrappers in other languages, e.g. perl
![Page 48: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/48.jpg)
Interface language
Under the hood, Galaxy executes command-line programs and scripts
Their interfaces and tool tips are described in XML files
![Page 49: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/49.jpg)
Queuing
Jobs are executed asynchronously
Progress is shown in the data browser
On big servers (e.g. “Main”), queuing is managed by the dedicated “Torque” system
![Page 50: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/50.jpg)
UNIX
Galaxy (simply)executes command-line programs within a UNIX-like environment
Galaxy doesn’t have to “know” how to run those programs, it finds out from their descriptions at runtime
![Page 51: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/51.jpg)
Database
Analysis metadata is stored in a database, by default this is SQLite
More robust installs use PostgreSQL
![Page 52: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/52.jpg)
Configuration
Galaxy has many moving parts that can be configured
Configuration is done using simple text-based INI files
![Page 53: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/53.jpg)
Version control
Revision control (or version control) provides unlimited undo and detailed tracking of changes
Galaxy uses Mercurial
Popular now are svn, git and hg
![Page 54: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/54.jpg)
How to get in touch with the world-wide community to get the most out of Galaxy?
Community
![Page 55: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/55.jpg)
Wiki
![Page 56: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/56.jpg)
Mailing lists
@lists.bx.psu.edu: galaxy-user
galaxy-dev
galaxy-announce
galaxy-commits
![Page 57: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/57.jpg)
Tutorials
Galaxy 101: Getting data from UCSC
Performing simple data manipulation
Understanding Galaxy's History system
Creating and editing workflows
Applying workflows to your data
![Page 58: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/58.jpg)
Screencasts
![Page 59: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/59.jpg)
Events
Galaxy Community Conference
ISMB ECCB
BOSC
PAG
Bio-IT World
GMOD Meetings
![Page 60: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/60.jpg)
“Tool shed”
Easy sharing of new tools
Based on Mercurial Turns Galaxy into a
modular ecosystem
![Page 61: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/61.jpg)
CiteULike group
Social citation manager There is a Galaxy group:
citeulike.org/group/16008
Articles are tagged by: PROJECT, ISGALAXY, SHARED,
HOWTO, METHODS, REPRODUCIBILITY, WORKFLOW
![Page 62: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/62.jpg)
Organizations
Development: Penn State University
Emory University
Support: NSF
NHGRI
Power user: NBIC
![Page 63: The Galaxy bioinformatics workflow environment](https://reader033.vdocuments.site/reader033/viewer/2022052522/554e89e4b4c90526358b4950/html5/thumbnails/63.jpg)
Links
UseGalaxy.org – the Main server
GetGalaxy.org – for local installs
UseGalaxy.org/galaxy101 – intro tutorial Galaxy.nbic.nl – Sombrero, the NBIC server
CiteULike.org/group/16008 – references
genome.cshlp.org/content/19/11/2144 – windshield paper
SlideShare.net/rvosa – these slides