TRANSCRIPT
The Future of Galaxy
Nate Coraor
galaxyproject.org
Galaxy is...
• A framework for scientists
• Enables usage of complicated command line tools
• Deals with file formats as transparently as possible
• Provides a rich visualization and visual analytics system
Galaxy is...
• getgalaxy.org
  – Free, open source software
  – Bring your own compute, storage, and tools
  – Maximize privacy and security
• usegalaxy.org/cloud
  – Galaxy cluster in Amazon EC2
  – Buy as much compute and storage as you need
• usegalaxy.org
  – Free, public Galaxy server
  – 3.5 TB of reference data
  – 0.8 PB of user data
  – 4,000+ jobs/day
New Users per Month
[Chart: new users per month, axis 300 to 1,500, January 2010 to January 2013]
usegalaxy.org data growth
[Chart: growth of stored data over time; annotations: "+128 cores for NGS/multicore jobs", "Data quotas implemented..."]
usegalaxy.org frustration growth
[Chart: total jobs completed (count, 0 to 160,000) and jobs deleted before run (% of submitted, 0% to 10%), April 2008 to August 2013]
Where we are
Where we are going
• Continuing work with ECSS to submit jobs to disparate XSEDE resources
• Globus Online endpoint for usegalaxy.org
• Allow users to utilize their XSEDE allocations directly through usegalaxy.org
• Display detailed information about queue position and resource utilization
Massive Scale Analysis
• Improve Galaxy workflow engine and UI
– We can run workflows on single datasets now
– What about hundreds or thousands? (a batch-invocation sketch via the Galaxy API follows below)
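For a sense of what driving a workflow over many datasets looks like from the outside, here is a minimal sketch using BioBlend, the Python client for the Galaxy API. The server URL, API key, and dataset IDs are hypothetical placeholders, and this is just the batch pattern the planned engine and UI work would streamline, not the work itself.

```python
# Minimal sketch: invoking one workflow once per dataset via the Galaxy
# API, using BioBlend (pip install bioblend). The URL, API key, and
# dataset IDs below are hypothetical placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

workflow_id = gi.workflows.get_workflows()[0]["id"]          # pick a workflow
history_id = gi.histories.create_history("batch-run")["id"]  # output history

# "0" is assumed to be the workflow's single input step.
for dataset_id in ["id1", "id2", "id3"]:   # hundreds or thousands in practice
    dataset_map = {"0": {"src": "hda", "id": dataset_id}}
    gi.workflows.run_workflow(workflow_id, dataset_map, history_id=history_id)
```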
Scaling Efforts
• So many tools and workflows, not enough manpower
  – Focus on building infrastructure that lets the community integrate and share tools, workflows, and best practices
• Too much data, not enough infrastructure
  – Support greater access to usegalaxy.org public and user data from local and cloud Galaxy instances
Data Exchange
• A big data store for encouraging data exchange among Galaxy instances
• Galaxy data mirrored in the PSC SLASH2-backed Data Supercell
• Federation
Establishing an XSEDE Galaxy Gateway
XSEDE ECSS Symposium, December 17, 2013
Philip Blood
Senior Computational Scientist
Pittsburgh Supercomputing Center
Galaxy Team:
PSC Team:
643 HiSeqs = 6.5 Pb/year
Using Galaxy to Handle Big Data?
Compartmentalized solutions:
• Private Galaxy installations on campuses
• Galaxy installations on XSEDE (e.g. NCGAS)
• Galaxy installations at other CI/cloud providers (e.g. Globus Genomics)
• Galaxy on public clouds (e.g. Amazon)
The Vision: A United Federation of Galaxies
Ultimately, we envision that any Galaxy instance (in any lab, not just Galaxy Main) will be able to spawn jobs, access data, and share data on external infrastructure whether this is an XSEDE resource, a cluster of Amazon EC2 machines, a remote storage array, etc.
A Step Forward: Make Galaxy Main an XSEDE Galaxy Gateway
• Certain Galaxy Main workflows or tasks will be executed on XSEDE resources
• Especially tasks that require HPC, e.g., sending the de novo assembly applications Velvet (genome) and Trinity (transcriptome) to PSC Blacklight (up to 16 TB of coherent shared memory per process)
• Should be transparent to the user of usegalaxy.org
Key Problems to Solve
• Data Migration: Galaxy currently relies on a shared filesystem between the instance host and the execution server to store the reference and user data required by the workflow. This is implemented via NFS.
• Remote Job Submission: Galaxy job execution currently requires a direct interface with the resource manager on the execution server.
What We’ve Done So Far*
• Addressing data migration issues
  – Established a 10 GigE link between PSC and Penn State
  – Established a common wide-area distributed filesystem between PSC and Penn State using SLASH2 (http://quipu.psc.teragrid.org/slash2/)
• Addressing remote job submission
  – Created a new Galaxy job-running plugin for SSH job submission (a sketch of the idea follows below)
  – Incorporated Velvet and Trinity into Galaxy's XML tool interface
  – Successfully submitted test jobs from Penn State and executed them on Blacklight using data replicated via SLASH2 from Penn State to PSC
*Some of these points will be revisited, since Galaxy is now hosted at TACC
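The SSH job-running plugin itself is not shown in the slides; the following is only a minimal sketch of the idea behind it, submitting a job script to a remote PBS-style resource manager over SSH using Paramiko. The host names, paths, and qsub command are hypothetical, and a real runner would also track job state, stage files, and handle failures.

```python
# Minimal sketch of SSH-based remote job submission, the idea behind the
# Galaxy SSH job-running plugin described above. All host names, paths,
# and commands are hypothetical examples.
import paramiko

def submit_remote_job(host, user, keyfile, script_path):
    """Submit a job script to a remote PBS-style resource manager via SSH."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, key_filename=keyfile)
    try:
        # qsub prints the new job's ID on stdout, e.g. "12345.blacklight"
        stdin, stdout, stderr = client.exec_command(f"qsub {script_path}")
        job_id = stdout.read().decode().strip()
        if not job_id:
            raise RuntimeError(stderr.read().decode())
        return job_id
    finally:
        client.close()

# Hypothetical usage:
# job_id = submit_remote_job("blacklight.example.edu", "galaxy",
#                            "~/.ssh/id_rsa", "/galaxys2/jobs/velvet_001.sh")
```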
Galaxy Remote Data Architecture
[Diagram: Galaxy Main and PSC, each with data generation and processing nodes, both mounting /galaxys2 on the SLASH2 wide-area common filesystem (GalaxyFS)]
• Access to the shared dataset via /galaxys2 is identical from Galaxy Main and PSC
• The SLASH2 filesystem handles consistency, multiple-residency coherency, and presence
• Local copies are maintained for performance
• Jobs run on PSC compute resources such as Blacklight, as well as on Galaxy Main
Galaxy Main Gateway: What Remains to Be Done (1)
• Integrate this work with the production public Galaxy site, usegalaxy.org (now hosted at TACC)
• Dynamic job submission: allowing the selection of appropriate remote or local resources (cores, memory, walltime, etc.) based on individual job requirements, possibly using an Open Grid Services Architecture Basic Execution Service (OGSA-BES) compatible service such as UNICORE (a sketch of a dynamic routing rule follows below)
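Galaxy's dynamic job destination mechanism lets a Python rule function pick a destination per job; below is a minimal sketch of such a rule, routing large-input jobs to a remote resource. The destination ids and the size threshold are hypothetical, and real selection logic would also weigh cores, memory, and walltime as the bullet above describes.

```python
# Minimal sketch of a Galaxy-style dynamic job rule: send jobs with large
# inputs to a remote HPC destination, everything else to a local cluster.
# The destination ids and threshold are hypothetical examples.
LARGE_INPUT_BYTES = 10 * 1024**3  # 10 GiB, an arbitrary cutoff

def choose_destination(job):
    """Return a job destination id based on the job's total input size."""
    total = sum(
        inp.dataset.get_size()
        for inp in job.input_datasets
        if inp.dataset is not None
    )
    if total > LARGE_INPUT_BYTES:
        return "blacklight_ssh"  # hypothetical remote XSEDE destination
    return "local_cluster"       # hypothetical default local destination
```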
What Remains to Be Done (2)
• Galaxy-controlled data management: intelligently and efficiently migrate and use data on distributed compute resources
  – Testing various data migration strategies with SLASH2 and other available technologies
  – Further developing SLASH2 to meet federated Galaxy requirements through a recent NSF DIBBs award at PSC
• Authentication with Galaxy instances: using XSEDE or other credentials, e.g., InCommon/CILogon (see upcoming talk by Indiana)
• Additional data transfer capabilities in Galaxy: such as iRODS and Globus Online (see upcoming talk on Globus Genomics)
Eventually: Use These Technologies to Enable Universal Federation
Appendix
• Initial Galaxy Data Staging to PSC
• Underlying SLASH2 Architecture
Initial Galaxy Data Staging to PSC
[Diagram: Penn State data generation nodes and storage connected over a 10 GigE link to the PSC Data SuperCell]
• Transferred 470 TB in 21 days from PSU to PSC (average ~22 TB/day; peak 40 TB/day)
• rsync used for the initial staging and to synchronize subsequent updates (see the sketch below)
• A copy of the data is maintained at PSC in the /arc filesystem, available from compute nodes
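The slides do not show the actual transfer commands; purely as an illustration, a resumable bulk sync of this kind could be scripted around rsync as sketched below. The hosts and paths are hypothetical placeholders.

```python
# Rough illustration of scripting a resumable bulk rsync sync, the kind of
# staging transfer described above. Hosts and paths are hypothetical.
import subprocess
import sys

SRC = "/galaxy/files/"                                # dataset tree at PSU
DEST = "transfer@psc.example.org:/arc/galaxy/files/"  # staging area at PSC

def sync():
    """Run one rsync pass; --partial lets interrupted transfers resume."""
    cmd = [
        "rsync",
        "--archive",   # preserve permissions, times, and symlinks
        "--partial",   # keep partially transferred files for resume
        "--compress",
        "--stats",
        SRC,
        DEST,
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(sync())
```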
Underlying SLASH2 Architecture
[Diagram: clients connect to a metadata server (MDS) and multiple I/O servers (IOS); READ and WRITE go to the I/O servers, all other file operations (RENAME, SYMLINK, etc.) go to the MDS]
• One I/O server at Galaxy Main and one at PSC for performance
• The MDS converts pathnames to object IDs and schedules updates when copies become inconsistent
• A consistency protocol avoids incoherent data
• Residency and network scheduling policies are enforced
• Clients are compute resources and dedicated front ends
• Dataset residency requests are issued by administrators and/or users
• I/O servers are very lightweight and can use most backing filesystems (ZFS, ext4, etc.)
• Funded by the National Science Foundation
  1. Large memory clusters for assembly
  2. Bioinformatics consulting for biologists
  3. Optimized software for better efficiency
• Collaboration across IU, TACC, SDSC, and PSC
• Open for business at: http://ncgas.org
Making it easier for Biologists
• Web interface to NCGAS resources
• Supports many bioinformatics tools
• Available for both research and instruction
[Chart: tools ranging from common to rare plotted against computational skills from low to high]
GALAXY.NCGAS.ORG Model
[Diagram: Galaxy.ncgas.org runs on virtual box hosting; the host for each tool is configured individually across the Quarry and Mason clusters, with the Data Capacitor and Archive for storage]
• NCGAS establishes tools, hardens them, and moves them into production
• Individual projects can get duplicate boxes, provided they support them themselves
[Diagram: sequencing centers connect over 10 Gbps and 100 Gbps links to NCGAS Mason (free for NSF users), the IU POD (12 cents per core hour), and the Data Capacitor (no data storage charges)]
Moving Forward
• Other NCGAS XSEDE resources...
• Lustre WAN file system
• Globus Online and other tools
• Optimized software
[Chart: NCGAS Galaxy usage in 2013, core hours per month, roughly 0 to 4,500, January through November]
CILogon Authentication for Galaxy
Dec. 17, 2013
Goals and Approaches
NCGAS Authentication Requirements:
• XSEDE users can authenticate with NCGAS Galaxy through InCommon credentials.
• Only NCGAS-authorized users can authenticate and use the resource.
The CILogon Service (http://www.cilogon.org) allows users to authenticate with their home organization and obtain a certificate for secure access to cyberinfrastructure. It supports the MyProxy OAuth protocol for certificate delegation, enabling science gateways to access CI on a user's behalf.
Approach: incorporate CILogon as external user authentication for Galaxy, with a simple home-brewed authorization mechanism.
Technical Challenges
• The CILogon OAuth client implementation is Java, while Galaxy is Python; Python lacks full-featured OAuth libraries supporting the RSA-SHA1 signature method required by CILogon's OAuth interface (a signing sketch follows below).
• Once authenticated through CILogon, the remote username needs to be forwarded to Galaxy via the Apache proxy.
• Additional authorization is required for CILogon-authenticated users.
• Some of the default CILogon IdPs, including OpenID providers (Google, PayPal, VeriSign), are not desired.
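As noted above, at the time no suitable Python library supported RSA-SHA1 signing, which is why the project used PHP. For illustration only, the signing step they needed looks roughly like the sketch below with the present-day oauthlib package (which supports RSA signatures when PyJWT and cryptography are installed); the consumer key, key file, and URL are hypothetical, not CILogon's real values.

```python
# Illustration of OAuth 1.0a RSA-SHA1 request signing with oauthlib.
# The consumer key, key file, and endpoint URL are hypothetical.
from oauthlib import oauth1

with open("gateway_private_key.pem") as f:   # hypothetical RSA private key
    rsa_key = f.read()

client = oauth1.Client(
    client_key="my-gateway",                 # hypothetical consumer key
    signature_method=oauth1.SIGNATURE_RSA,   # RSA-SHA1 signing
    rsa_key=rsa_key,
)

# Produce a signed request; the URI is a placeholder, not a real endpoint.
uri, headers, body = client.sign("https://example.org/oauth/initiate")
print(headers["Authorization"])
```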
Architecture
[Diagram: authentication flow through the Apache web server, using a PHP CILogon OAuth client and passing the identity to Galaxy via HTTP_COOKIE]
Technical Highlights
• PHP (non-Java) implementation of the CILogon OAuth client
• Configure the Apache proxy to Galaxy:
  – Enable Galaxy external user authentication (universe_wsgi.ini)
  – Configure Apache for proxy forwarding (httpd ssl.conf)
  – Configure Apache for CILogon authentication with HTTP_COOKIE rewrite (httpd ssl.conf)
• Customized NCGAS skin limiting IdPs to InCommon academic institutions
• PHP implementation of simple file-based user authorization (illustrated below)
• Lightweight, packaged for general Galaxy installation
• Open source; more details at: http://sourceforge.net/p/ogce/svn/HEAD/tree/Galaxy/
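The authorization component above is PHP; purely to illustrate the idea, here is a comparable file-based check sketched in Python. The authorized-users file format and path are hypothetical, not the project's actual layout.

```python
# Python sketch of simple file-based user authorization, like the PHP
# component described above (this is not the project's code). Assumed
# file format: one remote username per line, '#' starts a comment.
AUTHORIZED_USERS_FILE = "/etc/galaxy/authorized_users.txt"  # hypothetical

def load_authorized_users(path=AUTHORIZED_USERS_FILE):
    """Read the set of authorized usernames from a flat file."""
    users = set()
    with open(path) as handle:
        for line in handle:
            name = line.split("#", 1)[0].strip()
            if name:
                users.add(name)
    return users

def is_authorized(remote_user):
    """Check the username forwarded by the Apache proxy (e.g. REMOTE_USER)."""
    return remote_user in load_authorized_users()
```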
Experiences in building a next-generation sequencing analysis service using Galaxy, Globus, and Amazon Web Services
Ravi K Madduri
Argonne National Lab and University of Chicago
www.globus.org/genomics
Globus Genomics Architecture
• Integrated identity management, group management, and data movement using Globus
• Computational profiles for various analysis tools
• Resources can be provisioned on demand with Amazon Web Services cloud-based infrastructure
• GlusterFS as a shared filesystem between head nodes and compute nodes
• Provisioned IOPS on EBS (see the sketch below)
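As a concrete illustration of the provisioned-IOPS point, the sketch below creates an EBS volume with provisioned IOPS using boto3. The size, IOPS value, region, and availability zone are arbitrary examples, not Globus Genomics' actual configuration.

```python
# Illustration of provisioning an EBS volume with provisioned IOPS via
# boto3. Size, IOPS, region, and availability zone are arbitrary examples.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,            # volume size in GiB
    VolumeType="io1",    # provisioned-IOPS SSD volume type
    Iops=4000,           # IOPS to provision for the volume
)
print("created volume", volume["VolumeId"])
```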
Globus Genomics Solution Description
Globus Genomics Usage
1 Computation Institute, University of Chicago, Chicago, IL, USA; 2 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA; 3 Section of Genetic Medicine, University of Chicago, Chicago, IL.
Challenges in Next-Gen Sequencing Analysis
High Performance, Reusable Consensus Calling Pipeline
Parallel Workflows on Globus Genomics
Example User – Cox Lab
Globus Genomics Pricing
Acknowledgments
• This work was supported in part by the NIH through the NHLBI grant The Cardiovascular Research Grid (R24HL085343) and by the U.S. Department of Energy under contract DE-AC02-06CH11357. We are grateful to Amazon, Inc., for an award of Amazon Web Services time that facilitated early experiments.
• The Globus Genomics and Globus Online teams at the University of Chicago and Argonne National Laboratory
For more information
• More information on Globus Genomics and to sign up: www.globus.org/genomics
• More information on Globus Online: www.globus.org
• Questions?
• Thank you!