

The National Center for Genomic Analysis Support: creating a national cyberinfrastructure environment for genomics researchers.

William Barnett, Thomas G. Doak, Le-Shin Wu, Carrie L. Ganote

Indiana University

Poster sections: Services Offered; Cyberinfrastructure and Architecture; User Feedback; BLAST on the OSG; Original Concept; Grid + Cluster Concept

Software supported on the Mason cluster as of October 2014 [5]

abyss         fastqc        oases
allpathslg    galaxy        picard
amos          gatk          raxml
arachne       genomemapper  rsem
bedtools      gmap          sam2counts
bio3d         hmmer         samtools
bioconductor  khmer         scythe
blat          macs          shore
bowtie        maker         smrt
bwa           metamos       soapdenovo
cd-hit        mlrho         sra-toolkit
celera        mothur        stacks
cufflinks     mummer        tophat
cutadapt      ninja         transabyss
cytoscape     namd          trinityrnaseq
edena         novoalign     velvet

• Bioinformatics consulting – including advice on library preparation, choice of assembly software, and recommended parameters.

• Cyberinfrastructure – the hardware and system support to manage and analyze genomics data at scale.

• Archival data storage – archival tape storage for long-term safe deposit of final results and raw data. In addition, the IUScholarWorks repository can be linked to the archived data, providing a convenient link for access to raw or supplementary data.

• Curated software support – popular software tools are installed, optimized, and maintained on IU machines; see the list of software supported on the Mason cluster [5].

• Genome Browser deployment – a genome browser loaded with your data and hosted by NCGAS (pictured on the poster: JBrowse [6] on the IQ Wall).

• Graphical interfaces for bioinformatics tools – Galaxy and GenePattern, two web-based portals for bioinformatics analysis deployed by NCGAS. The Trinity assembler has a web portal in its own version of Galaxy.

• Bioinformatics consulting – the largest perceived need among the participants of this study was grant-funded bioinformatics consulting support.

• Data storage and movement – after consulting support, data handling was the next obstacle where participants indicated help would be important. This includes the long- and short-term storage of data, as well as the movement of large data sets.

• Cyberinfrastructure – high performance computing with sufficient processing power and memory is another area where researchers would find help useful.

• Curated software support – participants listed curated software support (published applications that are installed and maintained for them) among the services they need.

• Mason large memory cluster – with 512 GB of memory and 32 cores in each of its 18 nodes, the Mason cluster is a real workhorse of bioinformatics analysis.

• Open Science Grid – with a highly distributed grid architecture, the OSG provides opportunistic cycles, allowing a user to potentially run thousands of tasks at once.

• XSEDE – the Extreme Science and Engineering Discovery Environment awards allocations on some of the country’s fastest and largest supercomputers.

• Data storage and movement – an optional 50 TB allocation is available to NCGAS users on the 15/7PB Scholarly Data Archive for archiving. The Data Capacitor 2 is IU’s 5 PB high performance file system, tuned for fast reading and writing of large files. These systems are tied into the 100GigE Internet2 backbone.
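
As a back-of-the-envelope check on the figures quoted above, the short sketch below totals the Mason cluster's memory and cores and estimates a best-case time to move a full 50 TB allocation over a 100 Gbit/s link. It assumes decimal units (1 TB = 10^12 bytes), a dedicated link, and no protocol or storage overhead, so the transfer time is a lower bound rather than an expected value. The function names are illustrative only.

```python
# Figures quoted on the poster.
MASON_NODES = 18
MEMORY_PER_NODE_GB = 512
CORES_PER_NODE = 32
ALLOCATION_TB = 50
LINK_GBIT_PER_S = 100  # nominal 100GigE line rate; real transfers are slower


def mason_totals():
    """Return (total memory in GB, total cores) across all Mason nodes."""
    return MASON_NODES * MEMORY_PER_NODE_GB, MASON_NODES * CORES_PER_NODE


def ideal_transfer_seconds(terabytes, gbit_per_s):
    """Best-case transfer time: decimal units, no overhead, unshared link."""
    bits = terabytes * 1e12 * 8
    return bits / (gbit_per_s * 1e9)


if __name__ == "__main__":
    memory_gb, cores = mason_totals()
    seconds = ideal_transfer_seconds(ALLOCATION_TB, LINK_GBIT_PER_S)
    print(f"Mason: {memory_gb} GB of memory, {cores} cores in total")
    print(f"{ALLOCATION_TB} TB at line rate: at least {seconds / 60:.0f} minutes")
```

Sustained throughput on a shared backbone is usually well below line rate, so the estimate is best read as a floor when planning data movement.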

BLAST [1] is an essential bioinformatics tool, heavily used in genomics to infer homology between sequences. BLAST treats each input sequence as a separate entity and can therefore be run in a highly parallel fashion, which makes it an ideal target for running on a grid.
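
Because each query sequence is searched independently, a large input can be cut into smaller FASTA files that run as separate jobs. The following is a minimal sketch of that splitting step, not the code used by the NCGAS tool; the function names and the chunk size are assumptions chosen for illustration.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records, per_chunk):
    """Group records into lists of at most per_chunk entries."""
    if per_chunk < 1:
        raise ValueError("per_chunk must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, per_chunk))
        if not batch:
            return
        yield batch


def to_fasta(batch):
    """Render one batch of (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in batch)


if __name__ == "__main__":
    # Hypothetical three-sequence query file, split two sequences per job.
    example = ">q1\nACGT\nACGT\n>q2\nGGCC\n>q3\nTTAA\n".splitlines()
    for index, batch in enumerate(split_records(read_fasta(example), 2)):
        print(f"--- chunk_{index}.fasta ---")
        print(to_fasta(batch), end="")
```

Each chunk can then be searched on its own and the per-chunk results concatenated afterwards. How many sequences to put in a chunk is a tuning decision that trades scheduling overhead against parallelism.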

A typical Galaxy setup showing the different connections machines use to communicate, handle data, and send jobs.

The Mason Cluster in IU’s Data Center

JBrowse [6] on the IQ Wall, Cyberinfrastructure Building, Bloomington, IN

The BLAST on OSG tool viewed through the Galaxy [2-4] interface

These diagrams show two communication setups between the Galaxy server and the job, which runs through HTCondor on the Open Science Grid.
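
To make the HTCondor side more concrete, the sketch below renders a submit description that queues one job per query chunk. It is an illustration under stated assumptions, not the configuration NCGAS uses: the wrapper script `run_blast.sh`, the `chunk_N.fasta` naming, and the log file name are hypothetical, and the sketch assumes the wrapper locates the BLAST program and database on the worker node by itself. The submit commands themselves (`executable`, `arguments`, `transfer_input_files`, `should_transfer_files`, `when_to_transfer_output`, `output`, `error`, `log`, `queue`) and the `$(Process)` macro are standard HTCondor.

```python
def render_submit(n_chunks, wrapper="run_blast.sh", prefix="chunk"):
    """Build an HTCondor submit description with one job per input chunk.

    The wrapper name and file prefix are hypothetical placeholders.
    $(Process) is expanded by HTCondor to 0, 1, ..., n_chunks - 1.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    lines = [
        "universe = vanilla",
        f"executable = {wrapper}",
        # The wrapper receives the input chunk and the desired output name.
        f"arguments = {prefix}_$(Process).fasta {prefix}_$(Process).tsv",
        # Ship only the chunk that this particular job needs.
        f"transfer_input_files = {prefix}_$(Process).fasta",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"output = {prefix}_$(Process).out",
        f"error = {prefix}_$(Process).err",
        "log = blast.log",
        f"queue {n_chunks}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_submit(3), end="")
```

Writing the returned text to a file and passing it to `condor_submit` would queue the jobs. Staging the BLAST database on remote nodes is the harder part on an opportunistic grid and is deliberately left out of this sketch.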

NCGAS conducted two recent surveys to assess the needs of genomics researchers. The first survey was addressed to NCGAS users, and the second went to NSF-funded biologists. Results from the second survey include:

• On a 1-5 Likert scale, where 1 is “very dissatisfied” and 5 is “very satisfied,” the average overall score for NCGAS services was 4.4 ± 1.4 (95% confidence interval).

• 63% of respondents indicated, “I could not have done my research without NCGAS,” while another 30% indicated NCGAS was helpful to completing their research.

• Common comments from participants included requests for better data handling, documentation and training, and more NCGAS personnel.
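
The poster does not say how the interval around the average score was computed. One common approach, shown here as a sketch, is a normal-approximation 95% confidence interval: the sample mean plus or minus 1.96 standard errors. The responses in the example are invented for illustration and are not the survey data.

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, half-width) of a normal-approximation confidence interval.

    z = 1.96 corresponds to a 95% interval. For small samples a Student t
    critical value would give a wider, more appropriate interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two responses")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return mean, z * standard_error


if __name__ == "__main__":
    # Hypothetical 1-5 Likert responses, NOT the NCGAS survey results.
    responses = [5, 5, 4, 5, 3, 4, 5, 4]
    mean, half_width = mean_with_ci(responses)
    print(f"average score: {mean:.2f} ± {half_width:.2f} (95% CI)")
```

Likert responses are ordinal, so a mean with a symmetric interval is a convenient summary rather than a rigorous one. Reporting the full response distribution alongside it avoids over-reading the average.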

Results from NCGAS User Survey

Results from NSF Survey

References

http://ncgas.org

[1] BLAST: Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990). "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

[2] Galaxy: Goecks, J., Nekrutenko, A., Taylor, J. and The Galaxy Team. "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome Biol. 2010 Aug 25;11(8):R86.

[3] Galaxy: Blankenberg, D., Von Kuster, G., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., Nekrutenko, A., Taylor, J. "Galaxy: a web-based genome analysis tool for experimentalists." Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.

[4] Galaxy: Giardine, B., Riemer, C., Hardison, R.C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., Miller, W., Kent, W.J., Nekrutenko, A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.

[5] PY3: Barnett, William K.; Stewart, Craig A. (2014). National Center for Genome Analysis Program Year 3 Report – September 15, 2013 – September 14, 2014. http://hdl.handle.net/2022/18513

[6] JBrowse: Skinner, M.E., Uzilov, A.V., Stein, L.D., Mungall, C.J., and Holmes, I.H. (2009). "JBrowse: a next-generation genome browser." Genome Research, 19(9):1630-1638.