ASaiM Documentation Release 0.1 May 24, 2018


Contents

1 A framework built on the shoulders of giants

2 . . . Dedicated to microbiota analyses
    2.1 Installation and use
        2.1.1 Requirements
        2.1.2 Launching ASaiM
        2.1.3 Workflows
        2.1.4 Databases
        2.1.5 Interactive session
        2.1.6 Data
        2.1.7 Usage of ASaiM
        2.1.8 Users & Passwords
        2.1.9 Stopping ASaiM
        2.1.10 Installation of the tools, workflows and tours on an existing Galaxy instance
            2.1.10.1 Tools
            2.1.10.2 Workflows
            2.1.10.3 Tours
    2.2 Tutorials and interactive tours
        2.2.1 Tutorials
        2.2.2 Interactive tours
    2.3 Tools
        2.3.1 File and meta tools
            2.3.1.1 Data retrieval
            2.3.1.2 Text manipulation
            2.3.1.3 Sequence file manipulation
            2.3.1.4 BAM/SAM file manipulation
            2.3.1.5 BIOM file manipulation
        2.3.2 Genomics tools
            2.3.2.1 Quality control
            2.3.2.2 Clustering
            2.3.2.3 Sorting and Prediction
            2.3.2.4 Similarity search and alignment
            2.3.2.5 Mapping
        2.3.3 Microbiota dedicated tools
            2.3.3.1 Metagenomics data manipulation
            2.3.3.2 Assembly
            2.3.3.3 Metataxonomic sequence analysis
            2.3.3.4 Taxonomy assignation on WGS sequences
            2.3.3.5 Metabolism assignation
            2.3.3.6 Combination of functional and taxonomic results
    2.4 Workflows
        2.4.1 Analysis of raw metagenomic or metatranscriptomic shotgun data
        2.4.2 Assembly of metagenomic data
        2.4.3 Analysis of metataxonomic data
        2.4.4 Running as in EBI metagenomics
    2.5 Validation

Bibliography


The new generation of sequencing platforms, coupled with numerous bioinformatics tools, has led to rapid technological progress in metagenomics and metatranscriptomics to investigate complex microorganism communities. Nevertheless, a combination of different bioinformatics tools remains necessary to draw conclusions from microbiota studies. Modular and user-friendly tools would greatly improve such studies.

We therefore developed ASaiM, an Open-Source Galaxy-based framework dedicated to microbiota data analyses. ASaiM offers sophisticated analyses to scientists without command-line knowledge. ASaiM provides a powerful framework to easily and quickly explore microbiota data in a reproducible and transparent environment.


CHAPTER 1

A framework built on the shoulders of giants

To develop a modular, accessible, redistributable, sharable and user-friendly framework for scientists, we developed ASaiM using

• Galaxy as the foundation

Galaxy is a lightweight environment providing a simple graphical interface to bioinformatics tools, while automatically managing computation and data details. It improves the usability and reproducibility of biological studies.

• Galaxy ToolShed, BioBlend and Ephemeris to install the tools, the workflows and the databases inside the Galaxy environment

• Conda to install the tools and their dependencies

• Docker to containerize and ship everything


CHAPTER 2

. . . Dedicated to microbiota analyses

ASaiM integrates a comprehensive set of microbiota-related tools together with predefined and tested workflows dedicated to microbiota analyses.

Tools, workflows and ASaiM are supported by training material and documentation.

2.1 Installation and use

The ASaiM framework uses the Galaxy Docker image to ease the deployment and customization of the Galaxy instance.

2.1.1 Requirements

To use the ASaiM framework, Docker is required.

For Linux users and people familiar with the command line, please follow the very good instructions from the Docker project.

Non-Linux users are encouraged to use Kitematic, a graphical user interface for managing Docker containers.

A video made for the Galaxy RNA workbench shows how to use Kitematic for Galaxy Docker.

The databases used by HUMAnN2 are quite big: we recommend having at least 100 GB of free disk space.
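The disk space recommendation can be checked from the shell before pulling the image. The following is a minimal sketch, assuming a POSIX `df` and `awk`; the helper name `has_free_space` is hypothetical, and the path to check is an assumption that depends on where Docker stores its data on your system.

```shell
# Succeed if the filesystem holding PATH has at least REQUIRED_GB gigabytes free.
# has_free_space is a hypothetical helper, not an ASaiM or Docker command.
has_free_space() {
    path="${1:-.}"
    required_gb="${2:-100}"
    # df -P prints one POSIX-format line per filesystem, -k reports 1024-byte blocks;
    # the fourth field of the second line is the available space.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    required_kb=$((required_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$required_kb" ]
}

if has_free_space . 100; then
    echo "At least 100 GB free on this filesystem"
else
    echo "Less than 100 GB free on this filesystem"
fi
```

Replace `.` with the directory that will actually hold the Docker data or the exported Galaxy folder.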

2.1.2 Launching ASaiM

1. Starting the ASaiM Docker container: analogous to starting the generic Galaxy Docker image:

$ docker run -d -p 8080:80 quay.io/bebatut/asaim-framework

Nevertheless, here is a quick rundown:

• docker run starts the Image/Container

In case the Container is not already stored locally, Docker downloads it automatically


• The argument -p 8080:80 makes the port 80 (inside of the container) available on port 8080 on your host

Inside the container an Apache web server is running on port 80 and that port can be bound to a local port on your host computer. With this parameter you can access your Galaxy instance via http://localhost:8080 immediately after executing the command above

• quay.io/bebatut/asaim-framework is the Image/Container name, which directs Docker to the correct path in the Docker index

• -d will start the Docker container in Daemon mode.

A detailed discussion of Docker’s parameters is given in the Docker manual. It is really worth reading

Once the Docker container is running, Galaxy is launched.

Setting up Galaxy and its components can take several minutes. You can inspect the startup state using:

$ docker ps # to obtain the id of the container
$ docker logs <container_id>

The previous commands will start the ASaiM framework with the configuration and launch of a Galaxy instance and its population with the needed tools, workflows and databases. The instance will be accessible at http://localhost:8080

2. Installation of the databases once Galaxy is running

$ docker exec <container_id> ./run_data_managers
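Because setting up Galaxy can take several minutes, it can help to wait until the instance answers before running the data managers. The following is a minimal sketch, assuming `curl` is installed on the host; the function name `wait_for_galaxy` and its defaults (60 attempts, 10 seconds apart) are illustrative choices, not part of ASaiM.

```shell
# Poll a URL until it answers successfully, or give up after a number of attempts.
# Arguments: URL, maximum number of attempts, delay between attempts in seconds.
# wait_for_galaxy is a hypothetical helper name.
wait_for_galaxy() {
    url="${1:-http://localhost:8080}"
    attempts="${2:-60}"
    delay="${3:-10}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl fail on HTTP errors, -s silences the progress output
        if curl -fs -o /dev/null "$url"; then
            echo "Galaxy is answering at $url"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "Galaxy did not answer at $url after $attempts attempts" >&2
    return 1
}
```

Called as `wait_for_galaxy http://localhost:8080`, it returns 0 once the page is served, after which the `docker exec` command of step 2 can be run. An HTTP answer only shows that the web server responds, so the output of `docker logs` remains the reference for the state of the startup.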

2.1.3 Workflows

To access the workflows, you need to connect with the admin user (username: admin@galaxy.org, password: admin). You will then have access to the workflows in the ‘Workflow’ section (Top panel)

2.1.4 Databases

Databases are automatically added to the Galaxy instance for MetaPhlAn2, HUMAnN2 and QIIME.

Sometimes the databases are not correctly seen by the tools. If that is the case, you need to force the connection between the tool and the database:

• Connect with the admin user:

– username admin@galaxy.org

– password admin

• Go to the ‘Admin’ section (Top panel)

• Go to ‘Local data’ section (Left panel)

• Click on humann2_nucleotide_database, humann2_protein_database or metaphlan2_database (depending on the database)

• Click on the ‘Reload’ button at the top

The table should then be filled

If you want other databases for HUMAnN2 or QIIME, you can install them “manually”:

• Connect with the admin user:

– username admin@galaxy.org


– password admin

• Go to the ‘Admin’ section (Top panel)

• Go to ‘Local data’ section (Left panel)

• Choose the database you want to import

2.1.5 Interactive session

For an interactive session, you can execute:

$ docker run -i -t -p 8080:80 quay.io/bebatut/asaim-framework /bin/bash

and then manually invoke the startup script to start PostgreSQL, Apache and Galaxy and download the needed databases.

For a more specific configuration, you can have a look at the documentation of the Galaxy Docker Image.

2.1.6 Data

Docker images are “read-only”. All changes made during one session are lost after restart. This mode is useful to present ASaiM to your colleagues or to run workshops with it.

To install Tool Shed repositories or to save your data, you need to export the computed data to the host computer. Fortunately, this is as easy as:

$ docker run -d -p 8080:80 -v /home/user/galaxy_storage/:/export/ quay.io/bebatut/asaim-framework

Given the additional -v /home/user/galaxy_storage/:/export/ parameter, Docker will mount the folder /home/user/galaxy_storage into the Container under /export/. A startup.sh script, which usually starts Apache, PostgreSQL and Galaxy, will recognize the export directory, with one of the following outcomes:

• In case of an empty /export/ directory, it will move the PostgreSQL database, the Galaxy database directory, Shed Tools and Tool Dependencies and various configuration scripts to /export/ and symlink back to the original location.

• In case of a non-empty /export/, for example if you continue a previous session within the same folder, nothing will be moved, but the symlinks will be created.

This enables you to have different export folders for different sessions, meaning real separation of your different projects.
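Keeping one export folder per project can be sketched with a small helper that only prints the command to run, so nothing is started by accident. The folder names, the second host port 8081 and the function name `asaim_run_command` are hypothetical; only the `docker run` options come from the command above.

```shell
# Print the docker run command for one project.
# Arguments: host folder to mount under /export/, host port mapped to container port 80.
# asaim_run_command is a hypothetical helper name.
asaim_run_command() {
    export_dir="$1"
    host_port="$2"
    echo "docker run -d -p ${host_port}:80 -v ${export_dir}:/export/ quay.io/bebatut/asaim-framework"
}

# Two projects, each with its own folder and its own port on the host
asaim_run_command /home/user/project_a/ 8080
asaim_run_command /home/user/project_b/ 8081
```

Each printed line can be copied into a terminal. Using a different host port per project lets both instances run at the same time.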

2.1.7 Usage of ASaiM

The previous commands will start the ASaiM framework with the configuration and launch of a Galaxy instance and its population with the needed tools, workflows and databases. The instance will be accessible at http://localhost:8080.

2.1.8 Users & Passwords

The Galaxy Admin User has the username admin@galaxy.org and the password admin.

The PostgreSQL username is galaxy, the password galaxy and the database name galaxy. If you want to create new users, please make sure to use the /export/ volume. Otherwise your user will be removed after your Docker session is finished.


2.1.9 Stopping ASaiM

Once you are done with the ASaiM framework, you can kill the container, for example with docker kill and the container id reported by docker ps.

The image corresponding to the container will remain stored locally. If you want to fully clean your Docker engine, you can follow the Docker Cleanup Commands.
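Stopping and removing the container can be sketched as follows, assuming the standard Docker CLI subcommands `docker ps` (with its `ancestor` filter), `docker kill` and `docker rm`. The function name `stop_asaim` is hypothetical, and the sketch assumes the container was started from the image name used in this section.

```shell
# Kill and remove every running container started from the ASaiM image.
# stop_asaim is a hypothetical helper name; the image name is the one used above.
stop_asaim() {
    image="${1:-quay.io/bebatut/asaim-framework}"
    # -q prints only container ids; the ancestor filter selects containers by image
    ids=$(docker ps -q --filter "ancestor=${image}")
    if [ -z "$ids" ]; then
        echo "No running container for ${image}"
        return 0
    fi
    for id in $ids; do
        docker kill "$id"   # stop the container immediately
        docker rm "$id"     # remove the stopped container
    done
}
```

Anything that was not exported through the /export/ volume is lost once the container is removed.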

2.1.10 Installation of the tools, workflows and tours on an existing Galaxy instance

The tools, workflows and tours available for ASaiM can be easily installed on any existing Galaxy instance.

The first step is to install Ephemeris:

$ conda install ephemeris

2.1.10.1 Tools

1. Download the YAML files

• asaim_tools_1.yaml

• asaim_tools_2.yaml

• asaim_tools_3.yaml

2. Install the tools (for each of the three files)

$ shed-install -t <YAML file path> -a <your API key> --galaxy <URL of the Galaxy instance>
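Running the command once per YAML file can be sketched as a loop. This is only an illustration: the function name `install_asaim_tools` is hypothetical, and the `shed-install` options are exactly those shown above.

```shell
# Run shed-install once for each YAML file given on the command line.
# Arguments: Galaxy URL, API key, then one or more YAML files.
# install_asaim_tools is a hypothetical helper name.
install_asaim_tools() {
    galaxy_url="$1"
    api_key="$2"
    shift 2
    for yaml in "$@"; do
        echo "Installing tools listed in ${yaml}"
        # Stop at the first failure so that errors are not hidden by later files
        shed-install -t "$yaml" -a "$api_key" --galaxy "$galaxy_url" || return 1
    done
}
```

For the three files above, a call could look like `install_asaim_tools http://localhost:8080 "$API_KEY" asaim_tools_1.yaml asaim_tools_2.yaml asaim_tools_3.yaml`, where `API_KEY` is assumed to hold your Galaxy API key.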

2.1.10.2 Workflows

1. Download the workflow files

• Shotgun data analysis

• QIIME Illumina overview

• EBI Metagenomics V3.0

2. Install the workflows (one by one)

$ workflow-install --workflow_path <GA file path> -a <your API key> --galaxy <URL of the Galaxy instance>
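Installing the workflows one by one can be sketched as a loop over the downloaded files. The sketch assumes the workflow files were saved with a `.ga` extension in a single directory; the function name `install_asaim_workflows` is hypothetical, and the `workflow-install` options are those shown above.

```shell
# Run workflow-install once for each .ga file found in a directory.
# Arguments: Galaxy URL, API key, directory containing the downloaded .ga files.
# install_asaim_workflows is a hypothetical helper name.
install_asaim_workflows() {
    galaxy_url="$1"
    api_key="$2"
    workflow_dir="$3"
    for ga in "$workflow_dir"/*.ga; do
        # When no file matches, the pattern is left unexpanded: skip it
        [ -e "$ga" ] || continue
        workflow-install --workflow_path "$ga" -a "$api_key" --galaxy "$galaxy_url" || return 1
    done
}
```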

2.1.10.3 Tours

1. Download the tours

• Amplicon data analysis

• Shotgun data analysis

2. Put the files in the config/plugins/tours/ directory of the Galaxy folder

3. Restart the Galaxy instance
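Step 2 can be sketched as a small copy helper. The function name `install_tours` is hypothetical, and the way to restart Galaxy in step 3 depends on how the instance is run, so it is left out.

```shell
# Copy downloaded tour files into the tours directory of a Galaxy folder.
# Arguments: path of the Galaxy folder, then one or more tour files.
# install_tours is a hypothetical helper name.
install_tours() {
    if [ "$#" -lt 2 ]; then
        echo "usage: install_tours <Galaxy folder> <tour file>..." >&2
        return 1
    fi
    galaxy_root="$1"
    shift
    tours_dir="${galaxy_root}/config/plugins/tours"
    mkdir -p "$tours_dir"   # create the directory if it does not exist yet
    cp "$@" "$tours_dir"/
}
```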


2.2 Tutorials and interactive tours

We care about training, so we work in close collaboration with the Galaxy Training Network (GTN) to develop training materials for Galaxy-based data analyses. These materials, hosted on the GTN GitHub repository, are available online at http://training.galaxyproject.org.

2.2.1 Tutorials

We then developed several hands-on tutorials about metagenomics:

Analyses of metagenomics data - The global picture

This tutorial introduces amplicon and shotgun data analyses, the general principles behind them and their differences.

16S Microbial Analysis with Mothur

In this tutorial the Standard Operating Procedure (SOP) for MiSeq data, developed by the creators of the Mothur software package, is performed within Galaxy.

These tutorials can be run on ASaiM. We used them during several workshops on metagenomics data analysis, with ASaiM as training support.

These tutorials are also accessible directly from ASaiM.

2.2.2 Interactive tours

To complement these tutorials, interactive tours have been developed and integrated inside ASaiM. Such tours guide users through an entire analysis in an interactive (step-by-step) way.

Fig. 1: Example of tour for the “Analyses of metagenomics data - The global picture” tutorial

Some tours, included in every Galaxy instance, also explain how to use Galaxy, the history, etc.

All the tours can be accessed inside Galaxy:

• Click on ‘Help’ (top panel)

• Click on ‘Interactive Tours’

• Choose the tours you want

• Enjoy!

2.3 Tools

More than 200 tools are automatically integrated in the custom Galaxy instance during its deployment. They were chosen for their use in the exploitation of microbiota data, and are hierarchically organized (into sections and subsections) to guide users and help them choose the best tools for a specific analysis.


2.3.1 File and meta tools

2.3.1.1 Data retrieval

Name | Version | Galaxy wrapper
EBISearch | 0.0.3 | suite_ebi_tools
ENASearch | 0.1.1 | suite_enasearch
NCBI Entrez E-Utilities | | suite_ncbi_entrez
NCBI SRA | 2.8.1 | sra_tools

2.3.1.2 Text manipulation

Name | Version | Galaxy wrapper
Paste two files side by side | |
Select random lines from a file | |
Line/Word/Character count of a dataset | |
Filter data on any column using simple expressions | |
Sort data in ascending or descending order | |
Select lines that match an expression | |
Join two datasets side by side on a specified field | |
Compare two datasets to find common or distinct rows | |
Group data by a column and perform aggregate operation on other columns | |
Add column to an existing dataset | | add_value
Change case of selected columns | | change_case
Column join | | column_join
Column Join on Collections | | collection_column_join
Compute an expression on every row | | column_maker
Concatenate multiple datasets tail-to-head | | concatenate_multiple
Convert delimiters to TAB | | convert_characters
Cut columns from a table | | cut_columns
Merge columns together | | merge_cols
Column regex find and replace | | regex_find_replace
Remove beginning of a file | | remove_beginning
Select first lines from a dataset | | show_beginning
Select last lines from a dataset | | show_tail
Split file according to the values of a column | | split_file_on_column
Unique occurrences of each record | | unique
Text processing tools using the GNU coreutils, sed, awk and friends | 8.25 | text_processing
GNU datamash | 1.0.6 | suite_datamash


2.3.1.3 Sequence file manipulation

Name | Version | Galaxy wrapper
Filter sequences by id from a tabular file | | seq_filter_by_id
Rename sequences with id mapping from a tabular file | | seq_rename
Select sequences by id from a tabular file | | seq_select_by_id
FastQ to FastA converter | | fastq_to_fasta
FastA to Tabular converter | | fasta_to_tabular
Split paired end reads | | split_paired_reads
Add barcodes to Fasta sequences | | fasta_add_barcode
Combine FASTA and QUAL into FASTQ | 1.1.0 | fastq_combiner
Filter FASTQ reads by quality score and length | 1.1.0 | fastq_filter
Convert between various FASTQ quality formats | 1.1.0 | fastq_groomer
Manipulate FASTQ reads on various attributes | 1.1.0 | fastq_manipulation
FASTQ Masker by quality score | 1.1.0 | fastq_masker_by_quality
FASTQ de-interlacer on paired end reads | 1.1.0 | fastq_paired_end_deinterlacer
FASTQ interlacer on paired end reads | 1.1.0 | fastq_paired_end_interlacer
FASTQ splitter on joined paired end reads | 1.1.0 | fastq_paired_end_splitter
FASTQ Summary Statistics by column | 1.1.0 | fastq_stats
FASTQ to Tabular converter | 1.1.0 | fastq_to_tabular
FASTQ Trimmer by quality | 1.1.0 | fastq_trimmer
FASTQ to FASTA converter | 1.1.0 | fastqtofasta
Tabular to FASTQ converter | 1.1.0 | tabular_to_fastq
FastQ joiner [2] | 0.0.3 | fastq_paired_end_joiner
FastQ-join [1] | 0.1.1 | fastq_join

2.3.1.4 BAM/SAM file manipulation

Name | Version | Galaxy wrapper
SAMTools | 1.2 | suite_samtools_1_2

2.3.1.5 BIOM file manipulation

Name | Version | Galaxy wrapper
BIOM Format | 2.1.5 | suite_biom_format

References

2.3.2 Genomics tools

2.3.2.1 Quality control

2.3.2.2 Clustering

Name | Version | Galaxy wrapper
CD-HIT [5][11] | 4.6.4 | cdhit
Format CD-HIT output | | format_cd_hit_output


2.3.2.3 Sorting and Prediction

Name | Version | Galaxy wrapper
SortMeRNA [6] | 2.1b | sortmerna
FragGeneScan [12] | 1.30 | fraggenescan

2.3.2.4 Similarity search and alignment

Name | Version | Galaxy wrapper
NCBI BLAST [2][3] | 2.2.30 | ncbi_blast_plus
Diamond [1] | 0.8.24 | diamond
HMMER3 [4] | 3.1b2 | hmmer3

2.3.2.5 Mapping

Name | Version | Galaxy wrapper
BWA [9][10][8] | 0.7.12 | bwa
Bowtie2 [7] | 2.3.2 | bowtie2

References

2.3.3 Microbiota dedicated tools

2.3.3.1 Metagenomics data manipulation

Name | Version | Galaxy wrapper
VSEARCH [13] | 1.9.7 | vsearch
Nonpareil [11] | 3.1.1 | nonpareil

2.3.3.2 Assembly

Name | Version | Galaxy wrapper
MEGAHIT [7] | 1.1.2 | vsearch
metaSPAdes [9] | 3.9.0 | nonpareil
metaQUAST [8] | 4.5 | quast
VALET | 1.0 | valet

2.3.3.3 Metataxonomic sequence analysis

Name | Version | Galaxy wrapper
Mothur [13] | 1.36.1 | suite_mothur
QIIME [4] | 1.9.1 | suite_qiime


2.3.3.4 Taxonomy assignation on WGS sequences

Name | Version | Galaxy wrapper
MetaPhlAn2 [15] | 2.6.0 | suite_metaphlan2
Format MetaPhlAn2 | 0.1.0 | format_metaphlan2_output
KRAKEN [15] | 0.10.5 | suite_kraken_0_10_5

2.3.3.5 Metabolism assignation

Name | Version | Galaxy wrapper
HUMAnN2 [1] | 0.11.1 | suite_humann2
Group HUMAnN2 to GO slim term [3] | 1.2.0 | group_humann2_uniref_abundances_to_go
PICRUST [6] | 1.1.1 | suite_picrust
InterProScan [5] | 5.0.0 | interproscan5

2.3.3.6 Combination of functional and taxonomic results

Name | Version | Galaxy wrapper
Combine MetaPhlAn2 and HUMAnN2 outputs | 0.1.0 | combine_metaphlan2_humann2

Visualization

Name | Version | Galaxy wrapper
export2graphlan | 0.19 | export2graphlan
GraPhlAn [2] | 1.0.0 | suite_graphlan
KRONA [11] | 2.6.1 | taxonomy_krona_chart

References

These tools come with databases, which are automatically downloaded and configured during deployment of the ASaiM Galaxy instance.

Name | Version | Comments
SILVA | 119 | Reduced with HMMER 3.1b1 and SumaClust v1.0.00 and formatted for SortMeRNA
ChocoPhlAn | 0.1.1 | Microbial pangenomes
UniRef50 | | Filtered with Diamond to be HUMAnN2 compatible
MetaPhlAn2 | 2.2.5 | Unique clade-specific marker genes identified from 17,000 reference genomes

2.4 Workflows

To orchestrate tools and help users with their analyses, several workflows populate the ASaiM framework. They formally orchestrate tools in a defined order and with defined parameters, but they are customizable (tools, order, parameters).


2.4.1 Analysis of raw metagenomic or metatranscriptomic shotgun data

The workflow quickly produces, from raw metagenomic or metatranscriptomic shotgun data, accurate and precise taxonomic assignations, extensive functional results and taxonomically related metabolism information.

Fig. 2: Main ASaiM workflow to analyze raw sequences. Image available under CC-BY license (https://doi.org/10.6084/m9.figshare.5371396.v3)

This workflow consists of

1. Processing with quality control/trimming (FastQC and Trim Galore!) and dereplication (VSearch [13])

2. Taxonomic analyses with assignation (MetaPhlAn2 [15]) and visualization (KRONA [11], GraPhlAn [2])

3. Functional analyses with metabolic assignation and pathway reconstruction (HUMAnN2 [1])

4. Functional and taxonomic combination with developed tools combining HUMAnN2 and MetaPhlAn2 outputs

This workflow has been tested on two mock metagenomic datasets with controlled communities (See “Validation”).

2.4.2 Assembly of metagenomic data

To reconstruct genomes or to get longer sequences for further analysis, microbiota data need to be assembled using the recently developed metagenomics assemblers.

To help with this task, two workflows have been developed in ASaiM, each using one of the well-performing assemblers [12][16][14][5][10][17][3]

• MEGAHIT [7]

It is currently the most computationally efficient assembler: it has the lowest memory and time consumption [16][3][14]. It produces some of the best assemblies (irrespective of sequencing coverage) with the fewest structural errors [10] and outperforms other assemblers in recovering the genomes of closely related strains [3], but it has a bias towards relatively low coverage genomes, leading to a suboptimal assembly of highly abundant community member genomes in very large datasets [17]

• MetaSPAdes [9]

It is particularly well suited to high-coverage metagenomes [16], with the best contig metrics [5], and produces few under-collapsed/over-collapsed repeats [10]

Both workflows consist of

1. Processing with quality control/trimming (FastQC and Trim Galore!)


2. Assembly with either MEGAHIT or MetaSPAdes

3. Estimation of the assembly quality statistics with MetaQUAST [8]

4. Identification of potential assembly error signature with VALET

5. Determination of percentage of unmapped reads with Bowtie2 [6] combined with MultiQC [4] to aggregate the results.

2.4.3 Analysis of metataxonomic data

To analyze amplicon data, the Mothur and QIIME tool suites are available in ASaiM. We integrated the workflows described in the tutorials on the Mothur and QIIME websites, as examples of amplicon data analyses as well as support for the training material. These workflows, like any workflow available in ASaiM, can be adapted for a specific analysis or used as subworkflows by the users.

2.4.4 Running as in EBI metagenomics

The tools used in the EBI Metagenomics pipeline are also available in ASaiM. We therefore also integrated a workflow with the same steps as the EBI Metagenomics pipeline (3.0).

Fig. 3: EBI Metagenomics workflow (3.0) in ASaiM

Analyses made on the EBI Metagenomics website can then be reproduced locally, without having to wait for the availability of EBI Metagenomics or to upload any data to EBI Metagenomics. However, the parameters must be defined by the user, as we could not find them in the EBI Metagenomics documentation.

References

2.5 Validation

ASaiM framework was tested on two mock metagenomic datasets from the HMP metagenomes mock pilot project. These datasets are metagenomic shotgun sequences (>1,200,000 454 GS FLX Titanium single-end sequences) from a controlled community (with 22 known microbial species), available on the EBI metagenomics database (SRR072232 and SRR072233).

Results obtained with ASaiM framework were intensively analyzed and compared to the ones from the EBI metagenomics pipeline (version 1.0).

For these analyses, ASaiM framework was deployed on a computer with a Debian GNU/Linux system, 8 Intel(R) Xeon(R) cores at 2.40 GHz and 32 GB of RAM. The size of the process in memory is stable over workflow execution (variability below 40 kb). Workflow execution is relatively fast: < 5h and < 5h30 for the datasets with 1,225,169 and 1,386,198 sequences respectively.

Taxonomic analyses give a great insight into the community structure with complete, accurate and statistically supported information

Fig. 4: Taxonomic results for SRR072233

A broad overview of the metabolic profile is available with ASaiM framework, with gene families, pathways and GO slim terms. Only GO slim term information can be compared to the EBI metagenomics pipeline results. It is difficult to determine which method is the best one, as no expected functional information is available.

The taxonomically-related functional information makes it possible to investigate which species are involved in which metabolic functions. This type of investigation is specific to ASaiM framework, as the EBI metagenomics pipeline does not provide a way to link taxonomic and functional results.

Further analyses and detailed comparisons of both tools and their results are organized in the report on this validation. Details about these analyses are available on the dedicated GitHub repository.


Fig. 5: Example of functional results for SRR072233 with the relative abundance of GO slim terms related to cellular processes


Fig. 6: Example of investigation of which species is involved in which metabolic functions for SRR072233: involved species and their relative involvement in fatty acid biosynthesis pathways


Bibliography

[1] Erik Aronesty. Ea-utils : “Command-line tools for processing biological sequencing data”. 2011. URL: http://code.google.com/p/ea-utils.

[2] Daniel Blankenberg, Assaf Gordon, Gregory Von Kuster, Nathan Coraor, James Taylor, Anton Nekrutenko,and the Galaxy Team. Manipulation of FASTQ data with Galaxy. Bioinformatics, 26(14):1783–1785, July2010. URL: http://bioinformatics.oxfordjournals.org/content/26/14/1783, doi:10.1093/bioinformatics/btq281.

[1] Benjamin Buchfink, Chao Xie, and Daniel H. Huson. Fast and sensitive protein alignment using DIAMOND.Nat Meth, 12(1):59–60, January 2015. URL: http://www.nature.com.gate1.inist.fr/nmeth/journal/v12/n1/full/nmeth.3176.html, doi:10.1038/nmeth.3176.

[2] Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, andThomas L. Madden. BLAST+: architecture and applications. BMC Bioinformatics, 10(1):421, December2009. URL: http://www.biomedcentral.com/1471-2105/10/421/abstract, doi:10.1186/1471-2105-10-421.

[3] Peter JA Cock, John M. Chilton, Björn Grüning, James E. Johnson, and Nicola Soranzo. NCBI BLAST+ integrated into Galaxy. GigaScience, 4(1):39, August 2015. URL: http://www.gigasciencejournal.com/content/4/1/39/abstract, doi:10.1186/s13742-015-0080-7.

[4] Robert D Finn, Jody Clements, and Sean R Eddy. Hmmer web server: interactive sequence similarity searching. Nucleic acids research, 39(suppl_2):W29–W37, 2011.

[5] Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152, December 2012. doi:10.1093/bioinformatics/bts565.

[6] Evguenia Kopylova, Laurent Noé, and Hélène Touzet. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics, 28(24):3211–3217, December 2012. doi:10.1093/bioinformatics/bts611.

[7] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357–359, 2012.

[8] Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio], March 2013. arXiv: 1303.3997. URL: http://arxiv.org/abs/1303.3997.

[9] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, July 2009. doi:10.1093/bioinformatics/btp324.

[10] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, March 2010. URL: http://bioinformatics.oxfordjournals.org/content/26/5/589, doi:10.1093/bioinformatics/btp698.

[11] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, July 2006. doi:10.1093/bioinformatics/btl158.

[12] Mina Rho, Haixu Tang, and Yuzhen Ye. Fraggenescan: predicting genes in short and error-prone reads. Nucleic acids research, 38(20):e191–e191, 2010.

[1] Sahar Abubucker, Nicola Segata, Johannes Goll, Alyxandria M. Schubert, Jacques Izard, Brandi L. Cantarel, Beltran Rodriguez-Mueller, Jeremy Zucker, Mathangi Thiagarajan, Bernard Henrissat, Owen White, Scott T. Kelley, Barbara Methé, Patrick D. Schloss, Dirk Gevers, Makedonka Mitreva, and Curtis Huttenhower. Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. PLoS Comput Biol, 8(6):e1002358, June 2012. URL: http://dx.doi.org/10.1371/journal.pcbi.1002358, doi:10.1371/journal.pcbi.1002358.

[2] Francesco Asnicar, George Weingart, Timothy L Tickle, Curtis Huttenhower, and Nicola Segata. Compact graphical representation of phylogenetic data and metadata with graphlan. PeerJ, 3:e1029, 2015.

[3] Bérénice Batut. Group abundances of UniRef50 gene families obtained with HUMAnN2 to Gene Ontology (GO) slim terms with relative abundances: release v1.2.0. 2016. URL: http://dx.doi.org/10.5281/zenodo.50086.

[4] J. Gregory Caporaso, Justin Kuczynski, Jesse Stombaugh, Kyle Bittinger, Frederic D. Bushman, Elizabeth K. Costello, Noah Fierer, Antonio Gonzalez Peña, Julia K. Goodrich, Jeffrey I. Gordon, Gavin A. Huttley, Scott T. Kelley, Dan Knights, Jeremy E. Koenig, Ruth E. Ley, Catherine A. Lozupone, Daniel McDonald, Brian D. Muegge, Meg Pirrung, Jens Reeder, Joel R. Sevinsky, Peter J. Turnbaugh, William A. Walters, Jeremy Widmann, Tanya Yatsunenko, Jesse Zaneveld, and Rob Knight. QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7(5):335–336, May 2010. URL: http://www.nature.com/nmeth/journal/v7/n5/full/nmeth.f.303.html, doi:10.1038/nmeth.f.303.

[5] Sarah Hunter, Rolf Apweiler, Teresa K Attwood, Amos Bairoch, Alex Bateman, David Binns, Peer Bork, Ujjwal Das, Louise Daugherty, Lauranne Duquenne, and others. Interpro: the integrative protein signature database. Nucleic acids research, 37(suppl_1):D211–D215, 2008.

[6] Morgan GI Langille, Jesse Zaneveld, J Gregory Caporaso, Daniel McDonald, Dan Knights, Joshua A Reyes, Jose C Clemente, Deron E Burkepile, Rebecca L Vega Thurber, Rob Knight, and others. Predictive functional profiling of microbial communities using 16s rrna marker gene sequences. Nature biotechnology, 31(9):814–821, 2013.

[7] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics, 31(10):1674–1676, 2015.

[8] Alla Mikheenko, Vladislav Saveliev, and Alexey Gurevich. Metaquast: evaluation of metagenome assemblies. Bioinformatics, 32(7):1088–1090, 2015.

[9] Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel Pevzner. Metaspades: a new versatile de novo metagenomics assembler. arXiv preprint arXiv:1604.03071, 2016.

[11] Brian D Ondov, Nicholas H Bergman, and Adam M Phillippy. Interactive metagenomic visualization in a web browser. BMC bioinformatics, 12(1):385, 2011.

[11] Luis M Rodriguez-r and Konstantinos T Konstantinidis. Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics, 30(5):629–635, 2013.

[13] Torbjørn Rognes, Frédéric Mahé, Tomas Flouri, Daniel McDonald, and Pat Schloss. Vsearch: VSEARCH 1.4.0. 2015. URL: https://github.com/torognes/vsearch.

[13] Patrick D. Schloss, Sarah L. Westcott, Thomas Ryabin, Justine R. Hall, Martin Hartmann, Emily B. Hollister, Ryan A. Lesniewski, Brian B. Oakley, Donovan H. Parks, Courtney J. Robinson, Jason W. Sahl, Blaz Stres, Gerhard G. Thallinger, David J. Van Horn, and Carolyn F. Weber. Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Appl. Environ. Microbiol., 75(23):7537–7541, December 2009. URL: http://aem.asm.org/content/75/23/7537, doi:10.1128/AEM.01541-09.

[15] Duy Tin Truong, Eric A. Franzosa, Timothy L. Tickle, Matthias Scholz, George Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower, and Nicola Segata. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Meth, 12(10):902–903, October 2015. URL: http://www.nature.com.gate1.inist.fr/nmeth/journal/v12/n10/full/nmeth.3589.html, doi:10.1038/nmeth.3589.

[15] Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):R46, 2014.

[1] Sahar Abubucker, Nicola Segata, Johannes Goll, Alyxandria M. Schubert, Jacques Izard, Brandi L. Cantarel, Beltran Rodriguez-Mueller, Jeremy Zucker, Mathangi Thiagarajan, Bernard Henrissat, Owen White, Scott T. Kelley, Barbara Methé, Patrick D. Schloss, Dirk Gevers, Makedonka Mitreva, and Curtis Huttenhower. Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome. PLoS Comput Biol, 8(6):e1002358, June 2012. URL: http://dx.doi.org/10.1371/journal.pcbi.1002358, doi:10.1371/journal.pcbi.1002358.

[2] Francesco Asnicar, George Weingart, Timothy L Tickle, Curtis Huttenhower, and Nicola Segata. Compact graphical representation of phylogenetic data and metadata with graphlan. PeerJ, 3:e1029, 2015.

[3] Sherine Awad, Luiz Irber, and C Titus Brown. Evaluating metagenome assembly on a simple defined community with many strain variants. bioRxiv, pages 155358, 2017.

[4] Philip Ewels, Måns Magnusson, Sverker Lundin, and Max Käller. Multiqc: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19):3047–3048, 2016.

[5] William W Greenwald, Niels Klitgord, Victor Seguritan, Shibu Yooseph, J Craig Venter, Chad Garner, Karen E Nelson, and Weizhong Li. Utilization of defined microbial communities enables effective evaluation of meta-genomic assemblies. BMC genomics, 18(1):296, 2017.

[6] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome biology, 10(3):R25, 2009.

[7] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. Megahit: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics, 31(10):1674–1676, 2015.

[8] Alla Mikheenko, Vladislav Saveliev, and Alexey Gurevich. Metaquast: evaluation of metagenome assemblies. Bioinformatics, 32(7):1088–1090, 2015.

[9] Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A Pevzner. Metaspades: a new versatile metagenomic assembler. Genome Research, 27(5):824–834, 2017.

[10] Nathan D Olson, Todd J Treangen, Christopher M Hill, Victoria Cepeda-Espinoza, Jay Ghurye, Sergey Koren, and Mihai Pop. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Briefings in Bioinformatics, pages bbx098, 2017.

[11] Brian D Ondov, Nicholas H Bergman, and Adam M Phillippy. Interactive metagenomic visualization in a web browser. BMC bioinformatics, 12(1):385, 2011.

[12] Christopher Quince, Alan W Walker, Jared T Simpson, Nicholas J Loman, and Nicola Segata. Shotgun metagenomics, from sampling to analysis. Nature Biotechnology, 35(9):nbt–3935, 2017.

[13] Torbjørn Rognes, Frédéric Mahé, Tomas Flouri, Daniel McDonald, and Pat Schloss. Vsearch: VSEARCH 1.4.0. 2015. URL: https://github.com/torognes/vsearch.

[14] Alexander Sczyrba, Peter Hofmann, Peter Belmann, David Koslicki, Stefan Janssen, Johannes Droege, Ivan Gregor, Stephan Majda, Jessika Fiedler, Eik Dahms, and others. Critical assessment of metagenome interpretation - a benchmark of computational metagenomics software. Biorxiv, pages 099127, 2017.

[15] Duy Tin Truong, Eric A. Franzosa, Timothy L. Tickle, Matthias Scholz, George Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower, and Nicola Segata. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Meth, 12(10):902–903, October 2015. URL: http://www.nature.com.gate1.inist.fr/nmeth/journal/v12/n10/full/nmeth.3589.html, doi:10.1038/nmeth.3589.

[16] Andries Johannes van der Walt, Marc Warwick Van Goethem, Jean-Baptiste Ramond, Thulani Peter Makhalanyane, Oleg Reva, and Don Arthur Cowan. Assembling metagenomes, one community at a time. bioRxiv, pages 120154, 2017.

[17] John Vollmers, Sandra Wiegand, and Anne-Kristin Kaster. Comparing and evaluating metagenome assembly tools from a microbiologist’s perspective-not only size matters! PloS one, 12(1):e0169662, 2017.
