Novogene META Demo Report

Post on 18-Jun-2020

1 Introduction

Microbial populations exist in almost every ecological community on Earth, from the insect gut to the oceans and the sediment beneath them. For most of its history, life on Earth consisted solely of microscopic life forms, and microbial life still dominates Earth in many respects. Microbes are not only ubiquitous; they are also essential to all forms of life, as the primary source of nutrients and the primary recyclers of dead matter back into available organic form [1].

For a long time, microbial ecologists were mostly restricted to pure cultures of cultivable isolates to shed light on the diversity and functions of environmental microbes [2]. Pure cultures allow the study of an isolate’s metabolism and, through genome sequencing, of its gene repertoire. Both provide valuable information for inferring the isolate’s ecophysiological role. However, the cultivability of environmental microbes is often below 1% of the total bacteria, which limits research on microbial diversity.

We now have the ability to obtain genomic information directly from microbial communities in their natural habitats. Instead of looking at a few species individually, we are able to study tens of thousands together. Metagenomics is defined as the direct genetic analysis of the genomes contained in an environmental sample (the community) [3]. It covers the study of genomic DNA obtained from microorganisms that cannot be cultured in the laboratory. Metagenomics provides access to the functional gene composition of microbial communities, and thus gives a much broader description than phylogenetic surveys, which are usually based on the diversity of a single gene, for instance the 16S rRNA gene. The classical metagenome approach involves cloning environmental DNA into vectors with the help of ultra-competent bioengineered host strains. The resulting clone libraries are subsequently screened either for dedicated marker genes (sequence-driven approach) or for metabolic functions (function-driven approach) [2].

Nowadays, shotgun metagenomics is commonly used to study the gene inventories of microbial communities [4,5,6]. With the rapid development of next-generation sequencing, large metagenomes and studies covering multiple metagenomes have been generated. There have been many pioneering metagenomic studies in different environments, such as the NIH Human Microbiome Project (HMP, http://www.hmpdacc.org/) and the Earth Microbiome Project (EMP, http://www.earthmicrobiome.org/).

2 Project Process

2.1 Experimental Procedures

After the DNA samples are delivered, we first conduct a sample quality test. The qualified DNA samples are then used to construct libraries, and the qualified libraries are used for sequencing. Bioinformatics analysis is carried out on the sequencing data. From samples to raw data, every step of sample testing, library construction and sequencing can affect the quality and quantity of the data, and data quality directly affects the results of the downstream bioinformatics analysis. To ensure the accuracy and reliability of the sequencing data from the outset, each step of sample testing, library construction and sequencing is strictly controlled by Novogene.

2.1.1 Quality Control (QC)

Three key methods are used in QC for DNA samples:
(1) DNA degradation and potential contamination were monitored on 1% agarose gels.
(2) DNA purity (OD260/OD280, OD260/OD230) was checked using the NanoPhotometer® spectrophotometer (IMPLEN, CA, USA).
(3) DNA concentration was measured using the Qubit® dsDNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA).

DNA samples with an OD value between 1.8 and 2.0 and a DNA content above 1 μg were used for library construction.
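
The acceptance rule above can be sketched as a simple filter. This is only an illustration: the report does not say which OD ratio the 1.8 to 2.0 window applies to, so the sketch assumes OD260/OD280, and the `DnaSample` class, field names and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DnaSample:
    name: str
    od260_280: float     # purity ratio (assumed to be the "OD value" in the text)
    total_dna_ug: float  # total DNA amount in micrograms


def passes_qc(sample, od_min=1.8, od_max=2.0, min_dna_ug=1.0):
    """Return True when purity is within the window and DNA content is above 1 ug."""
    purity_ok = od_min <= sample.od260_280 <= od_max
    amount_ok = sample.total_dna_ug > min_dna_ug
    return purity_ok and amount_ok


# Hypothetical samples, for illustration only.
samples = [
    DnaSample("S1", 1.85, 2.4),
    DnaSample("S2", 1.62, 3.0),  # purity outside the window
    DnaSample("S3", 1.95, 0.8),  # too little DNA
]
qualified = [s.name for s in samples if passes_qc(s)]
```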

2.1.2 Library Construction

A total of 1 μg DNA per sample was used as input material for DNA sample preparation. Sequencing libraries were generated using the NEBNext® Ultra™ DNA Library Prep Kit for Illumina (NEB, USA) following the manufacturer’s recommendations, and index codes were added to attribute sequences to each sample. Briefly, the DNA sample was fragmented to a size of 300 bp by sonication; the resulting DNA fragments were then end-polished, A-tailed and ligated with the full-length adaptor for Illumina sequencing, followed by PCR amplification. Finally, the PCR products were purified (AMPure XP system), and the libraries were analyzed for size distribution on an Agilent 2100 Bioanalyzer and quantified by real-time PCR.

2.1.3 Sequencing

The clustering of the index-coded samples was performed on a cBot Cluster

Generation System according to the manufacturer’s instructions. After cluster

generation, the prepared libraries were sequenced on an Illumina HiSeq2500 platform

and paired-end reads were generated.

2.2 Bioinformatics Analysis

The initial step of metagenomic data analysis is a set of pre-filtering steps. The filtered reads are then assembled into contiguous consensus sequences (contigs), followed by taxonomic annotation, abundance estimation and gene prediction [7]. Functional annotation is then performed against metabolic pathway, Cluster of Orthologous Groups (COG) and carbohydrate-active enzyme databases.

To determine the similarities and differences in taxonomic and functional composition between samples, clustering analysis and Principal Component Analysis (PCA) are performed. A series of advanced analysis items is also available to explore environmental samples in greater depth, such as LEfSe, significant difference analysis, CCA/RDA, NMDS, prediction of secreted proteins and type III secretion system effectors, annotation against VFDB, ARDB and PHI-base, and comparative metabolic pathway analysis of multiple samples.

3 Analysis Results

3.1 Sequence Pre-filtering

The first step of data analysis is a set of pre-filtering steps:
a) Remove low-quality reads in which the proportion of low-quality bases (quality ≤ 5) exceeds 40%
b) Remove N-rich reads in which the proportion of N bases exceeds 10%
c) Remove adapter-contaminated reads that overlap the adapter by more than 15 bp
d) Remove host contamination (any read whose identity to the host genome is higher than 90%)
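
Rules a) to c) can be sketched as a per-read filter. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the adapter string is a placeholder, the overlap is approximated as the longest adapter prefix found anywhere in the read, and rule d) is omitted because it requires an alignment against a host genome.

```python
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # placeholder adapter (assumption)


def low_quality_fraction(quals, max_low_q=5):
    """Fraction of bases whose quality score is <= max_low_q."""
    return sum(1 for q in quals if q <= max_low_q) / len(quals)


def n_fraction(seq):
    """Fraction of bases called as N."""
    return seq.upper().count("N") / len(seq)


def adapter_overlap(seq, adapter):
    """Length of the longest adapter prefix that occurs in the read."""
    for k in range(len(adapter), 0, -1):
        if adapter[:k] in seq:
            return k
    return 0


def keep_read(seq, quals, adapter=ADAPTER):
    """Apply rules a), b) and c); return True if the read survives."""
    if low_quality_fraction(quals) > 0.40:
        return False
    if n_fraction(seq) > 0.10:
        return False
    if adapter_overlap(seq, adapter) > 15:
        return False
    return True


# Hypothetical 100 bp read with uniformly good qualities.
good_read = "ACGT" * 25
good_quals = [30] * 100
```

In a paired-end setting, a pair would typically be discarded if either mate fails; that choice is not stated in the report and is left out of the sketch.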

Table 3.1 Statistics of data pre-filtering

#Sample: sample name; InsertSize: the insert size of the library; SeqStrategy: sequencing strategy, e.g. (125:125) means paired-end reads of 125 bp; RawData: raw sequencing data in Mbp; CleanData: effective data obtained after pre-filtering, in Mbp; Clean_Q20/Clean_Q30: the proportion of bases in the clean data with a quality score of at least 20 (error rate ≤ 1%) and at least 30 (error rate ≤ 0.1%), respectively; Clean_GC(%): the percentage of G and C bases in the clean data; Effective(%): the ratio of effective data to raw data.
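
The column definitions above can be made concrete with a short sketch. It assumes Phred-scaled quality scores, for which the error probability is 10^(-Q/10), so Q20 corresponds to 1% and Q30 to 0.1%. The function names and the toy reads are hypothetical.

```python
def phred_error_probability(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def clean_data_stats(clean_reads, raw_bases):
    """Compute Clean_Q20, Clean_Q30, Clean_GC and Effective, all in percent.

    clean_reads: list of (sequence, quality_scores) pairs after pre-filtering.
    raw_bases:   total number of bases in the raw data.
    """
    total = sum(len(seq) for seq, _ in clean_reads)
    q20 = sum(1 for _, quals in clean_reads for q in quals if q >= 20)
    q30 = sum(1 for _, quals in clean_reads for q in quals if q >= 30)
    gc = sum(seq.upper().count("G") + seq.upper().count("C")
             for seq, _ in clean_reads)
    return {
        "Clean_Q20": 100.0 * q20 / total,
        "Clean_Q30": 100.0 * q30 / total,
        "Clean_GC": 100.0 * gc / total,
        "Effective": 100.0 * total / raw_bases,
    }


# Toy example: two 4 bp clean reads kept from 10 raw bases.
toy_reads = [("ACGT", [30, 30, 20, 10]), ("GGCC", [35, 25, 15, 40])]
stats = clean_data_stats(toy_reads, raw_bases=10)
```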

3.2 Assembly

The SOAPdenovo [8] package was used for metagenomic assembly with different K values (default 49, 55 and 59). For each sample, the scaffold assembly with the largest N50 was selected for subsequent analysis. The number and length distribution of the Scaftigs are shown in Table 3.2 and Figure 3.2, respectively.
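
As a reminder of how N50 drives this selection, the sketch below computes N50/N90 from a list of sequence lengths and picks the K value with the largest N50. The length lists are invented for illustration, and `nx_length` is a hypothetical helper rather than part of SOAPdenovo.

```python
def nx_length(lengths, fraction):
    """Length L such that sequences of length >= L cover `fraction` of the total."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return 0


# Hypothetical scaffold lengths from three assemblies of the same sample.
assemblies = {
    49: [80, 70, 50, 40, 30, 20],
    55: [120, 90, 40, 20, 10, 10],
    59: [60, 60, 60, 50, 40, 20],
}
n50_by_k = {k: nx_length(lengths, 0.5) for k, lengths in assemblies.items()}
best_k = max(n50_by_k, key=n50_by_k.get)
```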

Table 3.2 Statistics of the Scaftigs of each sample

#SampleID: sample name; Total Len.: the total length of all Scaftigs; Num.: the total number of all Scaftigs; Average Len.: the average length of all Scaftigs; N50/N90 Len.: the N50/N90 length; Max Len.: the length of the longest Scaftig.

Figure 3.2 The length distribution of the Scaftigs of each sample. The Y1-axis, titled “Frequence (#)”, gives the number of Scaftigs of a given length; the Y2-axis, titled “Percentage (%)”, gives the percentage of all Scaftigs that have a given length; the X-axis, titled “ScaftigsLength(bp)”, gives the length of the Scaftigs.

3.3 Species Annotation and Abundance

3.3.1 Species Annotation

CD-HIT was used to cluster the Scaftigs derived from assembly at a default identity of 0.95. To analyze the relative abundance of the Scaftigs in each sample, the clean reads obtained after pre-filtering were first mapped to the non-redundant Scaftigs dataset with SoapAligner [10]. Scaftigs with a total depth of 0 were then filtered out, yielding the abundance table of the filtered Scaftigs.

The corresponding Scaftigs were aligned against the Bacteria, Fungi, Archaea and Virus sequences extracted from the NCBI NT database (E-value ≤ 1e-05). The LCA (lowest common ancestor) algorithm, as applied in the MEGAN [11] software system, was used to ensure the significance of the annotation by selecting the lowest common classified ancestor for final display.

Based on the LCA annotation results and the Scaftigs abundance, relative abundance tables at different taxonomic levels were calculated. The top ten phyla were selected and are displayed in Figure 3.3.1.
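
The idea behind LCA assignment can be illustrated with a minimal sketch: given the taxonomic lineages of all significant hits for one Scaftig, keep the deepest rank on which every hit agrees. This does not reproduce MEGAN's implementation; the function name and the example lineages are assumptions made for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the shared lineage prefix of all hits (root first)."""
    if not lineages:
        return []
    common = []
    for ranks in zip(*lineages):
        # Stop at the first rank where the hits disagree.
        if all(rank == ranks[0] for rank in ranks):
            common.append(ranks[0])
        else:
            break
    return common


# Hypothetical hits for one Scaftig, written as root-to-leaf lineages.
hits = [
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales"],
    ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales"],
]
assignment = lowest_common_ancestor(hits)
```

Here the Scaftig would be reported at the phylum level, because the hits disagree from the class level downwards.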

Figure 3.3.1 Phylum distribution of all samples. “Relative Abundance” is plotted on the Y-axis and “Samples Name” on the X-axis; “Others” represents the total relative abundance of all phyla other than the top 10.

3.3.2 Species Abundance Cluster

The 35 dominant genera across all samples were selected based on the species annotation and abundance information of all samples at the genus level. The abundance distribution of these 35 genera is displayed in the species abundance heat-map shown in Figure 3.3.2.

Figure 3.3.2 Species abundance Heat-map. Plotted by sample name on the X-axis and

selected genera on the Y-axis. The absolute value of 'Z' represents the distance

between the raw score and the population mean in units of the standard deviation. 'Z'

is negative when the raw score is below the mean, positive when above.
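
The Z value described in the caption can be computed per genus (one row of the heat-map) as sketched below. The sketch uses the population standard deviation, following the caption's wording; whether the plotting tool uses the population or the sample standard deviation is not stated in the report, and the abundance values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one row: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant row carries no contrast between samples.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical relative abundances of one genus in three samples.
row = [1.0, 2.0, 3.0]
z = z_scores(row)
```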

3.4 Gene Prediction and Abundance

3.4.1 Gene Prediction

Scaftigs over 300 bp were selected for gene prediction with the MetaGeneMark [13] software, which is widely used in metagenomic analysis and for gene prediction in unknown prokaryotes, and is based on a heuristic algorithm and a very large training set. The result is shown in Table 3.4.1.

Table 3.4.1 Gene prediction basic information

#ORF NO.: the number of ORFs (open reading frames); Total Len. (Mbp): the total length of all ORFs; Average Len. (Mbp): the average length of all ORFs; GC Percent: the GC content of the ORFs.

Figure 3.4.1 The length distribution of the predicted genes. The number of genes is plotted on the Y1-axis, the percentage of genes on the Y2-axis, and the length of the genes on the X-axis.

3.4.2 Eliminating Redundancy of Genes

The predicted protein sequences were clustered at a threshold of 95% sequence similarity [14] with the CD-HIT [15,16] software, and the longest sequence in each cluster was selected as the representative sequence. The length distribution of the representative sequences is shown in Figure 3.4.2.
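
Once the clusters are available, choosing the representative is a simple selection step. The sketch below assumes the clustering result has already been parsed into a dictionary; the data structure, identifiers and sequences are hypothetical, and the clustering itself is not reimplemented.

```python
def pick_representatives(clusters):
    """Map each cluster ID to the ID of its longest member sequence.

    clusters: dict of cluster ID -> list of (gene_id, protein_sequence).
    """
    return {
        cluster_id: max(members, key=lambda member: len(member[1]))[0]
        for cluster_id, members in clusters.items()
    }


# Hypothetical clusters of predicted proteins.
clusters = {
    "cluster_1": [("gene_1", "MKT"), ("gene_2", "MKTAYIAK")],
    "cluster_2": [("gene_3", "MSL")],
}
representatives = pick_representatives(clusters)
```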

Figure 3.4.2 The length distribution of the representative genes. Plotted by the number

of the genes along the Y1-axis, percentage of genes along the Y2-axis, and the length

of genes along the X-axis.

3.4.3 Gene Abundance

To analyze the gene abundance in each sample, the start and end positions of each gene on the Scaftigs were combined with the per-base depth of the Scaftigs from which the genes came, yielding a table of representative gene abundance in each sample. Part of the representative gene abundance table is shown in Table 3.4.3.

Table 3.4.3 Part of the representative gene abundance table

Gene IDs are listed in rows and sample names in columns. Each cell gives the abundance of a given gene in the corresponding sample.
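
One way to combine gene coordinates with per-base depth is to average the depth over the gene's span, as sketched below. The report does not give the exact formula, so the use of the mean (rather than, say, a sum or a length-normalized read count) is an assumption, as are the function name and the toy depth profile.

```python
def gene_abundance(scaftig_depth, start, end):
    """Mean per-base depth over a gene.

    scaftig_depth: list of per-base depths for the Scaftig.
    start, end:    1-based inclusive gene coordinates on the Scaftig.
    """
    window = scaftig_depth[start - 1:end]
    if not window:
        raise ValueError("gene coordinates fall outside the Scaftig")
    return sum(window) / len(window)


# Toy depth profile of an 8 bp Scaftig and a gene spanning bases 3 to 6.
depth = [2, 2, 4, 4, 6, 6, 0, 0]
abundance = gene_abundance(depth, start=3, end=6)
```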

3.5 Functional Annotation and Abundance Analysis

Sequences were annotated to functional categories against the KEGG, eggNOG and CAZy databases using BLAST, and the hit with the minimum E-value was selected. Hits were retained when the BCR (BLAST Coverage Ratio) of both the reference and the query gene was ≥ 40%. The BCR is defined as:

BCR (Ref.) = (Match / Length(R)) × 100%

BCR (Que.) = (Match / Length(Q)) × 100%

where Match is the alignment length between the reference and the query gene, Length(R) is the length of the reference gene, and Length(Q) is the length of the query gene.
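
The two ratios and the 40% cutoff can be written out as follows. The sketch assumes that the cutoff must be met by both the reference and the query ratio; the function names and the example lengths are hypothetical.

```python
def blast_coverage_ratios(match_len, ref_len, query_len):
    """Return (BCR of reference, BCR of query) in percent."""
    bcr_ref = 100.0 * match_len / ref_len
    bcr_que = 100.0 * match_len / query_len
    return bcr_ref, bcr_que


def passes_bcr_cutoff(match_len, ref_len, query_len, cutoff=40.0):
    """True when both coverage ratios reach the cutoff (assumed interpretation)."""
    bcr_ref, bcr_que = blast_coverage_ratios(match_len, ref_len, query_len)
    return bcr_ref >= cutoff and bcr_que >= cutoff


# Example: a 120-residue alignment between a 300-residue reference
# and a 200-residue query.
ratios = blast_coverage_ratios(120, 300, 200)
kept = passes_bcr_cutoff(120, 300, 200)
```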

3.5.1 Functional Annotation

KEGG is a database resource for understanding the high-level functions and utilities of biological systems, such as the cell, the organism and the ecosystem, from genomic and molecular-level information (http://www.genome.jp/kegg/). KEGG is a reference knowledge base that integrates current knowledge on molecular interaction networks such as pathways and complexes (PATHWAY database), information about genes and proteins generated by genome projects (GENES/SSDB/KO databases), and information about biochemical compounds and reactions (COMPOUND/GLYCAN/REACTION databases). The PATHWAY database is a collection of manually drawn maps called the KEGG reference pathway maps, each corresponding to a known network of functional significance. These pathway maps are hierarchically classified, reflecting the map resolution and functional modules at different levels. There are seven categories at the top level (Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Organismal Systems, Human Diseases and Drug Development) and 54 subcategories at the second level. The third level of the hierarchy corresponds to individual pathway maps, and the fourth level corresponds to KO (KEGG Orthology) entries [17-19].

eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (derived from the original COG/KOG categories) (http://eggnog.embl.de). The eggNOG database currently contains 1.7 million orthologous groups in 3686 species, covering over 7.7 million proteins (built from 9.6 million proteins) [20].

The CAZy database describes the families of structurally related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds (http://www.cazy.org/) [21,22]. The enzyme classes currently include Glycoside Hydrolases (GHs), Glycosyl Transferases (GTs), Polysaccharide Lyases (PLs), Carbohydrate Esterases (CEs) and Auxiliary Activities (AAs). The associated modules currently cover the Carbohydrate-Binding Modules (CBMs).

Fig 3.5.1-1 Abundance of genes annotated to the functional databases. Top panel: KEGG database. Middle panel: eggNOG database. Bottom panel: CAZy database.

After annotation against the KEGG database, genes were displayed on KEGG maps. One of the maps is shown below.

Fig 3.5.1-2 Map of the TCA cycle pathway

3.5.2 Abundance Analysis of Functional Genes

Genes were annotated to the functional databases at different levels. The relative abundance of annotated genes at the top level of each functional database is shown in Fig 3.5.2.
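
A relative abundance table of this kind can be derived by summing gene abundances per top-level category and dividing by the total, as in the sketch below. Normalizing over annotated genes only is an assumption, since the report does not state the denominator; the gene IDs, abundances and category assignments are hypothetical.

```python
from collections import defaultdict


def relative_abundance_by_category(gene_abundance, gene_category):
    """Sum abundances per category and normalize so the values add up to 1.

    Genes without a category (unannotated) are ignored.
    """
    totals = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        category = gene_category.get(gene)
        if category is not None:
            totals[category] += abundance
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: value / grand_total for category, value in totals.items()}


# Hypothetical abundances in one sample; g4 has no annotation.
abundances = {"g1": 30.0, "g2": 10.0, "g3": 60.0, "g4": 5.0}
categories = {"g1": "Metabolism", "g2": "Metabolism", "g3": "Cellular Processes"}
profile = relative_abundance_by_category(abundances, categories)
```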

Fig 3.5.2 Relative abundance of annotated genes at the top level of the different functional databases. Top panel: KEGG database. Middle panel: eggNOG database. Bottom panel: CAZy database.

3.5.3 Abundance Heat-map of Functional Genes

The abundance distribution of the top 35 functional genes among all samples is displayed in the following heat-map.

Fig 3.5.3 Abundance Heat-map of functional genes. Plotted by sample name on X-axis

and genes on Y-axis. The absolute value of 'Z' represents the distance between the raw

score and the population mean in units of the standard deviation. 'Z' is negative when

the raw score is below the mean, positive when above.

3.6 Comparative Analysis (between samples)

3.6.1 PCA on Communities

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal

transformation to convert a set of observations of possibly correlated variables into a

set of values of linearly uncorrelated variables called principal components [23]. The

first principal component accounts for as much of the variability in the data as

possible, and each succeeding component accounts for as much of the remaining

variability as possible.
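
To make the definition concrete, the sketch below extracts the first principal component of a small abundance table using power iteration on the covariance matrix, in plain Python. It is an illustration only: production analyses would normally use a linear algebra library, the start vector is arbitrary (power iteration fails if it happens to be orthogonal to the leading component), and the abundance values are invented.

```python
import math


def center_columns(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of cov via power iteration."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, eigenvalue


# Hypothetical phylum-level relative abundances (rows: samples, columns: phyla).
abundance = [
    [0.50, 0.30, 0.20],
    [0.45, 0.35, 0.20],
    [0.20, 0.30, 0.50],
    [0.25, 0.25, 0.50],
]
centered = center_columns(abundance)
cov = covariance_matrix(centered)
pc1, variance = first_principal_component(cov)
explained = variance / sum(cov[i][i] for i in range(len(cov)))
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
```

`scores` holds each sample's coordinate on the first component, and `explained` is the fraction of total variance that component accounts for.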

Figure 3.6.1 Principal component analysis of the relative abundance at the phylum level. Each point represents a sample, plotted with the second principal component on the Y-axis and the first principal component on the X-axis. Points are colored by group.

3.6.2 PCA on Functional Genes

Figure 3.6.2 Principal component analysis of the relative abundance of functional genes from three databases. Top panel: KEGG database. Middle panel: eggNOG database. Bottom panel: CAZy database. Each point represents a sample, plotted with the second principal component on the Y-axis and the first principal component on the X-axis. Points are colored by group.

3.6.3 Cluster Analysis on Communities

The Bray-Curtis index is a statistic used to quantify the compositional dissimilarity between two samples, based on the counts in each sample. In ecology and biology, the Bray-Curtis dissimilarity is bounded between 0 and 1, where “0” means the two sites have the same composition (they share all species) and “1” means the two sites do not share any species. Where BC is intermediate (e.g. BC = 0.5), this index differs from other commonly used indices.
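
For reference, the Bray-Curtis dissimilarity between two abundance vectors is the sum of absolute differences divided by the sum of all abundances. The sketch below implements that standard formula; the function name and the example counts are hypothetical.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two equally ordered abundance vectors."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must list the same taxa in the same order")
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    if denominator == 0:
        return 0.0  # two empty samples are treated as identical
    return numerator / denominator


# Hypothetical counts of three taxa in two samples.
bc = bray_curtis([6, 7, 4], [10, 0, 6])
```

A matrix of such pairwise values is what a clustering tree like the one in Figure 3.6.3 would be built from.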

Figure 3.6.3 Clustering tree based on Bray–Curtis distance. Plotted with clustering

tree on the left and the relative phylum-level abundance map on the right side.

3.6.4 Cluster Analysis on Functional Genes

Figure 3.6.4 Clustering tree based on Bray–Curtis distance. The clustering tree is plotted in the centre, and the relative abundance of functional genes at the top level of the three databases is plotted in the outer ring. Top panel: KEGG database. Middle panel: eggNOG database. Bottom panel: CAZy database.

References

[1] Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Computational Biology 2010, 6(2):e1000667.
[2] Teeling H, Glöckner FO. Current opportunities and challenges in microbial metagenome analysis - a bioinformatic perspective. Briefings in Bioinformatics 2012, bbs039.
[3] Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp 2012, 2(3).
[4] Tringe SG, Von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, ... Rubin EM. Comparative metagenomics of microbial communities. Science 2005, 308(5721):554-557.

[5] Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, ... Weissenbach J. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010, 464(7285):59-65.
[6] Raes J, Foerstner KU, Bork P. Get the most out of your metagenome: computational analysis of environmental sequence data. Current Opinion in Microbiology 2007, 10(5):490-498.
[7] Mende DR, Waller AS, Sunagawa S, Jarvelin AI, Chan MM, Arumugam M, Raes J, Bork P. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS One 2012, 7(2):e31386.
[8] Luo et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 2012, 1:18.
[9] Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26:2460-2461.
[10] Liu P, Fang X, Feng Z, Guo YM, et al. Direct sequencing and characterization of a clinical isolate of Epstein-Barr virus from nasopharyngeal carcinoma tissue by using next-generation sequencing technology. J. Virol. 2011, 85:11291-11299.
[11] Huson DH, et al. Integrative analysis of environmental sequences using MEGAN4. Genome Research 2011, 21(9):1552-1560.
[12] Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinformatics 2011, 12:20.
[13] Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Research 2010, 38(12):e132.
[14] Karlsson FH, Tremaroli V, Nookaew I, Bergstrom G, Behre CJ, Fagerberg B, Nielsen J, Backhed F. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 2013, 498(7452):99-103.
[15] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658-1659.
[16] Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012, 28(23):3150-3152.
[17] Kanehisa M. A database for post-genome analysis. Trends Genet 1997, 13(9):375-376.

[18] Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database issue):D277-280.
[19] Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34(Database issue):D354-357.
[20] Powell S, Forslund K, Szklarczyk D, et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res 2014, 42(Database issue):D231-239.
[21] Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B. The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res 2009, 37:D233-238.
[22] Lombard V, Ramulu HG, Drula E, Coutinho PM, Henrissat B. The carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 2014, 42(Database issue):D490-495.
[23] Ekaterina A, Frisli T, Rudi K. De novo Semi-alignment of 16S rRNA Gene Sequences for Deep Phylogenetic Characterization of Next Generation Sequencing Data. Microbes and Environments 2013, 28(2):211-216.