A Town on Fire

Spring 2019

BIOL 312: Microbiology

Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire

Instructor: Dr. Tammy Tobin, Susquehanna University

E-Mail: [email protected]

Preparing the Data and Testing Your Hypotheses

Names of Team Members:

Introduction: During the last class period, you learned how to use Linux and QIIME, and used QIIME to sort and quality-filter your 16S rRNA metagenomic sequences. You also hypothesized a single nitrogen- or sulfur-cycling species that you thought would be likely to be found in the Centralia soil samples. Today you will use QIIME to prepare your sequence data for phylogenetic and statistical analyses. You will then generate phylogenetic trees and bar charts that will allow you to test your hypotheses and determine how diverse your chosen genus is in the Centralia soils.

Getting QIIME Started

This should be review, but here are the steps:

1. Start the terminal program (found in applications – utilities).

2. cd to the Desktop where your Centralia-se-single-sequences directory is located.

3. Type in the following command after the $ prompt: source activate qiime2-2018.2

4. Type ‘yes’ if you are asked to do so; ignore this step if you are not.

5. Type in the password qiime2 at the prompt. Again, ignore this step if you are not asked for the password.

6. You are now running QIIME through your terminal.

Don’t forget to use the Filename Completion function to avoid typos. If you type part of a filename or directory name and press the [Tab] key, the shell will complete the rest of the name automatically. If the shell finds more than one name beginning with the letters you have typed, it will pause, prompting you to type a few more letters before pressing [Tab] again. Finally, if you get an error message, you can type history (all lowercase, because Linux commands are case-sensitive) to view your previous commands and see what went wrong.

Preparing your Data: Using FastTree to Generate Phylogenetic Trees.

Phylogenetic trees diagram the evolutionary relationships between taxa. Trees have nodes, branches and tips. Tips represent the final taxa you are analyzing (Human, Mouse and Fly, or Species A, B, C etc., below). These are often species, but could also be DNA sequences. Nodes are branch points that represent common ancestors of the taxa above them. Branches connect nodes to other nodes or to tips. In our sequence data, branch lengths are related to the number of genetic changes between the ancestral and derived sequences. A clade refers to the grouping that contains the ancestral sequence and all of its descendants.
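The vocabulary above (tips, nodes, clades) can be made concrete with a small sketch. The tree below is a hypothetical three-taxon example written for illustration; it is not the figure from this worksheet, and the function names are assumptions of this sketch.

```python
# Hypothetical rooted tree for illustration only (not the worksheet's figure).
# Internal nodes are tuples of their children; tips are plain strings.
tree = (("Human", "Mouse"), "Fly")


def tips(node):
    """Return the tips (final taxa) at or below a node."""
    if isinstance(node, str):      # a tip has no descendants
        return [node]
    found = []
    for child in node:             # each branch leads to one child
        found.extend(tips(child))
    return found


def clades(node):
    """List the tips of the clade defined by every internal node."""
    if isinstance(node, str):
        return []
    result = [sorted(tips(node))]  # this node plus all of its descendants
    for child in node:
        result.extend(clades(child))
    return result


all_tips = tips(tree)
all_clades = clades(tree)
print(all_tips)
print(all_clades)
```

In this toy tree, Human and Mouse form a clade because they share a node that Fly does not descend from.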

The tree in this figure is rooted. That is, all taxa are assumed to have derived from the single ancestral species present at the root.

https://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Phylogenetics/phylo7.html

The tree in this figure is unrooted. Relationships between individual species are shown, but no ancestral state is inferred.

https://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Phylogenetics/phylo7.html

Question 1. In the figure above, is species B, E or D more closely related to Species A? Explain.

The underlying assumption for all DNA sequence-based phylogenetic analyses is that the more closely related two species are, evolutionarily, the more similar their DNA sequences will be. This underlying assumption does have some flaws that need to be kept in mind. As you have already learned, horizontal gene transfer between species can make two species look more (or less) related than they truly are. Also, not all DNA changes are equal in terms of phenotypic outcome. Some mutations are selectively neutral while others are not. Thus, selective pressures will have an impact on the rate of nucleotide changes observed in different parts of the genome over time. Some analysis metrics take this latter situation into account by weighting base changes in different codon positions differently (to account for silent mutations, etc.). This is not done in 16S rRNA sequence analysis because there are no codons (no protein is produced).

Aligning Sequences

In order to construct a phylogenetic tree, all of the 16S rRNA sequences in our quality-filtered sequence file (the rep-seqs.qza artifact from the last class) will first need to be aligned. Alignment makes sure that the base changes observed between sequences are due to mutations at that site in the gene, rather than to a comparison of two completely different parts of the gene. QIIME will also insert gaps, as needed, to account for the fact that insertions and deletions of bases also occur during evolution. By way of example, take the three related phrases below:

AFATCAT

AFFATCAT

TINYRATFEAREDAFFATCAT

If these phrases were compared without adjusting the default alignment (left justified) above, they would show almost no similarity at all. Even phrases 1 and 2, which are obviously very similar, would have only two letters in common as they are currently aligned: the first A and F. The third sequence (TI) would not match at all. After those first two letters, almost every subsequent letter is different. Aligned versions of these phrases are shown below:

             A_FATCAT
             AFFATCAT
TINYRATFEAREDAFFATCAT

In this scenario, QIIME has shifted phrases 1 and 2 to the right so that they line up with the matching part of phrase 3, and has inserted a gap in phrase 1 to account for the extra F in phrases 2 and 3. This gives a much more accurate picture of the overall sequence identity.
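The effect of alignment on apparent similarity can be checked with a short sketch. It simply counts matching positions in the phrases above, before and after alignment. The function name and the use of "_" as the gap character are choices made for this illustration; this is not how MAFFT itself works.

```python
def matches(a, b):
    """Count positions where two strings carry the same letter (gaps never match)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "_")


# Phrases 1 and 2, left-justified with no gap inserted.
left_justified = matches("AFATCAT", "AFFATCAT")

# Phrases 1 and 2 after a gap is inserted in phrase 1.
aligned = matches("A_FATCAT", "AFFATCAT")

# Phrase 2 against phrase 3, before and after shifting phrase 2 to the right.
phrase3 = "TINYRATFEAREDAFFATCAT"
unshifted = matches("AFFATCAT", phrase3)
shifted = matches("_" * 13 + "AFFATCAT", phrase3)

print(left_justified, aligned, unshifted, shifted)
```

The left-justified comparison of phrases 1 and 2 finds only the first A and F, while the gapped version recovers seven matching letters.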

1. The command for aligning sequences in QIIME is qiime alignment mafft. The input file is your quality-trimmed sequence (rep-seqs.qza), and the output file will be named aligned-rep-seqs.qza (see the script below). To proceed, copy and paste the script below into the terminal command window. This step may take several minutes, so be patient and do not hit return or enter another command until you see the $ prompt.

qiime alignment mafft \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza

2. Next, you need to filter out positions in the alignment that are highly variable and therefore just ‘noise’. Copy and paste the following script into the terminal window.

qiime alignment mask \
  --i-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza
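As a rough picture of what masking does, the sketch below drops alignment columns that are mostly gaps or that show little agreement between sequences. It is a simplified illustration, not QIIME's actual implementation. The toy alignment, the function name, and both threshold values are assumptions chosen for this example.

```python
from collections import Counter


def mask_columns(rows, max_gap_fraction=0.5, min_conservation=0.4):
    """Keep only columns that are mostly ungapped and reasonably conserved.

    Both thresholds are hypothetical values chosen for this toy example.
    """
    keep = []
    for col in range(len(rows[0])):
        letters = [row[col] for row in rows]
        non_gap = [c for c in letters if c != "-"]
        gap_fraction = 1 - len(non_gap) / len(letters)
        if not non_gap or gap_fraction > max_gap_fraction:
            continue                      # column is mostly gaps
        top_count = Counter(non_gap).most_common(1)[0][1]
        if top_count / len(non_gap) >= min_conservation:
            keep.append(col)              # enough sequences agree here
    return ["".join(row[c] for c in keep) for row in rows]


# Toy alignment: column 3 is different in every row, column 6 is almost all gaps.
toy_alignment = ["ACGTA-", "ACTTA-", "ACACAG", "ACCTA-"]
masked = mask_columns(toy_alignment)
print(masked)
```

Column 4 survives even though one sequence differs there, because a single informative change is signal rather than noise.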

Question 2: What is the name of the output file that will result from the command in step 2?

3. Now the aligned sequences are ready to be used to make a phylogenetic tree using FastTree. FastTree infers an approximately-maximum-likelihood tree: it first builds a quick distance-based tree, in which sequences separated by fewer changes are placed closer together, and then refines that tree toward the one that best explains the alignment. The command is:

qiime phylogeny fasttree \
  --i-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza

4. Note that the output for this tree is ‘unrooted’. Since our subsequent analyses will require rooted trees, you need to run a command that roots the tree by assuming that the root lies at the midpoint of the longest tip-to-tip distance in the unrooted tree. The command is:

qiime phylogeny midpoint-root \
  --i-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza
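Midpoint rooting, as described in step 4, can be sketched on a small made-up tree: find the two tips that are farthest apart, then walk halfway along the path between them. The tree, its branch lengths, and all function names below are hypothetical. The sketch shows the idea only and is not QIIME's code.

```python
from itertools import combinations

# Hypothetical unrooted tree: tips A-D, internal nodes n1 and n2, branch lengths as values.
edges = {("A", "n1"): 1.0, ("B", "n1"): 2.0, ("n1", "n2"): 3.0,
         ("C", "n2"): 1.0, ("D", "n2"): 6.0}


def build_graph(edges):
    """Turn the edge list into a two-way lookup of neighbors and branch lengths."""
    graph = {}
    for (u, v), length in edges.items():
        graph.setdefault(u, {})[v] = length
        graph.setdefault(v, {})[u] = length
    return graph


def path_between(graph, start, goal, visited=None):
    """Return the single path between two nodes of a tree as a list of nodes."""
    visited = visited or {start}
    if start == goal:
        return [start]
    for neighbor in graph[start]:
        if neighbor not in visited:
            rest = path_between(graph, neighbor, goal, visited | {neighbor})
            if rest:
                return [start] + rest
    return None


def midpoint_root(edges):
    """Return the branch holding the midpoint root and the distance along it."""
    graph = build_graph(edges)
    tip_names = [n for n, nbrs in graph.items() if len(nbrs) == 1]

    def length(path):
        return sum(graph[a][b] for a, b in zip(path, path[1:]))

    # Longest tip-to-tip path in the tree.
    longest = max((path_between(graph, a, b)
                   for a, b in combinations(tip_names, 2)), key=length)
    half = length(longest) / 2
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        if walked + graph[a][b] >= half:
            return (a, b), half - walked   # root sits this far from node a
        walked += graph[a][b]


root_branch, offset = midpoint_root(edges)
print(root_branch, offset)
```

Here the farthest-apart tips are B and D (11 units), so the root falls 5.5 units from B, which is on the long branch leading to D.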

Question 3: In the unrooted tree to the right, point an arrow to where you think QIIME could place the root (or describe where it would be on the tree).

5. The phylogenetic tree you just generated, if you printed it out at this point, would show all of the sequences and their relationships, but would not have any taxonomic assignments (genus, species, etc.). So, the next step in this analysis is to assign taxonomy to your sequences. QIIME will do this for you. To understand how it does this, first remember that the more similar two sequences are, the more closely related they are considered to be. By way of example, two 16S rRNA gene sequences are often considered to have come from the same species if they share at least 97% sequence identity (genus and higher-level taxonomic groupings require less sequence identity). Taxonomic groups based on DNA sequences are called ‘operational taxonomic units’ (OTUs). Your sequences will be assigned to specific taxonomic groups by using a classifier program to compare them to Greengenes, a reference 16S rRNA gene database with assigned taxonomies. The script is below. Do not worry if this step takes quite a few minutes. The classifier has a lot of work to do!

qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza
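The 97% identity rule of thumb from step 5 can be illustrated with a small calculation. The sequences below are made up, and the function names are assumptions of this sketch. Note that classify-sklearn itself uses a trained machine-learning classifier rather than this direct pairwise comparison, so the sketch illustrates the identity concept only.

```python
def percent_identity(a, b):
    """Percent of aligned, ungapped positions at which two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    same = sum(x == y for x, y in compared)
    return 100.0 * same / len(compared)


def mutate(seq, positions):
    """Return a copy of seq with a different base at each listed position."""
    bases = list(seq)
    for i in positions:
        bases[i] = "C" if bases[i] == "A" else "A"
    return "".join(bases)


# Hypothetical 100-base reference and two made-up query sequences.
reference = "ACGT" * 25
query_close = mutate(reference, [10, 57])            # 2 differences
query_far = mutate(reference, [3, 21, 40, 66, 90])   # 5 differences

close_identity = percent_identity(reference, query_close)
far_identity = percent_identity(reference, query_far)
print(close_identity, close_identity >= 97)
print(far_identity, far_identity >= 97)
```

Under the 97% rule of thumb, only the first query would be grouped with the reference at the species level.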

6. Finally, let’s see how well that last script worked. First, run the following script to convert your QIIME output artifact (qza) into a viewable file (qzv):

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

7. Now use the qiime tools view program. It will open a browser window with your results in it.

qiime tools view taxonomy.qzv

You will see a window that will look something like this:

At this point, search for your hypothesized taxon in the search box. Do this by entering the genus name. Look at the results. Is your species there? If not, don’t worry. The primers we use will not necessarily identify species. If your genus is there, proceed as if your species was there. If your genus is not there, then choose one of the other genera from your team members’ original case study hypothesis worksheet. Once you find a genus that works, use that for the rest of the analyses. You can now go back to your terminal window.

8. Type in ‘q’ to quit qiime tools view and return to your $ prompt.

Question 4. Let’s say you are a member of a team that chose a genus that is not present in our dataset. Does this absence absolutely refute your hypothesis that the genus was present in Centralia soils? Give two reasons why the answer may be ‘no’. We will discuss this in class next time.

9. The last bit of preparation that you need for today is to determine the sampling depth that you will use for your statistical analyses in the next class. You must use exactly the same number of sequences from each sample site (Cen02, etc.) for these analyses. However, your quality filtered files (summarized in table.qza and viewable in table.qzv) do not have equal numbers of sequences from each sample, so you must decide how many sequences to sample from each site. The sampling depth will be the maximum number of sequences you can use before one of the sampling sites is eliminated (because it does not have that many quality filtered sequences in it).

a. Use qiime tools view to open table.qzv in a browser window with the following command.

qiime tools view table.qzv

b. You should see something like this:

c. Next, click on ‘Interactive Sample Detail’ at the top of the window and change the dropdown menu from ‘Barcode’ to ‘Sample’. You will now see something like this:

d. Each bar represents the sequences in a different sampling site. If you slide the bar under ‘Sampling Depth’ to the right, you will see that as you choose to use more and more sequences (greater sampling depth), sampling sites become gray one at a time. This means those samples do not have that many sequences left in them and will be removed from the downstream data analyses if you sample at that depth.

Question 5. Which sampling site has the most sequences in this dataset?

e. Now slowly slide the bar back until all samples are blue again.

Question 6. Which sample had the fewest sequences?

f. At this point, finding the exact sampling depth to use is a matter of using the slide bar and the up and down arrows to find the maximum number of sequences that it is possible to sample before one of the sites (C02) is eliminated. That number is the sampling depth that you will use on the last day of this case study, and it is 105601. That’s right, you will be using over 105,000 sequences per site! Type that number into the sampling depth box, then hit the up arrow. At 105602, sample C02 disappears, which confirms that 105601 is the right number. You are now done for the day!

g. Go back to the terminal window, hit ‘q’, then type ‘exit’ and quit the terminal.
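The sampling-depth rule from step 9 comes down to finding the smallest per-sample sequence count. The sketch below uses the C02 count given in step f. The other two sample names and counts are hypothetical placeholders, not values from the Centralia dataset, and the function names are assumptions of this sketch.

```python
# C02's count comes from step f; the other two entries are hypothetical placeholders.
counts = {
    "C02": 105601,
    "hypothetical_site_A": 120450,
    "hypothetical_site_B": 131002,
}


def max_sampling_depth(counts):
    """Largest depth at which every sample still has enough sequences."""
    return min(counts.values())


def retained(counts, depth):
    """Samples that survive when every sample is subsampled to the given depth."""
    return [sample for sample, n in counts.items() if n >= depth]


depth = max_sampling_depth(counts)
kept_at_depth = retained(counts, depth)
kept_one_deeper = retained(counts, depth + 1)
print(depth, kept_at_depth, kept_one_deeper)
```

Raising the depth by a single sequence drops C02, which mirrors what the slider shows in the browser.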
