
Spring 2018
BIOL 312: Microbiology

A Town on Fire
Metagenomic Analysis of Bacterial Communities in Soils Overlying the Centralia, Pennsylvania Mine Fire

Instructor: Dr. Tammy Tobin
Susquehanna University
E-Mail: [email protected]

Overview
During the first case study session, you learned about the history and biogeochemistry of the Centralia, PA mine fire. As you recall, surface soil temperatures in fire-affected areas regularly exceed 60°C and soils surrounding the vents are often rich in combustion products such as sulfur and nitrogen that microbial communities can use and transform as a part of their energy-generating processes.

During the next two classes, you will use that background information as well as information found in papers that describe ‘typical’ geothermal soils and their microbial communities to hypothesize a single bacterial nitrogen- or sulfur-cycling genus that you would expect to find living in Centralia’s fire-affected soils. You will then test that hypothesis by metagenomic analysis of 16S rRNA gene sequences from 4 Centralia soil samples, using a LINUX-based computer program called Quantitative Insights Into Microbial Ecology 2 (QIIME 2, pronounced ‘chime’). Ultimately, you will make a short oral presentation and write a full-length research article to report your findings and predict the types of impacts that members of your genus might be having on the Centralia ecosystem.

Figure 1: Above: Centralia, PA prior to its evacuation in 1984. The town had over 1800 residents, several businesses and churches. Right: Old Route 61 through Centralia (taken in 1997) showing steam, rich in carbon monoxide, venting upward through cracks caused by land collapses.


Learning Objectives:
As a result of participation in these activities, students will be able to:

1. Explain each step in the generation and analysis of pyrosequencing-based metagenomic 16S rRNA sequence data.

2. Discuss the basic evolutionary assumptions that underlie metagenomic sequence analysis.

3. Evaluate the strengths and weaknesses of the methods employed in the metagenomic analysis of microbial populations, including the impact that data quality has on bioinformatics analyses.

4. Choose and justify the appropriate methods for metagenomics analysis of bacterial community structure.

5. Propose valid hypotheses regarding bacterial community structure, and use bioinformatics to test those hypotheses.

Evaluation
The final evaluation of this project will be based on the successful completion of Team Application Activities and a Final Paper and Presentation.

Figure 2: Steam from “Anthracite Smokers” in Centralia, PA carries dissolved combustion products, such as nitrogen and sulfur, to the surface through soil fractures.

Materials
Recommended Readings:
A Primer on Metagenomics

Computer Resources:
Macintosh computer running Java version 7 or higher.

Access to an Amazon Web Services EC2 large instance or a native install of QIIME 2 (https://docs.qiime2.org/2018.2/install/native/#install-qiime-2-within-a-conda-environment).

FigTree

Metagenomics Sequence Resources:
Centralia metagenomics files C02, C09, C10 and C13. Raw sequences were submitted to the GenBank SRA under accession SRP082686.

Team Application Activities:

Introduction
Students learn about the history and biogeochemistry of the Centralia Mine Fire environment.

Team Activity #1
Students work in teams in order to familiarize themselves with LINUX and QIIME.

Team Activity #2
Students propose hypotheses regarding the types of microbial species they expect to see in thermophilic versus mesophilic soils in Centralia and use QIIME to test their hypotheses.

Team Activity #3
Students use QIIME and FigTree to perform alpha diversity and phylogenetic analyses and begin to prepare their presentations.

Final Presentation
Each student team presents their metagenomic findings.


(Figure 2, continued) As the steam rises, it cools and precipitates chemicals into the surrounding soils, where they can be utilized and transformed by nitrogen- and sulfur-cycling bacterial communities.

Team Application Activity #1: Introduction to Metagenomic Analysis, LINUX and QIIME

Metagenomic Analysis of Bacterial 16S rRNA Genes
As you probably recall from your introductory biology classes, protein synthesis (translation) is catalyzed by the ribosome, a complex structure composed of both proteins and ribosomal RNA (rRNA). Translation begins when the small subunit of the ribosome locates and binds near the 5’ end of the mRNA. Once this has happened, the large ribosomal subunit can attach, and the complete ribosome translocates along the mRNA, catalyzing the formation of peptide bonds between amino acids as they are conveyed to the correct mRNA codon by tRNA.

In order for this process to work correctly, the ribosome must first be able to find the 5’ end of an mRNA and bind to it. That recognition is the job of the 16S rRNA, which contains a 3’ sequence that is complementary to a ribosome-binding site (the Shine-Dalgarno sequence) near the 5’ end of the mRNA. The 16S rRNA genes have been used extensively for phylogenetic analysis since Carl Woese and George Fox first proposed their use in 1977 (Woese and Fox, 1977). Because of its critical function in translation, the 16S rRNA gene contains highly conserved sequences that can be used to design ‘universal’ PCR primers (primers that work for almost all species), as well as highly variable regions that allow taxon identification based on sequence comparison to known taxa. Well-curated databases, such as the Ribosomal Database Project (Cole et al. 2003) and the Greengenes database (DeSantis et al. 2006), contain regularly updated versions of all of the known 16S rRNA gene sequences, along with their phylogenetic assignments, and are invaluable in this process.

Metagenomic analysis (Handelsman, et al. 1998) is the analysis of genetic samples recovered directly from the environment, without any attempt to isolate the microbes from which they came. This type of analysis allows microbiologists to study the vast numbers of uncultured, or unculturable, microbes in any environment. Current high-profile examples of this type of analysis include the Human Microbiome project, the Earth Microbiome project and the Maternal Microbiome project.

In preparation for this case study, soil was collected from 4 boreholes in Centralia, PA (13.6°C, 34.2°C, 54.2°C and 57.4°C), and genomic DNA was directly isolated from the samples using the MoBio PowerSoil Kit. PCR with universal bacterial 16S rRNA V4 region primers was then used to make copies of all of the bacterial 16S rRNA genes in each of these samples. These PCR products were then used as the template for Roche 454 pyrosequencing at the Penn State University genomics lab. You will be using these data to test hypotheses regarding the types of bacteria that live in the hot soils overlying the Centralia, PA mine fire. But first, you must learn a bit about the program that you will be using to perform the analyses.


An Introduction to QIIME 2
(Adapted from Regina Lamendella, GCAT-SEEK Metagenomics Workshop, Summer 2013)

Quantitative Insights Into Microbial Ecology (QIIME) is an open source pipeline that runs in a LINUX environment. It can be used to process next generation sequencing data in a variety of ways that range from making sure that all of your sequences are of high enough quality to be used (quality trimming), to performing a whole suite of phylogenetic and statistical analyses on the quality trimmed data. We will be utilizing many of these functions in this case study, but first you must get used to working in the LINUX environment using the Mac Terminal, which is part of the operating system on all Macs. The Linux and QIIME tutorials that follow are largely the work of Dr. Regina Lamendella at Juniata College. I have tweaked them a bit to be appropriate for our operating system and case study.

Unix/Linux Tutorial
Linux is an open-source Unix-like operating system. It allows the user considerable flexibility and control over the computer by command line interaction. Many bioinformatics pipelines are built for the Unix/Linux environment; therefore it is a good idea to become familiar with Linux basics before beginning bioinformatics.

Every desktop computer uses an operating system. The most popular operating systems in use today are Windows, Mac OS, and UNIX. Linux is an operating system very much like UNIX, and it has become very popular over the last several years. Operating systems are computer programs. An operating system is the first piece of software that the computer executes when you turn the machine on. The operating system loads itself into memory and begins managing the resources available on the computer. It then provides those resources to other applications that the user wants to execute.

The shell - The shell acts as an interface between the user and the kernel. When a user logs in, the login program checks the username and password, and then starts another program called the shell. The shell is a command line interpreter (CLI). It interprets the commands the user types in and arranges for them to be carried out. The commands are themselves programs: when they terminate, the shell gives the user another prompt ($ on our systems) to let the user know that the program has finished.

Useful LINUX shortcuts:
Filename Completion - By typing part of the name of a command, filename or directory and pressing the [Tab] key, the shell will complete the rest of the name automatically. If the shell finds more than one name beginning with the letters you have typed, it will pause, prompting you to type a few more letters before pressing the tab key again.

History - The shell keeps a list of the commands you have typed in. If you need to repeat a command, use the cursor keys to scroll up and down the list or type “history” for a list of previous commands.

Files and Processes
Everything in UNIX is either a file or a process. A process is an executing program identified by a unique process identifier. A file is a collection of data. Files are created by users using text editors, running compilers, etc.


Examples of files:
o A document (report, essay, etc.)
o The Centralia metagenomic sequences and related files
o A directory, containing information about its contents, which may be a mixture of other directories (subdirectories) and ordinary files. For example, your Centralia_Case_Study folder is a directory.

It is not required to have a Linux operating system to use QIIME. We will be running the Linux environment through the Mac Terminal. So first let’s get started.

Team Application Activity: Practicing with LINUX and QIIME 2

Names of Team Members:

Part One: Getting your Files Ready.

Step One: Downloading the required files to your desktop. This has been done for you in 2018.

1. Get into pairs and get one Mac laptop per team of students. Create a new Desktop Folder entitled Centralia-se-single-sequences. Your folder must have exactly that name or subsequent commands in the case study will not work.

2. Go to the shared Dropbox folder and download the 4 sequence files (they will have names like C02-Rep1F.fastq.gz) and the sequence manifest (centralia-se-single-manifest) into your Centralia-se-single-sequences folder. Download the metadata file (Centralia-se-single-metadata) onto your desktop, but NOT into the folder.

Part Two: Practicing with the LINUX environment worksheet

Instructions: Follow each of the steps below, and answer the questions in the spaces provided. Please note that spelling, spaces, cases, etc. are absolutely critical in LINUX. If you get an error message, you can always type history to see what you typed.

3. Double click on the Terminal program icon (located in Applications – Utilities) to open it. You should see your user name followed by a dollar sign (username $). Every time you see a $, it is a prompt telling you that the shell is ready for the next command.

4. Understanding the file structure and knowing how to use some basic Linux commands are essential for using QIIME effectively. Below is a simplified version of the file structure of an example distribution of Linux.


5. The file structure is important when we use the command line, since we need to tell the shell where to find certain files, or where to output the results. The full path to qiime in this example would be /home/qiime.

Question 1: If you want to work in the Shared_Folder directory, what is the full path that you would type to get there? Answer in the space below.

Question 2: You can always determine which directory you are working in by typing pwd. Type it now. What do you see? That is your home directory. Write it down in the space below.

Question 3: If you want to list the contents of a directory, you use the command ls (list). Which files and directories do you see in your home directory? List a few.

6. In order to change directories, you will use the command cd. Navigate to the Desktop directory (it should have shown up when you typed ls) by typing in cd Desktop. The command line should now indicate that you are in the Desktop directory.

Question 4: List a few of the contents of your Desktop.

Question 5: What command did you just use to get that information?

Question 6: Go to the Centralia-se-single-sequences directory. What command did you use to do that?

Question 7: List the files that you see in the Centralia-se-single-sequences directory (note, they should match the file names of the files you downloaded from Dropbox!):


Question 8: You can go up one level in your directories by typing cd .. (note there is a space between the cd and the two periods). Go ahead and do that. Which directory are you in now?

Question 9: cd back to the Centralia-se-single-sequences directory, this time trying the filename completion trick. Type cd Cen and then hit tab. The terminal should autofill the rest of your folder name. Hit return to change to that directory, then cd back to the Desktop.
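The navigation commands from Part Two can be rehearsed safely in a scratch directory before you use them on the course files. The sketch below is only an illustration: the practice_dir folder and the example.fastq.gz file are hypothetical stand-ins created by the script, not part of the Dropbox download.

```shell
# Build a throwaway directory tree to practice in (all names are hypothetical).
scratch=$(mktemp -d)
mkdir -p "$scratch/practice_dir/Centralia-se-single-sequences"
touch "$scratch/practice_dir/Centralia-se-single-sequences/example.fastq.gz"

cd "$scratch/practice_dir"          # change into the practice directory
pwd                                 # print the full path of the current directory
ls                                  # list its contents
cd Centralia-se-single-sequences    # move down one level
listing=$(ls)                       # capture the listing of this directory
echo "$listing"
cd ..                               # move back up one level
here=$(basename "$(pwd)")           # name of the directory we are in now
echo "$here"
```

Because everything lives under a temporary folder, you can delete it afterwards without touching your Desktop.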

Part Three: Sample Metadata

Sample Metadata: Metadata files contain all of the information we have about each of the samples, ranging from the name we have given it to its chemical composition. To view the metadata file:

1. Right-click (control + click) on the Centralia-se-single-metadata file on your desktop and choose Microsoft Excel to open it.

2. You will see a chart that begins with Sample ID (the sample name), and then provides a whole bunch of information about that sample.

Sample ID: A name that I have given to each borehole sample, in this case C02, C09, C10 and C13.

o Since three independent extractions were done for each sample, the name also includes the replicate number (ex: C02-Rep1.fastq). Because these files are huge, analyzing all three replicates would take several hours on a mainframe computer, and we only have a few hours on laptops. So, we will only analyze one of the replicates.

o The ‘fastq’ identifier refers to the sample’s file type. Fastq files include not only the sequence, but also an indication of the quality of that sequence (something called a Phred score). We will use that quality score later to trim poor quality sequences.

Barcode Sequence: A short DNA sequence that is added to the 5’ end of every PCR fragment in a sample. As you can see, every PCR fragment from the first replicate of C09 has the barcode sequence GCGATATATCGC at its 5’ end. This will let the QIIME program sort the sequences based on which sample site they came from in later analyses.

Soil temperature and chemical composition: Temperature and chemical analyses were performed on each of the borehole samples. If you look at sample C09, you will see that it came from a borehole that was 34.2°C and had 75 ppm sulfate.

Other stuff: Sample identifiers and information that are critical for my recordkeeping purposes, but that are not relevant to your analyses.
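To make the fastq file type more concrete: each read in a FASTQ file occupies four lines (a header beginning with @, the sequence, a + separator, and a quality string with one character per base). The sketch below writes a tiny made-up file and inspects it. The reads and quality strings are invented for illustration, and it assumes the barcode is still attached to the start of each read, which may not be true of the course files.

```shell
# Write a tiny, made-up FASTQ file with two reads (not real Centralia data).
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1
GCGATATATCGCACGT
+
IIIIIIIIIIIIIII5
@read2
GCGATATATCGCTTGA
+
IIIIIIIIII555+++
EOF

# Each record is 4 lines, so the number of reads is the line count divided by 4.
n_reads=$(awk 'END { print NR / 4 }' "$fq")
echo "reads: $n_reads"

# Line 2 of every record is the sequence; here its first 12 bases are the barcode.
barcodes=$(awk 'NR % 4 == 2 { print substr($0, 1, 12) }' "$fq" | sort -u)
echo "barcodes: $barcodes"
```

For a compressed file such as C02-Rep1F.fastq.gz, the same awk commands can read from gunzip -c instead of from a plain file.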


Question 10. What is the barcode sequence for C10?

Question 11. What is the ammonia content in ppm for C13?

Question 12. What was the temperature of C02?

Part Four: Importing Sequences into QIIME 2

1. Go back to the terminal and make sure you are in the Desktop directory.

2. Next, type ‘source activate qiime2-2018.2’ (without the quotes). This will load the QIIME 2 environment that we will use.

3. If you see warning text, type “yes” to tell the computer it is OK to proceed. If not, just proceed.

4. Next you may need to type in the password: qiime2 and hit return. Do not panic at this step when you do not see anything as you type. It is ok! If you are not asked for a password, just proceed.

5. In the next step, you must import your sequences into the QIIME environment. The command you will be using to do this is:

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path Centralia-se-single-sequences/centralia-se-single-manifest.csv \
  --output-path Centralia-single-end-demux.qza \
  --source-format SingleEndFastqManifestPhred33

But what does all of that mean? The first line tells the computer which program you will be using, in this case qiime tools import. Next, you must tell the program what kind of data you are importing (type), where the sequences that will be imported are (input-path), where you want the final imported sequences to be located and what to call them (output-path), and the format of the sequences you will be importing (source-format). In this case the format is:

o SingleEnd: single-end sequences; only the forward primer was used to generate the sequence.
o FastqManifestPhred33: the quality data are included with the sequences, and the centralia-se-single-manifest file is used by the program to locate the raw sequence files, e.g. C02-Rep1F.fastq.gz. The 33 refers to the way Phred scores are encoded in the files from our particular sequencer: each quality character's ASCII code is the Phred score plus 33.
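As a small illustration of what the "33" in Phred33 means, the sketch below converts single quality characters into Phred scores by subtracting 33 from each character's ASCII code. The helper name phred33 is made up for this example and is not part of QIIME.

```shell
# Convert one FASTQ quality character to its Phred score (Phred+33 encoding).
# phred33 is a hypothetical helper written for this handout, not a QIIME command.
phred33() {
  # printf with a leading quote prints the character's ASCII code.
  code=$(printf '%d' "'$1")
  echo $((code - 33))
}

phred33 'I'   # ASCII 73, prints 40
phred33 '5'   # ASCII 53, prints 20
phred33 '+'   # ASCII 43, prints 10
```

Reading a quality string this way shows why a run of I characters marks very reliable bases, while characters such as + flag bases that are much less certain.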

6. Copy and paste the command above (starting with qiime and ending with 33) into your terminal window and hit return.

7. When the program has finished running, you will see a $ sign again.

Question 13. List the contents of your desktop now. What is the name of the new file that was created by qiime tools import?

Question 14. Look at the output-path line of the command you typed. Is this the filename that you expected to see?

Part Five: Using QIIME to quality trim the sequence files and to Remove Erroneous Sequences.

Sample Quality: Sample quality is measured by something called a Phred score (Q) that is reported for each of the bases in a sequence. These scores are included in your raw sequence (.fastq) files.

The formula for Q is shown below (P is the probability of a base calling error):

$$Q = -10 \log_{10} P$$

So, a Phred score of 10 means that there is a 1/10 chance of an incorrect base call (90% accuracy) and a Phred score of 40 means that there is a 1/10,000 chance of an incorrect base call. If a sequence does not have adequate quality scores QIIME will filter it out of the downstream analyses. Phred scores of 30 are typically used to filter sequences in the analyses that follow.
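The relationship between Phred score and error probability can be checked at the command line. Rearranging the definition Q = -10 log10(P) gives P = 10^(-Q/10), and the sketch below evaluates that with awk. The function name phred_to_error is invented for this example.

```shell
# Error probability for a given Phred score: P = 10^(-Q/10).
# phred_to_error is a hypothetical helper name used only in this handout.
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error 10   # prints 0.1    (1 in 10)
phred_to_error 30   # prints 0.001  (1 in 1,000)
phred_to_error 40   # prints 0.0001 (1 in 10,000)
```

Every increase of 10 in the Phred score makes a base calling error ten times less likely.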

Question 15. If you were to go ahead to use sequences with low Phred scores, how do you think that would impact your final phylogenetic analysis?

8. So how do we use this information to make sure that we only use high quality sequences? We will use the qiime quality-filter q-score program. In this analysis, qiime will analyze each of your sequences and filter out any that are of insufficient quality, according to Phred scores.

The flow chart for this program is as seen to the left, as described by Bokulich et al. (2013).

First, each base of a sequence is analyzed for its q value. If the value is equal to or above the cutoff value, then the next base in the sequence will be analyzed. This continues until the last high quality base is reached, and all bases after that point are cut off.

Next, the sequence is analyzed to see how long it still is (p). For example, if a sequence is now only 10 bases long, it will not make much sense to keep using it. The cutoff value for p is usually 75% of the full-length sequence.

In the remaining sequence, a maximum number of ambiguous (N) calls (n) and lower quality calls (r) are then used and any sequences that have too many of either are excluded from the analysis.

Finally, extremely rare sequences are removed, as they are assumed to be artifacts.
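The first steps of that flow chart can be sketched for a single read as follows. This is a simplified illustration and not QIIME's actual code: it cuts the read at the first base below the cutoff, requires the remainder to be at least a given fraction of the original length, and (as an assumption made for simplicity) allows no ambiguous N calls at all. The function name filter_read and the example reads are invented, and the quality strings must not contain backslashes because awk -v would interpret them.

```shell
# Simplified per-read quality filter (illustration only, not QIIME's implementation).
# Usage: filter_read SEQUENCE QUALITY MIN_Q MIN_FRACTION
filter_read() {
  awk -v seq="$1" -v qual="$2" -v minq="$3" -v minfrac="$4" 'BEGIN {
    # Lookup table: printable ASCII character -> numeric code.
    for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    keep = 0
    # Walk along the read until the first base whose Phred score is below the cutoff.
    for (i = 1; i <= length(qual); i++) {
      if (ord[substr(qual, i, 1)] - 33 < minq) break
      keep = i
    }
    trimmed = substr(seq, 1, keep)
    # Discard reads that are now too short or that contain ambiguous (N) calls.
    if (keep < minfrac * length(seq) || trimmed ~ /N/) print "DISCARD"
    else print trimmed
  }'
}

filter_read ACGTACGT 'IIIIIII+' 30 0.75   # 7 of 8 bases pass, read is kept
filter_read ACGTACGT 'III+IIII' 30 0.75   # only 3 of 8 bases pass, read is discarded
```

The second read is discarded even though most of its bases are high quality, because one poor base early in the read cuts it below 75% of its original length.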

To quality filter your imported sequences (from the Centralia-single-end-demux file), you would copy and paste the command below into your terminal window, but DO NOT DO THIS STEP. It will take approximately 3 hours to run. The final output files were included in the items you downloaded from Dropbox.

qiime quality-filter q-score \
  --i-demux Centralia-single-end-demux.qza \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats demux-filter-stats.qza

9. The previous step removes identifiably poor-quality sequences based on Phred scores, but it does not remove high quality sequences that, nevertheless, contain errors. It turns out that these errors are relatively common in PCR-generated sequences because 1) some of the DNA polymerases used do not have a proof-reading ability, and, 2) even when they do, the millions of replication events in a typical PCR reaction virtually guarantee that some errors will be present in the final sequences.

Question 16. What would happen to the validity of your phylogenetic analyses if you used DNA sequences with errors in them to identify bacterial community members?

10. The qiime deblur denoise program identifies and removes sequences that contain common PCR artifacts as well as those that differ from the true biological sequences from which they were derived. Before the program can be run, all of the sequences in the sample must be trimmed to the same length. Sequence quality tends to go down at the end of a sequencing read, as seen in the figure below, so we will choose a trim length that maximizes sequence quality without making the sequences too short.

Question 17. Based on the figure to the left, the trim length was set at 120. Explain why.

To perform qiime deblur denoise you would type the script below. Again DO NOT DO THIS – it will take hours to run. As before, the output files for this analysis are included in the items you downloaded from Dropbox.

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered.qza \
  --p-trim-length 120 \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza
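To see what trimming to a common length does, the sketch below cuts every sequence to a fixed number of bases and drops any sequence that is shorter than that. It is a toy illustration with invented sequences and a trim length of 8 rather than 120. The assumption that too-short reads are dropped, rather than kept, reflects my understanding of how Deblur behaves and should be checked against its documentation.

```shell
# Trim every sequence to a fixed length and drop any that are too short.
# Usage: trim_to_length LENGTH < sequences.txt   (one sequence per line)
# trim_to_length is a hypothetical helper, not part of QIIME or Deblur.
trim_to_length() {
  awk -v len="$1" 'length($0) >= len { print substr($0, 1, len) }'
}

# Three made-up reads of length 10, 4 and 8; only the first and third survive.
printf 'ACGTACGTAC\nACGT\nACGTACGT\n' | trim_to_length 8
```

A longer trim length keeps more information per read but discards more reads, which is the trade-off behind choosing the value in Question 17.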

11. You will now need to change the names of two of your files so that future analyses will work. After running the following two commands, you should see the rep-seqs.qza and table.qza files on your desktop.

mv rep-seqs-deblur.qza rep-seqs.qza
mv table-deblur.qza table.qza

12. Ultimately, you need to know which sequences came from which sampling site, so you will create a feature table that includes the metadata information. The output file can be visualized (it has a qzv file type) in the next class period using qiime tools view. The command to use is:

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file Centralia-se-single-metadata.txt

13. Congratulations: you have just completed the first step of this QIIME workflow! During the next class you will do the next part of the analyses.

Question 18. In the meantime, you have one more task for today. As a team, choose the species you want to look for in Centralia’s soils. Write the Genus and species of that microbe here:
