Page 1: Practical Course in Genome Bioinformaticsekhidna.biocenter.helsinki.fi/downloads/teaching/spring... · 2017. 11. 2. · • Practical course in genome bioinformatics • 5 credits

Practical Course in Genome Bioinformatics

Day 1 - Friday 20th January 2017

Course Introduction

• Practical course in genome bioinformatics

• 5 credits

[email protected]

• Presents a genome project of a real biological organism with an emphasis on the practical aspects of the project

• Grading is based on 7 work reports, one returned after each course day

• http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/

What is involved in a Genome Project?

• After experimental design, library construction and other wet-lab preparation, a genome project involves the following steps from a bioinformatics perspective:

• Sequencing and sequencing platforms

• De novo Assembly

• RNA-sequencing and Mapping

• Ab initio Gene Prediction

• Protein annotation

• Submission and publication of genome in database

• Further downstream analysis

Pre-history of DNA sequencing

• 1866: Haeckel suggests that the nucleus is somehow responsible for the transmission of heritable traits

• 1869: DNA, then called "nuclein", first isolated by Friedrich Miescher (largely ignored)

• 1879-1882: Flemming identifies chromosomes (coins the terms "chromatin" and "mitosis")

• 1880s-90s: Boveri suggests that chromosomes contain the genetic material and that different chromosomes carry different heritable traits

• Interim: Scientific consensus is that proteins contain genetic, heritable information

• 1944: Work by Avery, MacLeod, and McCarty demonstrates DNA to be fundamental to heredity

• 1953: Watson and Crick discover the structure of DNA using Franklin and Wilkins's X-ray crystallography research

• 1955: Protein sequence for Insulin determined (Sanger)

Pre-history of DNA sequencing

• Early 1970s: The 5′ overhanging ‘cohesive’ ends of Enterobacteria phage λ are sequenced by using DNA polymerase to fill in the ends with radioactive nucleotides

• Later generalised, but limited to short sequences and very tedious (2D fractionation, analytical chemistry)

• Mid 1970s: Single separation by polynucleotide length using electrophoresis through polyacrylamide gels (Sanger and Coulson 1975, Maxam and Gilbert 1977)

• Sanger and colleagues sequenced the first DNA genome (bacteriophage φX174, 5,386 bp) (Sanger et al. 1977) - complex, but widely adopted

DNA Sequencing

• We refer to DNA sequencing methods as belonging to three generations

• First generation 1977: chain-termination ("Sanger") sequencing (monoclonal, later PCR)

• Second generation 2005: parallel sequencing (high-throughput, next-generation)

• Third generation 2010/11: single molecule sequencing

Sanger and Gilbert: Nobel Prize in Chemistry, 1980

Sanger Sequencing

• Read lengths ~900 bp

• Very high quality (used to verify NGS results)

Sanger Sequencing: Improvements

• Improvements to first generation sequencing enabled the process to be automated, for example:

• Phospho- or tritium-radiolabelling replaced with fluorometric detection (only 1 lane required vs. 4)

• Capillary-based electrophoresis for improved detection

• Development of PCR (Mullis)

• Automation enabled commercial development and made shotgun sequencing practical: overlapping fragments are sequenced and then assembled into longer contiguous sequences (contigs)
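The idea of assembling overlapping fragments into contigs can be illustrated with a toy greedy merge. This is only a minimal sketch under strong assumptions (error-free reads, all from the same strand, no repeats, no read contained in another); the function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical, and real assemblers use far more robust methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, keeping the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Toy example: four overlapping "reads" cut from a made-up 25 bp sequence.
genome = "ATGGCGTACGTTAGCCTAGGATCCA"
reads = [genome[0:10], genome[6:16], genome[12:22], genome[17:25]]
print(greedy_assemble(reads))
```

In this toy case the four reads merge back into a single contig identical to the made-up sequence. With sequencing errors or repeated regions the greedy choice can mis-join fragments, which is one reason real assembly is harder than this sketch suggests.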

DNA Sequencing

• We refer to genome sequencing methods as belonging to three generations

• First generation 1977: chain-termination ("Sanger") sequencing (monoclonal, later PCR)

• Second generation 2005: massively parallel sequencing (high-throughput, next-generation)

• Third generation 2010/11: single molecule sequencing

Massively Parallel Sequencing

• "Sequencing-by-synthesis" (Sanger is also SBS: both require DNA polymerase to produce the observable output)

• Amplification of DNA by bridge PCR

• Detection with CCD camera

• Massive number of reads per run (MiSeq 20M, NextSeq 1G ...)

Sequencing-by-synthesis

• Prepare sample

• Randomly fragment genomic DNA

• Ligate adapters to both ends of the fragments

Adapted from Illumina Sequencing Technology documentation

Sequencing-by-synthesis

• Attach DNA to surface

• Bind single-stranded fragments randomly to the inside surface of the flow cell channels

Sequencing-by-synthesis

• Bridge amplification

• Add unlabeled nucleotides and enzyme to initiate solid-phase bridge amplification

Sequencing-by-synthesis

• Fragments become double stranded

• The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate

Sequencing-by-synthesis

• Denature double-stranded molecules

• Denaturation leaves single-stranded templates anchored to the substrate

Sequencing-by-synthesis

• Complete amplification

• Several million dense clusters of double-stranded DNA are generated in each channel of the flow cell

Sequencing-by-synthesis

• Determine base

• Sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase

Sequencing-by-synthesis

• Image base

• After laser excitation, the emitted fluorescence from each cluster is captured and the base is identified
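The imaging step can be thought of as picking, for each cluster and cycle, the base whose fluorescence channel is brightest. The sketch below is a simplified illustration and not the instrument's actual base caller: it assumes one non-negative intensity per base (A, C, G, T) per cycle, and it uses a hypothetical purity threshold `min_purity` (brightest signal divided by the sum of the two brightest) to emit "N" when two channels are too close to call.

```python
BASES = "ACGT"


def call_base(signal, min_purity=0.6):
    """Call one base from four channel intensities given in A, C, G, T order.

    Returns "N" when the brightest channel does not clearly dominate
    the second brightest one (assumed, illustrative threshold).
    """
    order = sorted(range(4), key=lambda i: signal[i], reverse=True)
    top, second = signal[order[0]], signal[order[1]]
    if top <= 0:
        return "N"  # no usable signal in any channel
    purity = top / (top + second)
    return BASES[order[0]] if purity >= min_purity else "N"


def call_read(cycles, min_purity=0.6):
    """Call a whole read: one intensity tuple per sequencing cycle."""
    return "".join(call_base(signal, min_purity) for signal in cycles)


# Made-up intensities for one cluster over four cycles.
cycles = [
    (900, 50, 30, 20),   # clear A
    (40, 60, 850, 30),   # clear G
    (400, 380, 20, 10),  # A and C too close to call
    (10, 20, 30, 700),   # clear T
]
print(call_read(cycles))
```

Because every cluster is imaged in every cycle, the read grows by one base per cycle, and an ambiguous signal in one cycle affects only that position.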

DNA Sequencing

• We refer to genome sequencing methods as belonging to three generations

• First generation 1977: chain-termination ("Sanger") sequencing (monoclonal, later PCR)

• Second generation 2005: massively parallel sequencing (high-throughput, next-generation)

• Third generation 2010/11: single molecule sequencing

Single Molecule Sequencing

• Far longer read lengths (up to hundreds of kb!)

• PacBio: eavesdrops on a single DNA polymerase molecule contained in a zero-mode waveguide (ZMW)

• Oxford Nanopore: measures the current through a nanopore to determine the base or k-mer currently in the pore

Commercial Sequencing Platforms

• As of 2017, there are several options: Illumina, PacBio, Ion Torrent, Oxford Nanopore, Roche 454 (obsolete, but still around)

• Important metrics from bioinformatics perspective:

• Average read length (basepairs)

• Total sequence output (basepairs per run)

• Error profile (average accuracy and platform-specific biases)
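These metrics are straightforward to compute from a set of reads. The sketch below is illustrative only: it assumes each read is a (sequence, quality string) pair with FASTQ-style Phred+33 quality encoding, which is an assumption not stated on the slides, and the function name `summarise_reads` is hypothetical. The expected error rate is estimated from the quality scores via p = 10^(-Q/10).

```python
def summarise_reads(reads):
    """Summarise a list of (sequence, quality_string) pairs.

    Quality strings are assumed to be Phred+33 encoded, so each
    character maps to Q = ord(char) - 33 and error p = 10 ** (-Q / 10).
    """
    lengths = [len(seq) for seq, _ in reads]
    total_bases = sum(lengths)
    expected_errors = sum(
        10 ** (-(ord(ch) - 33) / 10) for _, qual in reads for ch in qual
    )
    return {
        "n_reads": len(reads),
        "total_bases": total_bases,
        "mean_length": total_bases / len(reads),
        "mean_error_rate": expected_errors / total_bases,
    }


# Two made-up reads: one high quality (Q40, "I") and one low quality (Q10, "+").
reads = [("ACGT", "IIII"), ("ACGTAC", "++++++")]
print(summarise_reads(reads))
```

An average like this hides platform-specific biases such as errors concentrated at one end of the read, so in practice it is worth looking at per-position error as well.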

NGS Platforms: Illumina

• Read lengths (100-300 bp) and total sequence per run depend on the platform

• Error rate: <1%

• Errors tend to be substitutions, biased towards 3' end

Illumina MiSeq Error Profile

Schirmer et al. Nucl. Acids Res. 2015

NGS Platforms: Pacific Biosciences

• Variable length long reads

• Error rate: 11-15% (CLR) or <1% (CCS)

• Errors stochastically distributed
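Because the errors are randomly distributed rather than systematic, reading the same molecule several times and taking a consensus can bring the error rate down, which is the intuition behind the gap between single-pass (CLR) and consensus (CCS) accuracy. The sketch below is a deliberately simplified model, not PacBio's consensus algorithm: it assumes independent errors with the same probability `p` in every pass and counts a position as wrong only when more than half of the passes are wrong. The function name `consensus_error` is hypothetical.

```python
from math import comb


def consensus_error(p: float, n_passes: int) -> float:
    """Probability that a strict majority of n_passes independent reads
    of the same position are wrong, given per-pass error probability p.
    """
    k_min = n_passes // 2 + 1  # smallest number of wrong passes forming a majority
    return sum(
        comb(n_passes, k) * p**k * (1 - p) ** (n_passes - k)
        for k in range(k_min, n_passes + 1)
    )


# Per-pass error of 13% (within the 11-15% range quoted above).
for n in (1, 3, 5, 9, 15):
    print(n, consensus_error(0.13, n))
```

Under these assumptions the consensus error falls quickly as passes are added. Systematic errors, which would recur in every pass, would not be removed this way.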

PacBio RS II read length distribution using P6-C4 chemistry

Rhoads and Au, Genomics, Proteomics & Bioinformatics, 2015

NGS Platforms: Others

• Ion Torrent: ~400 bp single-end reads, 80 M reads, 98% accuracy, homopolymer errors

• Roche 454: 700 bp single-end reads, 1 M reads, 99% accuracy, homopolymer errors

• SOLiD: 50+(35/50) bp paired-end reads, ~1.4 M reads, 99.9% accuracy, palindrome errors, AT bias

• Oxford Nanopore: up to 200 kb reads, ~1.5 Gb sequence, 88% accuracy, indel errors

Gigabases per run vs. Read length

Genomic Data Visualisation

• Why is visualisation important?

• Provides an overview, makes it easier to spot errors

• Communicates work to collaborators

• Publication figures

• Crucial at all steps of a project!

DNAPlotter - genome visualisation

"DNAPlotter: circular and linear interactive genome visualization", Carver et al., Bioinformatics, 2008

Mauve - multiple genome alignment

"Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements", Darling et al., Genome Research, 2004

UCSC Genome Browser

"The human genome browser at UCSC", Kent et al., Genome Research, 2002

Computer Exercises

• Today:

• Visualising a bacterial genome with DNAPlotter

• Aligning multiple E. coli genomes with Mauve

• Using UCSC Genome Browser utilities

• Exercises: http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/Exercises_day1.pdf