DM Church Last Updated: 7 May 2012
Intro to Next Generation Sequencing
http://omicsmaps.com/ Nick Loman and James Hadfield
Koboldt et al., 2010 (Figure 3)
Bench work to build libraries and sequence
Clean up and QA reads
Alignments to Genome or Transcriptome
Analysis of Alignments
Koboldt et al., 2010
Sample Contamination
Library chimeras
Sample mix-ups
Tumor-normal switches
Run quality
Koboldt et al., 2010 (Fig 4A)
Chor et al., 2009
CLC bio
GCTACGGCATTCAGGCATCAGGCATTAGCAG
GGCATTCAGGGATCAGGCATTAGC->
<-CATGGCATTCAGGGATCAGGCATT
<-GCCATGGCATTCAGGGATCAGGC
CATTCAGGGATCAGGCATTAGCAG->
GGCATTCAGGGATCAGGCATTAGC->
CATTCAGGGATCAGGCATTAGCAG->
GGCATTCAGGGATCAGGCATT->
<-GGATCAGGCATTAGCAG
<-GATCAGGCATTAGCAG
<-GGATCAGGCATTAGCAG
High Coverage: qualities may not be needed
Low Coverage: qualities are important
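One way to see why qualities matter more at low coverage is to weight each observed base by its probability of being correct. The sketch below is only an illustration under simple assumptions (independent errors, PHRED-scaled qualities); the function names are hypothetical and this is not how any particular variant caller works.

```python
from collections import defaultdict

def phred_to_error_prob(q):
    """PHRED definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

def weighted_call(bases, quals):
    """Pick the base with the largest summed probability of being correct.

    bases: bases observed at one reference position, one per read.
    quals: matching PHRED quality scores.
    """
    weight = defaultdict(float)
    for base, q in zip(bases, quals):
        weight[base] += 1.0 - phred_to_error_prob(q)
    return max(weight, key=weight.get)

# Three reads only: a plain majority vote would pick C,
# but both C observations are very low quality (Q3).
call = weighted_call("GCC", [40, 3, 3])
```

With dozens of reads covering a position, a plain majority vote and the weighted vote usually agree, which is the sense in which qualities "may not be needed" at high coverage.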
Custodia-Lora et al., 2003
FASTQ Example
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example, Illumina stores quality scores ranging from 0-62 (ASCII offset 64); Sanger quality scores range from 0-93 (ASCII offset 33). Solexa quality scores have to be converted to PHRED quality scores.
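The conversions described above can be sketched in a few lines. This is a minimal illustration, assuming one-character-per-base quality strings and the offsets and Solexa-to-PHRED formula given in Cock et al.; the function names are made up for this example.

```python
import math

def illumina64_to_sanger(qual):
    """Re-encode a Phred+64 quality string (Illumina 1.3+) as Phred+33 (Sanger)."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)

def solexa_to_phred(q_solexa):
    """Convert one Solexa-scaled score to the nearest PHRED score.

    Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
    """
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))

def solexa64_to_sanger(qual):
    """Re-encode a Solexa+64 quality string as Phred+33 (Sanger)."""
    return "".join(chr(solexa_to_phred(ord(c) - 64) + 33) for c in qual)

sanger = illumina64_to_sanger("hhh")   # 'h' is Q40 at offset 64
```

The two scales agree closely at high quality and differ mostly below about Q10, so the Solexa conversion matters most for poor bases.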
SAM (Sequence Alignment/Map)
• It may not be necessary to align reads from scratch… you can instead use existing alignments in SAM format
– SAM is the output of aligners that map reads to a reference genome
– Tab delimited w/ header section and alignment section
• Header lines begin with @ (the header is optional)
• Alignment section has 11 mandatory fields
– BAM is the binary format of SAM
http://samtools.sourceforge.net/
http://samtools.sourceforge.net/SAM1.pdf
Mandatory Alignment Fields
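As a rough sketch of how the 11 mandatory fields map onto a tab-delimited alignment line, the snippet below splits one line into named fields. Field names follow the SAM specification (older versions call RNEXT, PNEXT, and TLEN by the names MRNM, MPOS, and ISIZE); the class name, function name, and example record are made up for illustration.

```python
from typing import List, NamedTuple, Optional

class SamAlignment(NamedTuple):
    qname: str        # query (read) name
    flag: int         # bitwise flag
    rname: str        # reference sequence name
    pos: int          # 1-based leftmost mapping position
    mapq: int         # mapping quality
    cigar: str        # CIGAR string
    rnext: str        # reference name of the mate / next read
    pnext: int        # position of the mate / next read
    tlen: int         # observed template length
    seq: str          # read sequence
    qual: str         # base qualities, ASCII Phred+33
    tags: List[str]   # optional fields, left unparsed here

def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Return None for header lines (starting with @), else the parsed alignment."""
    if line.startswith("@"):
        return None
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-delimited fields")
    return SamAlignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                        f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])

# Hypothetical record, not taken from the slides.
rec = parse_sam_line("read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0")
```

For real files, an established library is a better choice than hand parsing; this only shows the field layout.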
http://samtools.sourceforge.net/SAM1.pdf
Alignment Examples
Alignments in SAM format
Valid BED files

chr1 86114265 86116346 nsv433165
chr2 1841774 1846089 nsv433166
chr16 2950446 2955264 nsv433167
chr17 14350387 14351933 nsv433168
chr17 32831694 32832761 nsv433169
chr17 32831694 32832761 nsv433170
chr18 61880550 61881930 nsv433171

chr1 16759829 16778548 chr1:21667704 270866 -
chr1 16763194 16784844 chr1:146691804 407277 +
chr1 16763194 16784844 chr1:144004664 408925 -
chr1 16763194 16779513 chr1:142857141 291416 -
chr1 16763194 16779513 chr1:143522082 293473 -
chr1 16763194 16778548 chr1:146844175 284555 -
chr1 16763194 16778548 chr1:147006260 284948 -
chr1 16763411 16784844 chr1:144747517 405362 +
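A minimal sketch of reading BED lines like the ones on this slide, assuming whitespace-separated columns in the standard order (chrom, start, end, then optional name, score, strand). BED coordinates are 0-based and half-open, so the feature length is simply end minus start. The function name and returned dictionary layout are illustrative choices.

```python
def parse_bed_line(line):
    """Parse the first three to six BED columns into a dict."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, start, end")
    start, end = int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError("invalid interval")
    strand = fields[5] if len(fields) > 5 else None
    if strand is not None and strand not in ("+", "-", "."):
        raise ValueError("strand must be +, - or .")
    return {
        "chrom": fields[0],
        "start": start,               # 0-based, inclusive
        "end": end,                   # 0-based, exclusive
        "name": fields[3] if len(fields) > 3 else None,
        "score": int(fields[4]) if len(fields) > 4 else None,
        "strand": strand,
        "length": end - start,        # half-open interval
    }

bed4 = parse_bed_line("chr1 86114265 86116346 nsv433165")
bed6 = parse_bed_line("chr1 16759829 16778548 chr1:21667704 270866 -")
```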
GTF
GVF format

##gff-version 3
##gvf-version 1.02
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10090
##genome-build NCBI MGSCv36
##assembly-name MGSCv36
##assembly-accession GCF_000001635.15
##file-date 2011-11-18
# Study_accession: Combined studies on MGSCv36
# Display_name: Combined studies on MGSCv36
# Study_description: Combined studies on MGSCv36
chr1 dbVar copy_number_variation 90044442 90114410 . . . ID=nsv433533;Name=nsv433533;Start_range=.,90044442;End_range=90114410,.
chr4 dbVar copy_number_variation 121483931 121646639 . . . ID=nsv433534;Name=nsv433534;Start_range=.,121483931;End_range=121646639,.
chr9 dbVar copy_number_variation 109128634 109146964 . . . ID=nsv433535;Name=nsv433535;Start_range=.,109128634;End_range=109146964,.
chr17 dbVar copy_number_variation 30240627 30614866 . . . ID=nsv433536;Name=nsv433536;Start_range=.,30240627;End_range=30614866,.
chr17 dbVar copy_number_variation 30983722 31036099 . . . ID=nsv433537;Name=nsv433537;Start_range=.,30983722;End_range=31036099,.
chr17 dbVar copy_number_variation 34907088 34962504 . . . ID=nsv433538;Name=nsv433538;Start_range=.,34907088;End_range=34962504,.
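GVF builds on GFF3, so each feature line has nine columns, with the ninth holding semicolon-separated key=value attributes. The sketch below parses one feature line from this slide; it splits on any whitespace because the slide text uses spaces, whereas real GVF files are tab-delimited. The function name is hypothetical.

```python
def parse_gvf_line(line):
    """Return None for pragma/comment lines, else a dict of the nine columns."""
    if line.startswith("#"):
        return None
    cols = line.split(None, 8)   # at most 9 columns; attributes stay in one piece
    if len(cols) != 9:
        raise ValueError("expected 9 columns")
    attributes = {}
    for pair in cols[8].strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    start, end = int(cols[3]), int(cols[4])
    return {
        "seqid": cols[0], "source": cols[1], "type": cols[2],
        "start": start, "end": end,   # 1-based, inclusive (GFF3 convention)
        "score": cols[5], "strand": cols[6], "phase": cols[7],
        "attributes": attributes,
        "length": end - start + 1,
    }

feature = parse_gvf_line(
    "chr1 dbVar copy_number_variation 90044442 90114410 . . . "
    "ID=nsv433533;Name=nsv433533;Start_range=.,90044442;End_range=90114410,."
)
```

Note the contrast with BED: GFF3-style coordinates are 1-based and inclusive, so the length calculation adds one.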
http://www.ncbi.nlm.nih.gov/dbvar
http://www.ebi.ac.uk/dgva
http://www.ncbi.nlm.nih.gov/snp
Derived data
Derived data
Actual data
[Figure: Trace and SRA Holdings. Series: TraceArchive Bases, SRA Bases, SRA Bytes. X-axis: Oct 2000 to Sep 2011. Y-axis: log scale, 100000000 to 1000000000000000.]
Getting exponential growth under control
Trace Organization
seq1: FASTA, Quality, Chromatogram, Experimental info, Sample
seq2: FASTA, Quality, Chromatogram, Experimental info, Sample
SRA Organization
Experiments
Samples
Sequences and Qualities
[Figure: Bytes per base in SRA. Series: Cumulative, Incremental, Moving Average. X-axis: Feb 2008 to Dec 2011. Y-axis: 0 to 10. Annotated periods: Era of NGS Explosion, FASTQ Era, Bits/Base Era.]
As of April 10, 2012, SRA contains fewer bytes than bases
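To make the bytes-per-base figure concrete, the sketch below compares the text size of one FASTQ record with the same bases packed at two bits each. It is an illustration only: the record is made up, the packing ignores qualities and non-ACGT symbols, and it is not a description of how SRA stores data.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack an ACGT-only sequence into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def bytes_per_base(n_bytes, n_bases):
    return n_bytes / n_bases

# Hypothetical 8-base FASTQ record: name, bases, separator, qualities.
record = "@r1\nACGTACGT\n+\nIIIIIIII\n"
bases = "ACGTACGT"
fastq_ratio = bytes_per_base(len(record), len(bases))
packed_ratio = bytes_per_base(len(pack_2bit(bases)), len(bases))
```

Plain FASTQ text costs more than two bytes per base once names and qualities are counted, so dropping below one byte per base requires compressing both the bases and the quality series.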
New Cycle
Decision Circle
• What data series to store
• Redundancy removal
• Normalization
• Lossy vs Lossless
• Compression tuning
Practical Application
• BAM and similar formats containing both raw reads and alignments become primary output of raw sequencing
• Increases the number of data series
• Compression By Reference reduces sizes of other data series
• New sets of tradeoffs
• New compression algorithms
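The idea behind compression by reference can be sketched as follows: an aligned read is stored as its position, its length, and only the bases that differ from the reference. This is a toy illustration that assumes ungapped alignments and short illustrative sequences; the function names are hypothetical and it is not the actual SRA or BAM implementation.

```python
def encode_read(read, reference, pos):
    """Store a read as (pos, length, mismatches); pos is 0-based on the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends beyond the reference")
    mismatches = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return pos, len(read), mismatches

def decode_read(reference, pos, length, mismatches):
    """Rebuild the read from the reference plus the stored differences."""
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)

reference = "GCTACGGCATTCAGGCATCAGGCATTAGCAG"
read = "GGCATTCAGGGATCAGGCATTAGC"          # differs from the reference at one base
encoded = encode_read(read, reference, 5)
restored = decode_read(reference, *encoded)
```

A read that matches the reference exactly costs only a position and a length, which is why this approach shrinks the sequence series so sharply and shifts the storage burden to qualities and other data series.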
Analyzing New Compression Method
Data from 1000 Genomes Project
BAMs from 1000 Genomes Project
• All available combinations of samples, platforms, and aligners
• 3114 files
• 27 Tb of disk space after compression
BAM treatment
• Names are dropped after restoring mates
• Only sequencing quality score is saved
• None of the non-redundant optional tags are preserved
Correction of BAM inconsistencies
• Occasional alignments to stretches of Ns on the reference and beyond the reference were converted to unaligned
• Different PCR duplicate flags for mates
Changes To SRA Run Browser
http://aws.amazon.com/datasets/4383
https://main.g2.bx.psu.edu/
http://www.genomespace.org/
Science 1 July 2011: Vol. 333 no. 6038 pp. 53-58. DOI: 10.1126/science.1207018
Li et al., 2011, Figure 1
Li et al., 2011, Fig. 2
Kleinman et al., 2012, Fig 1
Kleinman et al., 2012, Table 1
Lin et al., 2012, Fig 1
Lin et al., 2012, Fig 2
Pickrell et al., 2012, Fig 1
Li et al., 2012, Fig 1
Li et al., 2012, Fig 2
Li et al., 2012, Fig 3
Li et al., 2012, Fig 4