Polymorphism discovery in next-generation re-sequencing data
Gabor T. Marth
Boston College Biology Department
Illumina workshop, Washington, DC
November 19-20, 2007
Why do we care about genetic variations?
underlie phenotypic differences
cause inherited diseases
allow tracking ancestral human history
Variation types
Structural variations
SNPs
Mikkelsen et al. Nature 2007
Epigenetic variations
Sequence resources for polymorphism discovery
[Figure: bases per machine run (1 Mb to 1 Gb) versus read length (10 bp to 1,000 bp)]
ABI capillary sequencer
454 pyrosequencer (100 Mb in ~250 bp reads)
Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)
Resequencing-based SNP discovery
(i) base calling
(ii) micro-repeat analysis
(iii) read mapping (pair-wise alignment to genome reference)
(iv) read assembly
(v) SNP calling
(vi) SNP validation
(vii) data viewing, hypothesis generation
[Figure labels: REF, IND]
Talk topics
Tools for resequencing read analysis
• base calling
• resequenceability analysis
• read mapping / alignment / assembly
• SNP calling
• structural variation discovery
• read data visualization
Data mining projects
• SNP and short-INDEL discovery in C. elegans
• Complete mutational profiling in Pichia stipitis
…AND the cover on the box
Reference-guided read alignment
Reference-sequence guided assembly:
…they give you the pieces…
Some pieces are easier to place than others…
…pieces with unique features
pieces that look like each other…
Resequenceability: unique read placement
• Reads from repeats cannot be uniquely mapped
• RepeatMasker does not capture all repeats at the read length scale
• Near-perfect repeats can also be a problem because of sequencing errors and/or SNPs
[Figure: fraction of reads versus number of mismatches]
Finding micro-repeats is not easy
• Hash-based methods (fast but only work out to a couple of mismatches)
• Exact methods (very slow but find every repeat copy)
• Heuristic methods (fast but miss a fraction of the repeats)
Presenting repeats for downstream analysis
masking bases
masking fragments
• bases in repetitive fragments may still be resequenced by reads that represent other, unique fragments
• fragment-level repeat annotations therefore spare a higher fraction of the genome than base-level repeat masking
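The difference between base-level and fragment-level repeat handling can be illustrated with a small sketch. It is not the analysis used in the talk: it assumes exact-match repeats only, a single strand, and a fixed read length `k`, and the function names are hypothetical.

```python
from collections import Counter


def repeat_annotations(genome: str, k: int):
    """Return two per-base boolean lists for read length k.

    base_masked[j]    : base j lies inside at least one repeated k-mer
                        (what base-level masking would hide).
    covered_unique[j] : base j lies inside at least one unique k-mer
                        (still resequenceable with fragment-level annotation).
    """
    n_frag = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n_frag))
    base_masked = [False] * len(genome)
    covered_unique = [False] * len(genome)
    for i in range(n_frag):
        is_unique = counts[genome[i:i + k]] == 1
        for j in range(i, i + k):
            if is_unique:
                covered_unique[j] = True
            else:
                base_masked[j] = True
    return base_masked, covered_unique


def usable_fractions(genome: str, k: int):
    """Fraction of bases usable under base-level masking vs fragment-level annotation."""
    base_masked, covered_unique = repeat_annotations(genome, k)
    n = len(genome)
    base_level = sum(not m for m in base_masked) / n
    fragment_level = sum(covered_unique) / n
    return base_level, fragment_level


if __name__ == "__main__":
    # Toy sequence in which the 4-mer AAAA occurs twice.
    print(usable_fractions("AAAACGTAAAAT", 4))
```

In the toy sequence, bases next to the repeated fragment are lost under base-level masking but remain covered by neighbouring unique fragments, which is the effect the slide describes.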
Fragment level annotation is economical
[Figure: fraction of genome (0.00 to 0.40) versus number of mismatches allowed (0 to 2)]
Paired-end reads will not make the question go away
Read alignment
INDELs require gapped alignment
ABI/cap.
454/FLX
Illumina
454/GS20
sequences, often from different machine types, must be assembled together
billions of sequences must be aligned
MOSAIK: an anchored aligner / assembler
Step 1. initial short-hash scan for possible read locations
Step 2. evaluation of candidate locations with SW method
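The two steps can be sketched in a few lines. This is a generic seed-and-extend illustration, not MOSAIK's actual implementation: the hash size `k`, the scoring parameters, the padding, and all function names are assumptions chosen for the example.

```python
from collections import defaultdict


def build_hash_index(reference: str, k: int):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read: str, index, k: int):
    """Step 1: hash scan. Each seed hit implies a possible read start."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            starts.add(pos - offset)
    return starts


def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2):
    """Step 2: best Smith-Waterman local alignment score (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def align_read(read: str, reference: str, index, k: int, pad: int = 3):
    """Score every candidate location and return (best_score, start), or None."""
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = smith_waterman_score(read, reference[lo:hi])
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    ref = "GATTACAGGCTTAACCGGTTACGTAGCTAGGATC"
    idx = build_hash_index(ref, 5)
    print(align_read(ref[10:22], ref, idx, 5))
```

The hash scan keeps the expensive dynamic-programming step confined to a few short reference windows, which is what makes aligning very large read sets feasible.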
Michael Stromberg
MOSAIK – performance
• Solexa read alignments to C. elegans genome:
  100 million reads aligned in 95 minutes
  18,000 reads / second
• 454 reads to Pichia (yeast-size) genome
  GS20: 2,000 reads / second
  FLX: 300 reads / second
• Solexa read alignments to masked human genome:
  40 seconds for 1 million reads
  18,000 reads / second
  5.5 GB RAM used (more for longer initial hash sizes)
Polymorphism detection
• Goal: to discern true variation from sequencing error
sequencing error
polymorphism
Using base quality values
• use base quality values to help us decide if mismatches are true polymorphisms or sequencing errors
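As a reminder of what a base quality value encodes, the sketch below converts between PHRED-scaled quality values and error probabilities using the standard relation Q = -10 log10(P_error). The function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with PHRED quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """PHRED quality corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

A mismatch supported by a Q40 base is therefore far stronger evidence for a true polymorphism than one supported by a Q10 base.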
Bayesian detection algorithm
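$$
P(\mathrm{SNP}) =
\frac{\displaystyle \sum_{\text{all variable } S_1,\dots,S_N}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \,
      P_{\mathrm{Prior}}(S_1,\dots,S_N)}
     {\displaystyle \sum_{S_1 \in [A,C,G,T]} \cdots \sum_{S_N \in [A,C,G,T]}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \,
      P_{\mathrm{Prior}}(S_1,\dots,S_N)}
$$
[Figure: example base combinations over A, C, G, T, labelled "polymorphic combination" and "monomorphic combination"]
P(SNP) is the Bayesian posterior probability, i.e. the SNP score
Inputs: base call + base quality; expected polymorphism rate; base composition; depth of coverage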
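The posterior sum over base combinations can be sketched by brute-force enumeration. This is a simplified illustration rather than the PolyBayes code: it assumes clonal (haploid) sequences, a uniform base composition, an error probability split evenly over the three non-called bases, and a prior that is spread uniformly within the polymorphic and monomorphic classes. All names and the default `prior_snp` value are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def base_posteriors(call: str, qual: float):
    """P(S | R) for one read: the called base gets 1 - e, the others e / 3."""
    e = 10 ** (-qual / 10)
    return {b: (1 - e) if b == call else e / 3 for b in BASES}


def snp_probability(calls, quals, prior_snp: float = 0.001) -> float:
    """Posterior probability that the aligned bases are polymorphic.

    calls : one base call per sequence at the site (at least two)
    quals : matching PHRED quality values
    Cost grows as 4**N, so this is only practical for small N.
    """
    n = len(calls)
    if n < 2:
        raise ValueError("need at least two sequences to observe a polymorphism")
    post = [base_posteriors(c, q) for c, q in zip(calls, quals)]
    base_prior = 0.25            # assumed uniform base composition
    n_mono = 4                   # AAAA..., CCCC..., GGGG..., TTTT...
    n_poly = 4 ** n - n_mono     # every other combination is variable
    numerator = 0.0
    denominator = 0.0
    for combo in product(BASES, repeat=n):
        is_poly = len(set(combo)) > 1
        joint_prior = prior_snp / n_poly if is_poly else (1 - prior_snp) / n_mono
        term = joint_prior
        for p, s in zip(post, combo):
            term *= p[s] / base_prior
        denominator += term
        if is_poly:
            numerator += term
    return numerator / denominator


if __name__ == "__main__":
    print(snp_probability("AAGG", [40, 40, 40, 40]))
    print(snp_probability("AAAG", [40, 40, 40, 5]))
```

With these assumptions, a mismatch backed by high-quality bases yields a high score, while the same mismatch at low quality is explained away as sequencing error.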
The PolyBayes software
Marth et al. Nature Genetics 1999
http://bioinformatics.bc.edu/~marth/PolyBayes
Data visualization
1. aid software development: integration of trace data viewing, fast navigation, zooming/panning
2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays
3. promote hypothesis generation: integration of annotation tracks
Weichun Huang
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
Bristol, N2 strain (3 ½ machine runs)
• 5 runs (~120 million) Illumina reads from Wash. U. (Elaine Mardis)
• detect polymorphisms between the Pasadena and the Bristol strain
• aligned / assembled the reads (< 4 hours on 1 CPU)
• found 44,642 SNP candidates (2 hours on our 160-CPU cluster)
• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)
Polymorphism discovery in C. elegans
• SNP calling error rate very low:
  Validation rate = 97.8% (224/229)
  Conversion rate = 92.6% (224/242)
  Missed SNP rate = 3.75% (26/693)
[Figure panels: SNP, INS]
• INDEL candidates validate and convert at similar rates to SNPs:
  Validation rate = 89.3% (193/216)
  Conversion rate = 87.3% (193/221)
Mutational profiling: deep 454/Illumina data
• collaboration with Doug Smith at Agencourt
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• goal: determine which mutations caused this phenotype
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
Pichia stipitis reference sequence
Image from JGI web site
Mutational profiling: comparisons
Technology   Coverage   Nominal coverage   FP   FN   Total error
454/FLX      2 runs     12.9x              1    0    1
454/FLX      1 run      9.8x               6    1    7
Illumina     7 lanes    53.5x              0    0    0
Illumina     3 lanes    23.4x              0    0    0
Illumina     2 lanes    15.6x              2    0    2
Illumina     1 lane     7.6x               2    2    2
SOLiD        -          30.0x              0    0    0
SOLiD        -          20.0x              0    0    0
SOLiD        -          10.0x              0    0    0
SOLiD        -          8.0x               0    4    4
SOLiD        -          6.0x               0    6    6
Our software is available for testing
http://bioinformatics.bc.edu/marthlab/Beta_Release
Credits
http://bioinformatics.bc.edu/marthlab
Elaine Mardis (Washington University)
Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.) BC Presidential Scholarship (A.R.Q.)
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
Resequencing of diploid individual genomes
Ind. 1
Ind. 2
Ind. 3
Ind. 4
How do we find sequence variations?
• compare multiple sequences from the same genome region
Resequencing applications of next-gen sequencers
Emerging applications:
• DNA-protein interaction analysis (ChIP-Seq)
• epigenetic analysis (methylation profiling)
• novel transcript discovery
• quantification of gene expression
Polymorphism discovery:
• organismal SNP discovery
• complete mutational profiling
• individual human resequencing for SNP, INDEL and structural variation discovery
[Figure: reference genome versus resequenced individual, with labels DEL and SNP]
Task 5. Dealing with massive data volumes
Short-read format working group (Asim Siddiqui, UBC)
Assembly format working group
Boston College
http://assembly.bc.edu
• two connected working groups to define standard data formats
SNP calling in low 454 coverage
• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American D. melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?
DNA courtesy of Chuck Langley, UC Davis
• base-calling with PYROBAYES
• alignment to 120 Mb euchromatic reference sequence with MOSAIK
• SNP detection with POLYBAYES
SNP calling results
iso-1 reference
46-2 454 read
46-2 ABI reads (2 fwd + 2 rev)
• 92.9% validation rate (1,342 / 1,443)
• 2.0% missed SNP rate (25 / 1,247)
• 658,280 SNPs
• θ ≈ 5×10⁻³ (1 SNP / 200 bp)
Flow signal vs. actual base number
PYROBAYES: A 454 base caller program
• better correlation between assigned and measured quality values
• higher fraction of high-quality bases
Aaron Quinlan
454 errors: over and under-calls
Validation / score calibration
[Figure: confirmation rate [%] (0 to 100) versus SNP score [%], binned as 51-60, 61-70, 71-80, 81-90, 91-100]
Traditional SNP discovery data
capillary sequences(ABI)
clonal (haploid) sequences
The SNP score
polymorphism
specific variation
SNPs and short INDELs
Single-base substitutions (SNPs)
Insertion-deletion polymorphisms (INDELs)
Structural variations
Translocations: DNA exchange between different chromosomes
Inversion of long chromosomal tracts
“Simple” duplications and deletions
Multiple duplications (copy number changes)
Epigenetic variations
e.g. changes in methylation / chromatin structure that do not strictly involve base changes
Mikkelsen et al. Nature 2007
Task 1. Base calling / base accuracy estimation
• how do we translate the machine readouts to base calls?
• how do we estimate and represent sequencing errors (base quality values)?
454 pyrosequencing errors
454 pyrosequencer error profile
INDEL errors dominate
454 base quality values
• most bases have low quality values, not optimal for SNP discovery
• native 454 base quality values underestimate true accuracy
Illumina/Solexa base accuracy
• Most errors are substitutions, so PHRED quality values work
• Measured base quality is a function of base position within the read (i.e. there is a need for quality value calibration)
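One way to see the position dependence is to tabulate an empirical quality value for each read position from ungapped alignments. The sketch below is an assumption-laden illustration, not the calibration procedure used in the talk: it treats every mismatch to the reference as an error (true SNPs would inflate the estimate) and uses add-one smoothing so that positions without observed errors do not produce an infinite quality.

```python
import math


def empirical_quality_by_position(aligned_pairs):
    """Empirical PHRED-scaled quality for each read position.

    aligned_pairs : iterable of (read, reference_segment) strings of equal
                    length, aligned without gaps.
    """
    errors, totals = [], []
    for read, ref in aligned_pairs:
        for i, (r, t) in enumerate(zip(read, ref)):
            if i >= len(totals):
                errors.append(0)
                totals.append(0)
            totals[i] += 1
            if r != t:
                errors[i] += 1
    quals = []
    for e, n in zip(errors, totals):
        rate = (e + 1) / (n + 2)  # add-one smoothing (assumption of this sketch)
        quals.append(-10 * math.log10(rate))
    return quals


if __name__ == "__main__":
    pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("ACGC", "ACGT"), ("ACGT", "ACGT")]
    print(empirical_quality_by_position(pairs))
```

Comparing these empirical values with the quality values assigned by the base caller at each position shows where recalibration is needed.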
Illumina/Solexa base accuracy
• error rate grows as a function of base position within the read
• a large fraction of the reads contains 1 or 2 errors
Task 2. Read mapping and assembly
… is similar to a jigsaw puzzle…
… that you have to put together all by yourself
De novo assembly:
Structural variation discovery
• copy number variations (deletions & amplifications) can be detected from variations in the depth of read coverage
• structural rearrangements (inversions and translocations) require paired-end reads
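The depth-of-coverage idea for copy number variation can be sketched as follows. This is a minimal illustration under stated assumptions: reads are represented only by their start positions, depth is averaged in fixed windows, and windows are flagged with arbitrary ratio thresholds relative to the median. The function names and thresholds are hypothetical.

```python
from statistics import median


def window_depths(read_starts, read_length, genome_length, window):
    """Mean read depth in consecutive windows of the given size."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    means = []
    for i in range(0, genome_length, window):
        chunk = depth[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


def call_copy_number_windows(depths, low=0.5, high=1.5):
    """Flag windows whose depth ratio to the median suggests a copy number change."""
    baseline = median(depths)
    calls = []
    for i, d in enumerate(depths):
        ratio = d / baseline if baseline else 0.0
        if ratio <= low:
            calls.append((i, "deletion", ratio))
        elif ratio >= high:
            calls.append((i, "amplification", ratio))
    return calls


if __name__ == "__main__":
    print(call_copy_number_windows([30, 31, 29, 2, 30, 62, 30]))
```

Depth alone cannot reveal inversions or translocations, since those leave coverage unchanged; that is why the second bullet calls for paired-end reads.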
Task 4. Data visualization
make screenshot with annotation
Applications
1. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans)
2. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis)
(image from Nature Biotech.)