Informatics for next-generation sequence analysis – SNP calling
![Page 1: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/1.jpg)
Informatics for next-generation sequence analysis – SNP calling
Gabor T. Marth, Boston College Biology Department
PSB 2008, January 4-8, 2008
![Page 2: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/2.jpg)
Read length and throughput
[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]
• ABI capillary sequencer
• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)
• Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)
![Page 3: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/3.jpg)
Current and future application areas
• Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery
• De novo genome sequencing
• Short-read sequencing will be (at least) an alternative to micro-arrays for:
  • DNA-protein interaction analysis (ChIP-Seq)
  • novel transcript discovery
  • quantification of gene expression
  • epigenetic analysis (methylation profiling)
[Figure: reads aligned to a reference genome, with a deletion (DEL) and a SNP marked]
![Page 4: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/4.jpg)
Fundamental informatics challenges (I)
1. Interpreting machine readouts – base calling, base error estimation
2. Dealing with non-uniqueness in the genome: resequenceability
3. Alignment of billions of reads
![Page 5: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/5.jpg)
Informatics challenges (II)
4. SNP, short INDEL, and structural variation discovery
5. Data visualization
6. Data storage & management
![Page 6: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/6.jpg)
Resequencing-based SNP discovery
genome reference sequence
Read mapping
Read alignment
Paralog identification
SNP detection + inspection
![Page 7: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/7.jpg)
SNP calling workflow
• read alignment
• SNP detection
• visual checking
![Page 8: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/8.jpg)
Bayesian detection algorithm
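$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{
\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)
}{
\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
\dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})
}
$$
[Figure: grid of possible bases (A, C, G, T) for each read; each combination of bases across reads is either a polymorphic combination or a monomorphic combination]
The Bayesian posterior probability P(SNP) is the SNP score. Inputs:
• Base call + base quality
• Polymorphism rate (prior)
• Base composition
• Depth of coverage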
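The Bayesian SNP score on this slide can be sketched by brute-force enumeration of every base combination across the reads covering one site. This is a toy illustration, not the released software: it reads S_i as the true base under read i and R_i as the observed read data (our interpretation of the slide's symbols), and it assumes Phred-scaled base qualities, errors spread evenly over the three other bases, and a joint prior that gives the polymorphism rate evenly to all polymorphic combinations. The names `base_posterior` and `snp_probability` are hypothetical.

```python
from itertools import product
from math import prod

BASES = "ACGT"

def base_posterior(call, qual):
    """P(S_i | R_i): the called base gets 1 - e, the other three share e.
    Assumes a Phred-scaled quality, e = 10^(-Q/10)."""
    err = 10 ** (-qual / 10)
    return {b: (1 - err) if b == call else err / 3 for b in BASES}

def snp_probability(calls, quals, poly_rate=0.001, composition=None):
    """Posterior probability that the site is polymorphic.

    calls       -- one base call per read at this site
    quals       -- matching base quality values
    poly_rate   -- prior polymorphism rate (assumed value)
    composition -- per-base prior P_Prior(S_i); uniform if omitted
    """
    comp = composition or {b: 0.25 for b in BASES}
    n = len(calls)
    posts = [base_posterior(c, q) for c, q in zip(calls, quals)]
    n_poly = 4 ** n - 4          # number of polymorphic combinations
    poly_mass = mono_mass = 0.0
    for combo in product(BASES, repeat=n):
        # product of P(S_i | R_i) / P_Prior(S_i) over all reads
        ratio = prod(posts[i][s] / comp[s] for i, s in enumerate(combo))
        if len(set(combo)) == 1:
            # monomorphic combination: toy joint prior (1 - rate) * composition
            mono_mass += ratio * (1 - poly_rate) * comp[combo[0]]
        else:
            # polymorphic combination: toy joint prior, rate shared evenly
            poly_mass += ratio * poly_rate / n_poly
    return poly_mass / (poly_mass + mono_mass)

print(snp_probability("AAAA", [30, 30, 30, 30]))  # all reads agree
print(snp_probability("AACC", [40, 40, 40, 40]))  # two alleles, high quality
```

Enumerating 4^N combinations is only practical for a handful of reads; a real implementation would need a smarter summation and a population-genetic prior.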
![Page 9: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/9.jpg)
Base quality values for SNP calling
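• base quality values help us decide if mismatches are true polymorphisms or sequencing errors
• accurate base qualities are crucial, especially in lower coverage
$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{
\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)
}{
\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
\dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})
}
$$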
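To make the role of base qualities concrete, here is a minimal sketch that assumes the quality values are Phred-scaled (Q = -10 log10 of the error probability); the slide does not state the scale, and the function names are hypothetical. It shows how quickly "all mismatches are sequencing errors" becomes implausible as the qualities of the mismatching bases rise.

```python
from math import prod

def phred_to_error(q):
    """Error probability for a Phred-scaled quality value (assumed scale)."""
    return 10 ** (-q / 10)

def prob_all_errors(mismatch_quals):
    """Probability that every mismatching base call at a site is an
    independent sequencing error, given the quality of each call."""
    return prod(phred_to_error(q) for q in mismatch_quals)

# One mismatching read: the outcome rests entirely on a single quality value.
for q in (10, 20, 30):
    print(q, phred_to_error(q))

# Two mismatching reads: the error explanation becomes much less likely.
print(prob_all_errors([20, 30]))
```

With only one or two reads over a site, a miscalibrated quality value shifts the result by orders of magnitude, which is why accuracy matters most at low coverage.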
![Page 10: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/10.jpg)
Priors for specific resequencing scenarios
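$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{
\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)
}{
\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
\dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})
}
$$
[Figure: two resequencing scenarios at one site, where reads carry either AACGTTAGCATA or AACGTTCGCATA (A or C at position 7)]
• Strains (strain 1, strain 2, strain 3): all reads from one strain carry the same allele. Strain 1 has 3 reads with A, strain 2 has 2 reads with C, strain 3 has 3 reads with A.
• Individuals (individual 1, individual 2, individual 3): reads from one individual may carry both alleles. One individual has 2 reads with A and 2 with C, one has 4 reads with C, and one has 2 reads with A.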
![Page 11: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/11.jpg)
Consensus sequence generation (genotyping)
[Figure: consensus calls at the variable site (position 7 of AACGTTAGCATA / AACGTTCGCATA)]
• Strains (strain 1, strain 2, strain 3), one consensus base per strain:
  • 3 reads with A: call A
  • 2 reads with C: call C
  • 3 reads with A: call A
• Individuals (individual 1, individual 2, individual 3), one genotype per individual:
  • 2 reads with A and 2 reads with C: call A/C
  • 4 reads with C: call C/C
  • 2 reads with A: call A/A
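The consensus calls on this slide can be reproduced with a small maximum-likelihood sketch. It is an illustration under stated assumptions rather than the authors' method: Phred-scaled qualities, errors spread evenly over the other three bases, each read drawn from either allele of a diploid individual with probability 1/2, and no genotype prior. The function name `call_genotype` and the `diploid` flag are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod

BASES = "ACGT"

def call_genotype(calls, quals, diploid=True):
    """Most likely genotype at one site.

    diploid=True  -- individuals: 10 unordered genotypes such as A/C
    diploid=False -- strains: a single consensus base
    """
    if diploid:
        genotypes = list(combinations_with_replacement(BASES, 2))
    else:
        genotypes = [(b, b) for b in BASES]

    def read_lik(call, qual, allele):
        # P(observed call | true allele), assuming Phred-scaled quality
        err = 10 ** (-qual / 10)
        return 1 - err if call == allele else err / 3

    def likelihood(gt):
        # each read comes from either allele with probability 1/2
        return prod(
            0.5 * read_lik(c, q, gt[0]) + 0.5 * read_lik(c, q, gt[1])
            for c, q in zip(calls, quals)
        )

    best = max(genotypes, key=likelihood)
    return "/".join(best) if diploid else best[0]

# Bases observed at the variable site in the slide's example reads
print(call_genotype("AACC", [30] * 4))             # individual with both alleles
print(call_genotype("CCCC", [30] * 4))             # individual with only C reads
print(call_genotype("AAA", [30] * 3, diploid=False))  # a strain
```

Restricting strains to four states and individuals to ten genotypes is one simple way to encode the scenario-specific prior knowledge discussed on the previous slide.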
![Page 12: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/12.jpg)
SNP calling in Roche/454 pyrosequences
![Page 13: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/13.jpg)
SNP calling in low 454 coverage
• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American Drosophila melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?
DNA courtesy of Chuck Langley, UC Davis
[Figure: alignment of the iso-1 reference, a 46-2 454 read, and 46-2 ABI reads (2 fwd + 2 rev)]
• 92.9% validation rate (1,342 / 1,443)
• 2.0% missed SNP rate (25 / 1,247)
![Page 14: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/14.jpg)
SNP calling in Illumina/Solexa short-reads
![Page 15: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/15.jpg)
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
• SNP calling error rate very low:
  • Validation rate = 97.8% (224/229)
  • Conversion rate = 92.6% (224/242)
  • Missed SNP rate = 3.75% (26/693)
• INDEL candidates validate and convert at similar rates to SNPs:
  • Validation rate = 89.3% (193/216)
  • Conversion rate = 87.3% (193/221)
[Figure: read alignments showing a SNP and an insertion (INS)]
![Page 16: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/16.jpg)
SNP calling in AB/SOLiD color-space reads
[Figure: aligned copies of the sequence A C G G T C G T C G T G T G C G T, one of which carries a T→C substitution (A C G G T C G C C G T G T G C G T); the outcomes are labeled "No change", "SNP", and "Measurement error"]
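The distinction between a SNP and a measurement error in color space can be illustrated with the standard SOLiD two-base encoding, in which each color is determined by a pair of adjacent bases (equivalently, the XOR of 2-bit base codes A=0, C=1, G=2, T=3). The sketch below is a simplified illustration with hypothetical function names: a single base substitution changes two adjacent colors in a consistent way, while an isolated single-color difference is treated as a measurement error. Read ends and adjacent substitutions are not handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Two-base encoding: one color per adjacent base pair."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def classify(ref_colors, read_colors):
    """Label a read against the reference using its color differences."""
    diffs = [i for i, (x, y) in enumerate(zip(ref_colors, read_colors)) if x != y]
    if not diffs:
        return "no change"
    if len(diffs) == 1:
        return "measurement error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # a substitution preserves the XOR of the two colors that span it
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "SNP candidate"
    return "unclassified"

ref = "ACGGTCGTCGTGTGCGT"   # sequence shown on the slide
snp = "ACGGTCGCCGTGTGCGT"   # same sequence with the T→C substitution

ref_colors = to_colors(ref)
snp_colors = to_colors(snp)
print([i for i, (x, y) in enumerate(zip(ref_colors, snp_colors)) if x != y])  # → [6, 7]
print(classify(ref_colors, snp_colors))

# Simulated measurement error: alter a single color call
err_colors = list(ref_colors)
err_colors[3] = (err_colors[3] + 1) % 4
print(classify(ref_colors, err_colors))
```

Because a true substitution must change two neighboring colors consistently, an isolated color mismatch can be discounted, which is what gives color-space reads their built-in error check.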
![Page 17: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/17.jpg)
Mutational profiling: deep 454/Illumina/SOLiD data
• collaboration with Doug Smith at Agencourt
• Pichia stipitis converts xylose to ethanol (bio-fuel production)
• one mutagenized strain had especially high conversion efficiency
• goal: locate the mutations that caused this phenotype
• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads
• 14 true point mutations in the entire genome
• At about 15X nominal coverage, each technology can find every point mutation with essentially no false positives
Pichia stipitis reference sequence
Image from JGI web site
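As a rough sense of scale for "15X nominal coverage" of a 15 Mb genome, the usual relation coverage = (number of reads × read length) / genome size can be turned around to estimate read counts. The sketch below uses the read lengths quoted earlier in the talk (25-50 bp short reads, 100-250 bp 454 reads) as illustrative inputs; the function name is hypothetical and the figures are back-of-the-envelope estimates, not numbers reported by the authors.

```python
import math

def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required to reach a nominal coverage depth."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)

GENOME_SIZE = 15_000_000   # 15 Mb Pichia stipitis genome (from the slide)
COVERAGE = 15              # about 15X nominal coverage (from the slide)

for read_length in (25, 50, 100, 250):
    print(read_length, reads_needed(GENOME_SIZE, COVERAGE, read_length))
```

Nominal coverage ignores unmappable and low-quality reads, so the usable depth at any given site is typically lower.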
![Page 18: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/18.jpg)
Our software is available for testing
http://bioinformatics.bc.edu/marthlab/Beta_Release
![Page 19: Informatics for next-generation sequence analysis – SNP calling](https://reader035.vdocuments.site/reader035/viewer/2022062519/56814d79550346895dbad921/html5/thumbnails/19.jpg)
Credits
http://bioinformatics.bc.edu/marthlab
Elaine Mardis (Washington University), Andy Clark (Cornell University), Doug Smith (Agencourt)
Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)
Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby