![Page 1: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/1.jpg)
The Queensland Brain Institute |
Variant calling for disease association (2/2): Searching the haystack
April 11, 2023
[www.absolutefab.com]
![Page 2: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/2.jpg)
Quick recap: DNA sequence read mapping
• Sequencing -> FASTQ -> alignment to reference genome
• Resulting file type: BAM
• Visualized in Genome Viewer
• “What genomic regions were sequenced?”
[Figure labels: Quality Control, Projects, Fastq, Bam]
![Page 3: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/3.jpg)
Production Informatics and Bioinformatics
• Basic Production Informatics: produce raw sequence reads. Product: fastq. Time: 5 days.
• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs). Product: bam, vcf, … Time: 3 weeks.
• Bioinformatics Research: analyze the data; uncover the biological meaning. Product: paper. Time: >6 months.
Times are per one-flowcell project.
![Page 4: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/4.jpg)
Good mapping is crucial
• Mapping tools trade accuracy for speed, so the mapping is approximate.
• Identifying exactly where the reads map is the foundation for all subsequent analyses.
• The exact alignment of each read is especially important for variant calling.
by neilalderney123
![Page 5: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/5.jpg)
Mapping challenges
• Incorrect mapping
  – Among 3 billion bp (human), a 100-mer can match a location by chance
• Multi-mappers
  – The genome has non-unique regions (e.g. repeats), so one read can map to multiple sites
• Duplicates
  – PCR duplicates can introduce artifacts.
Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216
Reference: ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Reads: TTACACGTACA, TACACGTACAC, ACACGTACACT, CACGTACTCTC (×4)
Streptococcus suis (squares) Mus musculus (triangles)
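A minimal sketch of why duplicates matter, using the reference and reads shown above. The brute-force `best_offset` mapper and the collapse-by-sequence deduplication are simplifications for illustration only; real tools mark duplicates by mapping coordinates, not by read sequence.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"
READS = [
    "TTACACGTACA",
    "TACACGTACAC",
    "ACACGTACACT",
    "CACGTACTCTC",
    "CACGTACTCTC",
    "CACGTACTCTC",
    "CACGTACTCTC",
]


def best_offset(read, reference):
    """Return the offset with the fewest mismatches (toy brute-force mapper)."""
    def mismatches(offset):
        window = reference[offset:offset + len(read)]
        return sum(a != b for a, b in zip(read, window))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def pileup(reads, reference):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in reference]
    for read in reads:
        start = best_offset(read, reference)
        for i, base in enumerate(read):
            columns[start + i][base] += 1
    return columns


def alt_fraction(columns, reference, pos):
    """Fraction of reads at pos that disagree with the reference base."""
    depth = sum(columns[pos].values())
    if depth == 0:
        return 0.0
    return 1 - columns[pos][reference[pos]] / depth


with_dups = pileup(READS, REFERENCE)
without_dups = pileup(sorted(set(READS)), REFERENCE)  # collapse identical reads

# Position 16 (0-based) is where the repeated read differs from the reference.
print(alt_fraction(with_dups, REFERENCE, 16))     # 4 of 7 reads show T
print(alt_fraction(without_dups, REFERENCE, 16))  # 1 of 4 reads shows T
```

With the four copies counted separately, the mismatch looks like a well-supported heterozygous site; once collapsed, it is supported by a single read.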
![Page 6: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/6.jpg)
Methods for ensuring a good alignment
• Biologically: using paired-end reads to increase coverage
• Bioinformatically:
  – Local realignment
  – Base quality score recalibration
[Figure labels: ~200 bp; ?; Repeat region]
![Page 7: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/7.jpg)
Local Realignment (GATK)
• Local realignment of all reads at a specific location simultaneously, to minimize mismatches to the reference genome
• Reduces erroneous SNPs and refines the location of indels
[Figure panels: original, realigned]
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
QBI data
![Page 8: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/8.jpg)
Quality score recalibration (GATK)
• PHRED scores are predicted, not measured
• Looking at all reads at a specific location allows a better estimate of the base quality score.
  – Exclude all known dbSNP sites
  – Assume all other mismatches are sequencing errors
  – Compute a new calibration table based on mismatch rates per position on the read
• Important for variant calling
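The three steps above can be sketched as follows. This is a simplified illustration, not GATK's implementation: it uses read position as the only covariate, whereas real recalibration also conditions on factors such as the reported quality and sequence context. The input tuple layout, the function name, and the add-one smoothing are assumptions made for this example.

```python
import math
from collections import defaultdict


def empirical_quality_by_cycle(observations, known_sites):
    """Estimate a Phred score per read position from mismatch rates.

    observations: iterable of (genome_pos, read_cycle, is_mismatch) tuples.
    known_sites: set of genome positions to skip (known polymorphisms).
    """
    mismatches = defaultdict(int)
    totals = defaultdict(int)
    for genome_pos, cycle, is_mismatch in observations:
        if genome_pos in known_sites:
            continue  # a mismatch here may be real variation, not an error
        totals[cycle] += 1
        mismatches[cycle] += int(is_mismatch)
    table = {}
    for cycle, n in totals.items():
        # Add-one smoothing keeps the score finite when no mismatch was seen.
        error_rate = (mismatches[cycle] + 1) / (n + 2)
        table[cycle] = -10 * math.log10(error_rate)
    return table


# Toy data (hypothetical): cycle 0 has 98 clean bases plus one mismatch at a
# known site; cycle 1 has 8 bases, one of which mismatches the reference.
observations = (
    [(pos, 0, False) for pos in range(98)]
    + [(1000, 0, True)]
    + [(pos, 1, pos == 200) for pos in range(200, 208)]
)
table = empirical_quality_by_cycle(observations, known_sites={1000})
print(table)  # cycle 0 is about Q20, cycle 1 about Q7
```

The mismatch at the known site is ignored, so it does not lower the estimated quality for cycle 0.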
Thomas Keane 9th European Conference on Computational Biology 26th September, 2010
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
![Page 9: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/9.jpg)
Recalibration of quality score
“All bases are called with Q25. In reality not all are that good: bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20” (GATK)
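The numbers in this example follow from the standard Phred definition, where $p$ is the probability that a base call is wrong:

$$Q = -10\log_{10} p \quad\Longleftrightarrow\quad p = 10^{-Q/10}$$

A mismatch rate of 1 in 100 gives $Q = -10\log_{10}(0.01) = 20$, whereas a reported Q25 would imply $p = 10^{-2.5} \approx 0.0032$, roughly 1 error in 316 bases.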
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
![Page 10: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/10.jpg)
Variant calling methods
• More than 15 different algorithms
• Three categories
  – Allele counting
  – Probabilistic methods, e.g. Bayesian models
    • Quantify statistical uncertainty
    • Assign priors, e.g. by taking the observed allele frequency of multiple samples into account
  – Incorporating linkage disequilibrium (LD)
    • Especially helpful for low coverage and common variants
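To illustrate the probabilistic category, here is a minimal Bayesian genotype sketch for a single site. It assumes independent reads, one alternate allele, a uniform per-base error rate, and a flat prior by default; these are simplifying assumptions for illustration, and published callers use per-base qualities and more informative priors.

```python
import math

GENOTYPES = ("0/0", "0/1", "1/1")


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype at one site."""
    if priors is None:
        priors = {g: 1 / 3 for g in GENOTYPES}  # flat prior (assumption)
    # Probability that a single read shows the ALT allele under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    log_post = {}
    for g in GENOTYPES:
        # Binomial log-likelihood; the binomial coefficient is the same for
        # every genotype and cancels in the normalisation.
        log_lik = (alt_count * math.log(p_alt[g])
                   + ref_count * math.log(1 - p_alt[g]))
        log_post[g] = math.log(priors[g]) + log_lik
    # Normalise in log space to avoid underflow at high depth.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(ref_count, alt_count, **kwargs):
    """Return the genotype with the highest posterior probability."""
    post = genotype_posteriors(ref_count, alt_count, **kwargs)
    return max(post, key=post.get)


print(call_genotype(20, 0))   # all reads support REF
print(call_genotype(52, 48))  # roughly balanced alleles
print(call_genotype(0, 30))   # all reads support ALT
```

Replacing the flat prior with one derived from population allele frequencies is the "assign priors" step mentioned above; plain allele counting corresponds to ignoring both the prior and the error model.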
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/Software/list
[Figure: Ref: A; Ind1: G/G; Ind2: A/G (SNP variant)]
![Page 11: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/11.jpg)
VCF format
• Individual statistics
  – GT: genotype, e.g. 0/1
  – AD: number of REF/ALT reads seen, e.g. 173 T, 141 G
  – DP: depth (MAPQ > 17), e.g. 282
  – GQ: genotype quality, e.g. 99
  – PL: genotype likelihoods, e.g. 0/0: 10^-25.5 = unlikely, 0/1: 10^0 = likely, 1/1: 10^-25.5 = unlikely
• Location statistics, e.g.
  – Strand bias
  – How many reads have a deletion at this site
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
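As a rough sketch, the per-sample fields of the first record above can be unpacked with plain string handling. The helper names are made up for this example, the INFO column is replaced by a "." placeholder because the slide elides it, and for real files a dedicated VCF parser is the safer choice.

```python
def parse_sample(format_field, sample_field):
    """Pair the keys in the FORMAT column with one sample's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def pl_to_likelihoods(pl_field):
    """Convert Phred-scaled PL values to relative likelihoods 10^(-PL/10)."""
    return [10 ** (-int(pl) / 10) for pl in pl_field.split(",")]


# First record from the slide; "." stands in for the elided [ANNOTATIONS].
line = ("chr1\t873762\t.\tT\tG\t5231.78\tPASS\t.\t"
        "GT:AD:DP:GQ:PL\t0/1:173,141:282:99:255,0,255")
fields = line.split("\t")
chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
sample = parse_sample(fields[8], fields[9])
ref_depth, alt_depth = (int(n) for n in sample["AD"].split(","))
likelihoods = pl_to_likelihoods(sample["PL"])  # order: 0/0, 0/1, 1/1

print(chrom, pos, ref, alt, sample["GT"], ref_depth, alt_depth)
print(likelihoods)
```

The PL value of 0 marks the most likely genotype (relative likelihood 1), and a PL of 255 corresponds to 10^-25.5.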
![Page 12: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/12.jpg)
When to call a variant ???
• REF: 77%, ALT: 23%
• Het: REF: 50%, ALT: 50%
• Hom: REF: 0%, ALT: 100%
QBI data
![Page 13: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/13.jpg)
Hard Filtering
• Reducing false positives by e.g. requiring
  – Sufficient depth
  – Variant to be in >30% of reads
  – High quality
  – Strand balance
  – …
• Subjective and dangerous in this high-dimensional search space
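A hard filter of this kind might look like the sketch below. Only the >30% allele fraction comes from the list above; the depth, quality, and strand thresholds are hypothetical defaults chosen for illustration, which is exactly the subjectivity the second bullet warns about.

```python
def passes_hard_filter(ref_fwd, ref_rev, alt_fwd, alt_rev, qual,
                       min_depth=10, min_alt_fraction=0.30,
                       min_qual=30.0, min_strand_fraction=0.10):
    """Apply fixed thresholds to one candidate variant.

    Read counts are split by allele (ref/alt) and strand (fwd/rev).
    All thresholds except min_alt_fraction are illustrative assumptions.
    """
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    alt = alt_fwd + alt_rev
    if depth < min_depth or alt == 0:
        return False  # insufficient depth, or no ALT support at all
    if alt / depth <= min_alt_fraction:
        return False  # variant must be in more than 30% of reads
    if qual < min_qual:
        return False  # low call quality
    # Strand balance: ALT reads should come from both strands.
    minor_strand = min(alt_fwd, alt_rev) / alt
    return minor_strand >= min_strand_fraction


print(passes_hard_filter(40, 45, 30, 35, qual=500.0))  # balanced, deep
print(passes_hard_filter(30, 30, 20, 0, qual=500.0))   # ALT on one strand only
print(passes_hard_filter(40, 37, 12, 11, qual=200.0))  # ALT in 23% of reads
```

Each threshold cuts the candidates along one axis independently, so a true variant that is slightly unusual in a single annotation is discarded no matter how well it scores on the others.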
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
Strand Bias
QBI data
![Page 14: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/14.jpg)
Gaussian mixture model
• Train on trusted variants and require new variants to fall in the same region of the annotation space
• Potential problem: overfitting and bias toward the features of known SNPs!
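The idea can be sketched in one dimension with a small EM fit written in plain Python. This is an illustration only: production tools fit mixtures over several annotations at once, and the training values below (a hypothetical allele-balance annotation for trusted variants) are invented for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(values, n_components=2, n_iter=50, min_var=1e-3):
    """Fit a 1-D Gaussian mixture by EM; returns (weight, mean, var) triples."""
    ordered = sorted(values)
    n = len(values)
    # Deterministic start: spread the initial means over the data quantiles.
    means = [ordered[(2 * k + 1) * n // (2 * n_components)]
             for k in range(n_components)]
    centre = sum(values) / n
    spread = max(min_var, sum((v - centre) ** 2 for v in values) / n)
    variances = [spread] * n_components
    weights = [1 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for v in values:
            dens = [w * normal_pdf(v, m, s)
                    for w, m, s in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2
                      for r, v in zip(resp, values)) / nk
            variances[k] = max(min_var, var)  # floor avoids collapse
    return list(zip(weights, means, variances))


def log_likelihood(x, model):
    """Log density of x under the fitted mixture."""
    density = sum(w * normal_pdf(x, m, s) for w, m, s in model)
    return math.log(max(density, 1e-300))


# Hypothetical allele balance of trusted variants: hets near 0.5, homs near 1.
trusted = [0.45, 0.48, 0.50, 0.52, 0.55, 0.47, 0.53,
           0.95, 0.97, 1.00, 0.98, 0.99, 0.96]
model = fit_gmm_1d(trusted)

for candidate in (0.50, 0.23):
    print(candidate, round(log_likelihood(candidate, model), 2))
```

A candidate at 0.50 scores far higher than one at 0.23, which would be filtered out. The same mechanism also shows the overfitting risk: a genuine variant that simply looks unlike the training set receives a low score.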
![Page 15: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/15.jpg)
Indel calling
• The first local realignment might not be sufficient to confidently determine the beginning and end of indels
• Dindel algorithm
  – Local realignment for every indel candidate
Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.
![Page 16: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/16.jpg)
Recap
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
![Page 17: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/17.jpg)
Outcome: How many variants will I find?
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
HiSeq: whole genome; mean coverage 60; 101PE; (NA12878)
Exome: Agilent capture; mean coverage 20; 76/101PE; (NA12878)
![Page 18: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/18.jpg)
Three things to remember
1. Getting the mapping right is critical
2. Variant calling is not merely counting the differences
3. Just listing the variants does not tell you anything biologically relevant.
by Яick Harris
Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790
![Page 19: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/19.jpg)
Next week:
Abstract: This seminar aims to answer the question of what to make of the identified variants, specifically how to evaluate their quality, prioritize them, and annotate them functionally.
![Page 20: Variant (SNPs/Indels) calling in DNA sequences, Part 2](https://reader035.vdocuments.site/reader035/viewer/2022062513/554ea772b4c9055f7b8b4b69/html5/thumbnails/20.jpg)
Walk-in-clinic