Variant (SNPs/indels) calling in DNA sequences, part 2
Post on 10-May-2015
The Queensland Brain Institute |
Variant calling for disease association (2/2)
Searching the haystack
April 11, 2023
[www.absolutefab.com]
Quick recap: DNA sequence read mapping
• Sequencing->FASTQ->alignment to reference genome
• Resulting file type: BAM
• Visualized in Genome Viewer
• “What genomic regions were sequenced?”
[Figure labels: Quality Control; Projects; Fastq; Bam]
Product        Time
fastq          5 days
bam, vcf,…     3 weeks
paper          >6 months
(Per one-flowcell project)
Production Informatics and Bioinformatics
• Basic Production Informatics: produce raw sequence reads
• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)
• Bioinformatics Research: analyze the data; uncover the biological meaning
Good mapping is crucial
• Mapping tools compromise accuracy for speed: approximate mapping.
• Identifying exactly where the reads map is the foundation for all subsequent analyses.
• The exact alignment of each read is especially important for variant calling.
by neilalderney123
Mapping challenges
• Incorrect mapping
– Amongst 3 billion bp (human), a 100-mer can occur by chance
• Multi-mappers
– The genome has non-unique regions (e.g. repeats), so one read can map to multiple sites
• Duplicates
– PCR duplicates can introduce artifacts.
Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216
Reference: ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Reads: TTACACGTACA, TACACGTACAC, ACACGTACACT, CACGTACTCTC (×4)
Streptococcus suis (squares) Mus musculus (triangles)
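The duplicate problem can be made concrete with the reference and reads shown on this slide. The following is a minimal Python sketch, not the method of any particular tool. Two things in it are my own assumptions: the 0-based start offsets, obtained by sliding each read along the reference until it fits, and the rule that reads count as duplicates when they share the same (start, sequence) pair.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"

# (start, sequence) pairs; the start offsets are assumed, not given on the slide.
READS = [
    (6, "TTACACGTACA"),
    (7, "TACACGTACAC"),
    (8, "ACACGTACACT"),
] + [(9, "CACGTACTCTC")] * 4  # four identical reads, as on the slide


def pileup(reads):
    """Return {position: Counter of observed bases} for all covered positions."""
    columns = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1
    return columns


def alt_fraction(reference, reads, position):
    """Fraction of reads at `position` that disagree with the reference base."""
    column = pileup(reads)[position]
    depth = sum(column.values())
    return (depth - column[reference[position]]) / depth


def collapse_duplicates(reads):
    """Keep one copy of every (start, sequence) pair (simplified duplicate rule)."""
    return sorted(set(reads))


print(alt_fraction(REFERENCE, READS, 16))                       # duplicates counted
print(alt_fraction(REFERENCE, collapse_duplicates(READS), 16))  # duplicates collapsed
```

With the four copies counted separately, the mismatch at position 16 is carried by 4 of 7 reads and looks like a plausible heterozygous SNP. After collapsing it is carried by 1 of 4 reads.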
Methods for ensuring a good alignment
• Biological: using paired-end reads to increase coverage
• Bioinformatic:
– Local realignment
– Base pair quality score recalibration
[Figure labels: ~200 bp; ?; Repeat region]
Local Realignment (GATK)
• Local realignment of all reads at a specific location simultaneously to minimize mismatches to the reference genome
• Reduces erroneous SNPs and refines the location of indels
original
realigned
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
QBI data
Quality score recalibration (GATK)
• PHRED scores are predicted
• Looking at all reads at a specific location allows a better estimate of the base pair quality score.
– Exclude all known dbSNP sites
– Assume all other mismatches are sequencing errors
– Compute a new calibration table based on mismatch rates per position on the read
• Important for variant calling
Thomas Keane 9th European Conference on Computational Biology 26th September, 2010
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Recalibration of quality score
All bases are called with Q25
In reality not all are that good: bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20 (GATK)
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
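The Q25 versus Q20 example follows from the Phred definition Q = -10 · log10(p), where p is the error probability. The sketch below shows that relation and a toy recalibration table built as the previous slide describes: skip known variant sites, treat every other mismatch as a sequencing error, and tabulate mismatch rates. The function names, the observation tuples and the floor of one mismatch (to avoid log10(0)) are my own assumptions, not GATK's implementation.

```python
import math
from collections import defaultdict


def phred_from_error_rate(p):
    """Phred quality for an error probability p."""
    return -10.0 * math.log10(p)


def recalibration_table(observations, known_sites=frozenset()):
    """Empirical quality per (reported quality, read cycle).

    observations: iterable of (site, reported_q, cycle, is_mismatch).
    Sites listed in known_sites (e.g. dbSNP positions) are skipped.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total bases]
    for site, reported_q, cycle, is_mismatch in observations:
        if site in known_sites:
            continue
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        # Crude floor of one mismatch so an error-free bin stays finite (assumption).
        table[key] = phred_from_error_rate(max(mismatches, 1) / total)
    return table


# 100 bases reported as Q25 at read cycle 10, one of which mismatches the
# reference, plus one mismatch at a known variant site that must be ignored.
observations = [("chr1:%d" % i, 25, 10, i == 0) for i in range(100)]
observations.append(("chr1:999", 25, 10, True))
table = recalibration_table(observations, known_sites={"chr1:999"})
print(table[(25, 10)])
```

One mismatch in 100 bases gives an empirical quality of about Q20 for bases that were reported as Q25, which is the situation in the quote above.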
Variant calling methods
• > 15 different algorithms
• Three categories
– Allele counting
– Probabilistic methods, e.g. Bayesian model
  • To quantify statistical uncertainty
  • Assign priors, e.g. by taking the observed allele frequency of multiple samples into account
– Incorporating linkage disequilibrium (LD)
  • Specifically helpful for low coverage and common variants
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.
http://seqanswers.com/wiki/Software/list
Ref: A
Ind1: G/G
Ind2: A/G
SNP variant
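To illustrate the probabilistic category, here is a minimal Bayesian genotype sketch for one biallelic site. It is not the model of any specific caller. It assumes independent reads, a single fixed per-base error rate, and (optionally) Hardy-Weinberg priors derived from a population allele frequency; all names and default values are hypothetical.

```python
import math


def hardy_weinberg_priors(alt_freq):
    """Genotype priors from an ALT allele frequency (assumes Hardy-Weinberg)."""
    ref_freq = 1.0 - alt_freq
    return {"0/0": ref_freq ** 2, "0/1": 2 * ref_freq * alt_freq, "1/1": alt_freq ** 2}


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype given allele counts."""
    # Probability that a single read shows the ALT allele under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    if priors is None:
        priors = {g: 1.0 / 3.0 for g in p_alt}  # flat prior
    log_post = {}
    for genotype, p in p_alt.items():
        # Binomial log-likelihood; the binomial coefficient cancels on normalizing.
        log_lik = alt_count * math.log(p) + ref_count * math.log(1.0 - p)
        log_post[genotype] = log_lik + math.log(priors[genotype])
    top = max(log_post.values())
    unnormalized = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


posteriors = genotype_posteriors(ref_count=1, alt_count=3,
                                 priors=hardy_weinberg_priors(0.001))
print(max(posteriors, key=posteriors.get), posteriors)
```

Unlike plain allele counting, the result is a probability for each genotype, so low-coverage sites come out as uncertain calls.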
VCF format
• Individual statistics
– GT: genotype (0/1)
– AD: total number of REF/ALT seen (173 T, 141 G)
– DP: depth, MAPQ > 17 (282)
– GQ: genotype quality (99)
– PL: genotype likelihood (0/0: 10^-25.5 = unlikely, 0/1: 10^0 = likely, and 1/1: 10^-25.5 = unlikely)
• Location statistics, e.g.
– Strand bias
– How many reads have a deletion at this site
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
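The per-sample fields can be unpacked with a few lines of Python. This is a rough sketch for the excerpt on this slide, not a full VCF parser. It splits on whitespace because the excerpt lost its tabs (real VCF files are tab-separated), and it keeps the `[ANNOTATIONS]` placeholder as an opaque INFO string. The function names are my own.

```python
LINE = ("chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] "
        "GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255")


def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed columns and per-sample fields."""
    fields = line.split()
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]
    keys = fmt.split(":")
    samples = {}
    for name, column in zip(sample_names, fields[9:]):
        samples[name] = dict(zip(keys, column.split(":")))
    return {"CHROM": chrom, "POS": int(pos), "ID": var_id, "REF": ref,
            "ALT": alt, "QUAL": float(qual), "FILTER": filt, "INFO": info,
            "samples": samples}


def pl_to_relative_likelihoods(pl_field):
    """Convert Phred-scaled PL values to relative likelihoods, 10^(-PL/10)."""
    return [10 ** (-int(pl) / 10) for pl in pl_field.split(",")]


record = parse_vcf_record(LINE, ["NA12878"])
sample = record["samples"]["NA12878"]
ref_count, alt_count = (int(n) for n in sample["AD"].split(","))
print(sample["GT"], ref_count, alt_count)
print(pl_to_relative_likelihoods(sample["PL"]))
```

A PL of 255 corresponds to a relative likelihood of 10^-25.5 and a PL of 0 to a relative likelihood of 1, which is how the "unlikely" and "likely" labels above arise.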
When to call a variant ???
REF: 77% ALT: 23%
Het REF: 50% ALT: 50%
Hom REF: 0% ALT: 100%
QBI data QBI data
Hard Filtering
• Reducing false positives by e.g. requiring
– Sufficient depth
– Variant to be in >30% of reads
– High quality
– Strand balance
– …
• Subjective and dangerous in this high-dimensional search space
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008). Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
Strand Bias
QBI data
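A hard filter of the kind listed on this slide might look like the sketch below. Only the 30% ALT fraction comes from the slide. The depth, quality and strand thresholds, the filter labels and the function signature are placeholders I chose for illustration, which is exactly the subjectivity the slide warns about.

```python
def hard_filter(ref_count, alt_count, qual, alt_forward=None, alt_reverse=None,
                *, min_depth=10, min_alt_fraction=0.30, min_qual=50.0,
                min_strand_fraction=0.10):
    """Return the list of failed filters (an empty list means the call passes)."""
    reasons = []
    depth = ref_count + alt_count
    if depth < min_depth:
        reasons.append("LowDepth")
    if depth == 0 or alt_count / depth <= min_alt_fraction:
        reasons.append("LowAltFraction")
    if qual < min_qual:
        reasons.append("LowQual")
    # Strand balance is only checked when per-strand ALT counts are supplied.
    if alt_forward is not None and alt_reverse is not None:
        alt_total = alt_forward + alt_reverse
        if alt_total > 0 and min(alt_forward, alt_reverse) / alt_total < min_strand_fraction:
            reasons.append("StrandBias")
    return reasons


# (REF count, ALT count, QUAL) taken from AD and QUAL of three records in the
# VCF excerpt from the earlier slide.
for ref_count, alt_count, qual in [(173, 141, 5231.78), (1, 3, 71.77), (14, 4, 29.84)]:
    print(ref_count, alt_count, qual, hard_filter(ref_count, alt_count, qual))
```

Under these assumed thresholds the first record passes, the second fails on depth alone, and the third fails on both ALT fraction and quality. Moving any threshold slightly changes which calls survive.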
Gaussian mixture model
• Train on trusted variants and require the new variants to live in the same hyperspace
• Potential problem: overfitting and bias toward the features of known SNPs!
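The idea of training on trusted variants and scoring new ones can be sketched with a deliberately tiny model: a two-component, one-dimensional Gaussian mixture fitted by expectation-maximization in plain Python. Production tools model several annotations jointly; the single annotation, the example values and all function names here are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, n_iter=100, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture to `values` with EM."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # Initialize the two means from the lower and upper half of the data.
    means = [sum(data[:half]) / half, sum(data[half:]) / (n - half)]
    grand_mean = sum(data) / n
    start_var = max(sum((v - grand_mean) ** 2 for v in data) / n, min_var)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for v in data:
            dens = [w * normal_pdf(v, m, s) for w, m, s in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, data)) / nk
            spread = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, data)) / nk
            variances[k] = max(spread, min_var)
    return weights, means, variances


def log_likelihood(x, model):
    """Log density of x under the fitted mixture (higher = more 'trusted-like')."""
    weights, means, variances = model
    density = sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, variances))
    return math.log(max(density, 1e-300))


# Hypothetical annotation values (e.g. ALT allele fraction) of trusted variants.
trusted = [0.45, 0.48, 0.50, 0.52, 0.55, 0.95, 0.97, 0.98, 0.99, 1.00]
model = fit_gmm_1d(trusted)
for candidate in (0.50, 0.98, 0.05):
    print(candidate, log_likelihood(candidate, model))
```

A candidate whose annotation lies far from both clusters of trusted variants gets a low score. The same property produces the bias noted above: a true variant that does not resemble the known ones is penalized as well.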
Indel calling
• A first local realignment might not be sufficient to confidently determine the start and end of indels
• Dindel algorithm
– Local realignment for every indel candidate
Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.
Recap
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
Outcome: How many variants will I find ?
DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889
HiSeq: whole genome; mean coverage 60; 101PE; (NA12878)
Exome: Agilent capture; mean coverage 20; 76/101PE; (NA12878)
Three things to remember
1. Getting the mapping right is critical.
2. Variant calling is not merely counting the differences.
3. Just listing the variants does not tell you anything biologically relevant.
by Яick Harris
Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790
Next week:
Abstract: This seminar aims at answering the question of what to make of the identified variants, specifically how to evaluate the quality, prioritize and functionally annotate the variants.
Walk-in-clinic