variant calling - université de lille · variant callers are not concordant mean single-nucleotide...

Variant Calling


Post on 19-Jan-2021


Page 1: Variant Calling - Université de Lille · Variant callers are not concordant Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling

Variant Calling


Ready for variant calling!!


Discover “variants” relative to a reference genome

From GATK Introduction to Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops

International Solanaceae Genomics Project


Different types of variants

From GATK Introduction to Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


Variant callers are not concordant

Mean single-nucleotide variants (SNV) concordance over 15 exomes between five alignment and variant-calling pipelines

O'Rawe et al., Genome medicine 2013


Samtools mpileup:

- Simplest way to visualize SNP/indel calling and alignment.

- Piles up reads on each position

- Summarizes the base calls of aligned reads to a reference sequence

SAMtools mpileup


SAMtools mpileup format specification

chr2 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&

chr2 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+

chr2 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6

chr2 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<

chr2 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<

chr2 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<

chr2 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<

Columns, left to right: chromosome, 1-based coordinate, reference base, number of reads covering the site, read bases, base qualities

read base code :
`.' match to the reference base on the forward strand
`,' match to the reference base on the reverse strand
`ACGTN' mismatch on the forward strand
`acgtn' mismatch on the reverse strand
`\+[0-9]+[ACGTNacgtn]+' insertion between this reference position and the next reference position


chr3 156 A 11 .$......+2AG.+2AG.+2AGGG <975;:<<<<<

chr5 200 A 20 ,,,,,..,.-4CACC.-4CACC....,.,,.^~. ==<<<<<<<<<<<::<;2<<

SAMtools mpileup format specification

Columns, left to right: chromosome, 1-based coordinate, reference base, number of reads covering the site, read bases, base qualities

read base code :
`.' match to the reference base on the forward strand
`,' match to the reference base on the reverse strand
`ACGTN' mismatch on the forward strand
`acgtn' mismatch on the reverse strand
`\+[0-9]+[ACGTNacgtn]+' insertion between this reference position and the next reference position
`-[0-9]+[ACGTNacgtn]+' deletion from the reference (as in the chr5 example above)
`^' marks the start of a read segment (the character that follows encodes the mapping quality)
`$' marks the end of a read segment
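The read base codes can be decoded with a small parser. The following is a minimal Python sketch, not part of samtools: the function name `summarize_pileup_bases` and the counter keys are made up for illustration, and it assumes a well-formed read-bases string. Other pileup symbols (such as `*`, `<` and `>`) are simply ignored here.

```python
import re
from collections import Counter


def summarize_pileup_bases(bases):
    """Count matches, mismatches and indels in a pileup read-bases string.

    Hypothetical helper for illustration; assumes a well-formed string.
    """
    counts = Counter()
    indels = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Start of a read segment: skip '^' and the mapping-quality character.
            i += 2
            continue
        if c == "$":
            # End of a read segment.
            i += 1
            continue
        if c in "+-":
            # Indel: sign, length, then that many bases.
            length_text = re.match(r"\d+", bases[i + 1:]).group()
            start = i + 1 + len(length_text)
            seq = bases[start:start + int(length_text)]
            indels[c + seq.upper()] += 1
            i = start + int(length_text)
            continue
        if c == ".":
            counts["ref_fwd"] += 1
        elif c == ",":
            counts["ref_rev"] += 1
        elif c in "ACGTN":
            counts["alt_fwd"] += 1
        elif c in "acgtn":
            counts["alt_rev"] += 1
        i += 1
    return counts, indels


# Read bases of the chr3 and chr5 example lines shown above.
for bases in (".$......+2AG.+2AG.+2AGGG", ",,,,,..,.-4CACC.-4CACC....,.,,.^~."):
    counts, indels = summarize_pileup_bases(bases)
    print(dict(counts), dict(indels))
```

The match and mismatch counts should add up to the "number of reads" column, which is a quick sanity check on any pileup line.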


SAMtools mpileup

BUT

- is not a real variant caller

- must be combined with bcftools to perform the variant calling

> samtools mpileup -ugf myrefgenome.fa myreadsaligned.bam | bcftools call -vmO v -o myvariantscalled.vcf

## samtools mpileup
# -u : generate uncompressed VCF/BCF output
# -g : generate genotype likelihoods in BCF format
# -f FILE : faidx indexed reference sequence file

## bcftools
# -v : output variant sites only
# -m : alternative model for multiallelic and rare-variant calling
# -O : output type: 'v' uncompressed VCF [v]
# -o, --output <file> : write output to a file [standard output]

samtools mpileup :
- Collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format.

bcftools call :
- Applies the prior and does the actual calling.


VCF : Variant Call Format

Standardised format for storing the most prevalent types of sequence variations
Text file format in 2 parts : header and body.

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 ACG A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0/0:48:1:51,51 1|0:48:8:51,51
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
20 1110696 rs6040355 A G,GT 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2

Labels on the example :

Mandatory header lines
VCF header
Body
Optional header lines (meta-data about the annotations in the VCF body)
Reference alleles (GT=0)
Alternate alleles (GT>0 is an index to the ALT column)
Phased data (G and C above are on the same chromosome)
Deletion
SNP
Insertion
Other event
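To make the body layout concrete, here is a minimal Python sketch that splits one data line into its fixed columns, the INFO key/value pairs and the per-sample FORMAT values. The function `parse_vcf_record` is a hypothetical helper written for this example, not a library call. It splits on any whitespace for readability, whereas real VCF files are tab-separated, and it keeps all values as strings.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into a dictionary (illustrative helper)."""
    fields = line.split()  # assumption: real files should be split on tabs
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO: ';'-separated entries, either KEY=VALUE or a bare flag such as DB.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": qual,
        "filter": flt,
        "info": info_dict,
        "samples": {},
    }

    # FORMAT gives the ':'-separated keys shared by every sample column.
    if len(fields) > 8:
        keys = fields[8].split(":")
        for name, sample in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, sample.split(":")))
    return record


line = ("20 1110696 rs6040355 A G,GT 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB "
        "GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2")
record = parse_vcf_record(line, ["NA00001", "NA00002"])
print(record["alt"], record["info"], record["samples"]["NA00001"]["GT"])
```

For real data an established VCF library is a safer choice, since it also handles the header, typing and missing values.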


VCF : Variant Call Format

Types of variants :

SNPs
Alignment    VCF representation
ACGT         POS REF ALT
ATGT         2   C   T

Insertions
Alignment    VCF representation
AC-GT        POS REF ALT
ACTGT        2   C   CT

Deletions
Alignment    VCF representation
ACGT         POS REF ALT
A--T         1   ACG A

Complex events
Alignment    VCF representation
ACGT         POS REF ALT
A-TT         1   ACG AT

Large structural variants
VCF representation
POS REF ALT   INFO
100 T   <DEL> SVTYPE=DEL;END=300

From http://vcftools.sourceforge.net/VCF-poster.pdf
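The REF/ALT conventions above are enough to guess the variant type from a record. The sketch below is a simplified classifier written for this example (the function name and the returned labels are not taken from any tool), and it handles one ALT allele at a time.

```python
def classify_variant(ref, alt):
    """Guess the variant type from one REF/ALT pair (simplified rules)."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"      # e.g. <DEL> for large structural variants
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == 1 and len(alt) > 1 and alt.startswith(ref):
        return "insertion"     # ALT = anchor base + inserted bases
    if len(alt) == 1 and len(ref) > 1 and ref.startswith(alt):
        return "deletion"      # REF = anchor base + deleted bases
    return "complex"


for ref, alt in [("C", "T"), ("C", "CT"), ("ACG", "A"), ("ACG", "AT"), ("T", "<DEL>")]:
    print(ref, alt, classify_variant(ref, alt))
```

These rules assume the anchor-base convention shown in the examples, so a variant written in a non-normalized form may land in the "complex" category.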


VCF : header

Lines that start with #
Some mandatory lines : file format, column header
Optional header lines contain meta-data about annotations in the VCF body

Meta-data may vary a lot from one variant caller to another!

INFO vs FORMAT :
INFO = annotations on the variant as a whole
FORMAT = annotations that apply to each genotype


VCF representation of genotypes

Zygosity                VCF representation
Heterozygous            0/1, 1/2, 0/2, ...
Homozygous reference    0/0
Homozygous alternate    1/1, 2/2, 3/3, ...
Missing                 ./0, ./1, ./., ...

0 = Ref 1 = Alt1 2 = Alt2 3 = Alt3 ...
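The genotype table can be turned into a few lines of Python. This is an illustrative sketch: the function name `zygosity` and the returned labels are choices made for this example, and any genotype with at least one `.` allele is reported as missing, as in the table.

```python
import re


def zygosity(gt):
    """Classify a GT string such as '0/1' or '1|1' (illustrative helper)."""
    alleles = re.split(r"[/|]", gt)  # '/' is unphased, '|' is phased
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous reference"
    return "homozygous alternate"


for gt in ["0/1", "1|2", "0/0", "2/2", "./1"]:
    print(gt, zygosity(gt))
```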


VCF specification versions

VCF specifications evolve through versions!

Changes between VCFv4.2 and VCFv4.3 :

● VCF compliant implementations must support both LF and CR+LF newline conventions

● INFO and FORMAT tag names must match the regular expression ^[A-Za-z_][0-9A-Za-z_.]*$

● Spaces are allowed in INFO field values

● Characters with special meaning (such as ';' in INFO, ':' in FORMAT, and '%' in both) can be encoded using the percent encoding (see Section 1.2)

● The character encoding of VCF files is UTF-8.

● The SAMPLE field can contain optional DOI URL for the source data file

● Introduced ##META header lines for defining phenotype metadata

● New reserved tag "CNP" analogous to "GP" was added. Both CNP and GP use 0 to 1 encoding, which is a change from previous phred-scaled GP.

● In order for VCF and BCF to have the same expressive power, we state explicitly that Integers and Floats are 32-bit numbers. Integers are signed.

● We state explicitly that zero length strings are not allowed, this includes the CHROM and ID column, INFO IDs, FILTER IDs and FORMAT IDs.

● Meta-information lines can be in any order, with the exception of ##fileformat which must come first.

● All header lines of the form ##key= must have an ID value that is unique for a given value of "key". All header lines whose value starts with "<" must have an ID field. Therefore, also ##PEDIGREE newly requires a unique ID.

● We state explicitly that duplicate IDs, FILTER, INFO or FORMAT keys are not valid.

● A section about gVCF was added, introduced the <*> symbolic allele.

...

Changes between VCFv4.1 and VCFv4.2:

● Information field format: adding source and version as recommended fields.

● INFO field can have one value for each possible allele (code R).

● For all of the ##INFO, ##FORMAT, ##FILTER, and ##ALT metainformation, extra fields can be included after the default fields.

● Alternate base (ALT) can include *: missing due to an upstream deletion.

● Quality scores, a sentence removed: High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired.

● Examples changed a bit.


GATK Best practices

https://software.broadinstitute.org/gatk/best-practices


GATK Variant discovery

HaplotypeCaller : once for each sample

GenotypeGVCFs : once for the full cohort

https://software.broadinstitute.org/gatk/best-practices


GATK HaplotypeCaller

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK HaplotypeCaller : step 1

● Sliding window

● Count mismatches, indels and soft clips

● Measure of entropy

● Define active region according to a threshold

Active region

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK HaplotypeCaller : step 2

● Local reassembly via graph

● Traverse graph to collect most likely haplotypes

● Align haplotypes using Smith-Waterman

Likely haplotypes and candidate variant sites

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


Recovering indels and removing artifacts

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


Resolving complexity

HaplotypeCaller will use one representation for a cleaner output

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK HaplotypeCaller : step 3

● PairHMM aligns each read to each haplotype

● Considers base qualities

Likelihood of the haplotype given reads

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK HaplotypeCaller results

GVCF file for each sample

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GVCF Format : Why ?

● VCF format only includes variant sites

● When performing joint calling : How to genotype homozygous reference samples ?

○ Can be homozygous reference (good coverage, alignment of good quality, etc.)

○ Can be unknown (poor coverage, alignment of bad quality, outside of WES kit, etc.)

● Solution : recording homozygous reference sites during the calling

https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf


GVCF Format specifications

https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf


GVCF Format : example

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr2 21232195 . G A,<NON_REF> 1284.77 . MLEAC=1,0;... GT:AD:DP:GQ 0/1:38,39,0:77:99
chr2 21232196 . G <NON_REF> . . END=21232802 GT:DP:GQ:MIN_DP 0/0:94:99:63
chr2 21232803 . T C,<NON_REF> 4959.77 . DP=120;... GT:AD:DP:GQ 1/1:0,120,0:120:99
chr2 21232805 . T <NON_REF> . . END=21234696 GT:DP:GQ:MIN_DP 0/0:104:99:51
chr2 21234697 . A <NON_REF> . . END=21234697 GT:DP:GQ:MIN_DP 0/0:58:96:58
chr2 21234698 . G <NON_REF> . . END=21234722 GT:DP:GQ:MIN_DP 0/0:48:99:46
chr2 21234723 . C <NON_REF> . . END=21234726 GT:DP:GQ:MIN_DP 0/0:43:90:42

GVCF Default Bands : 1, 2, 3, 4,....., 60, 70, 80, 90, 99
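In the example, each `<NON_REF>` record with an `END=` value summarizes a whole block of homozygous reference positions. A minimal sketch of how the block length can be recovered from POS and INFO (the function name `reference_block_length` is made up for this example):

```python
def reference_block_length(pos, info):
    """Number of positions covered by a gVCF record (illustrative helper).

    Records without an END= entry in INFO cover a single position here.
    """
    for item in info.split(";"):
        if item.startswith("END="):
            end = int(item[len("END="):])
            return end - pos + 1  # POS and END are both inclusive
    return 1


# Second record of the example: POS 21232196, END=21232802.
print(reference_block_length(21232196, "END=21232802"))
```

A block is closed whenever the genotype quality moves into another band, which is why consecutive blocks in the example have different GQ values.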


GATK GenotypeGVCFs

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK GenotypeGVCFs

● Joint calling

● Determine most likely combination of allele(s) for each site

● Based on allele likelihoods (from PairHMM)

● Apply Bayes’ theorem with ploidy assumption

Genotype calls

https://software.broadinstitute.org/gatk/documentation/presentations
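The Bayes step can be illustrated on a single diploid site. The sketch below is a toy calculation, not GATK's actual model: the likelihood and prior values are invented for the example, and `genotype_posteriors` is a hypothetical helper that only applies posterior ∝ likelihood × prior and normalizes.

```python
def genotype_posteriors(likelihoods, priors):
    """Combine genotype likelihoods P(data | G) with priors P(G).

    Returns P(G | data) for every genotype (toy example, diploid site).
    """
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


# Invented numbers for one site: the reads fit a heterozygous genotype best.
likelihoods = {"0/0": 1e-8, "0/1": 1e-2, "1/1": 1e-5}
priors = {"0/0": 0.998, "0/1": 0.001, "1/1": 0.001}

posteriors = genotype_posteriors(likelihoods, priors)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

The ploidy assumption fixes which genotypes are considered: for a diploid sample and one alternate allele, only 0/0, 0/1 and 1/1 are possible.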


GATK Calling variants workflow

Figure: three BAM files are processed into one VCF file

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK Calling variants N+1 problem

Figure: three BAM files already processed into a VCF file, plus one additional BAM file

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK Calling variants tutorial


GATK: Filtering variants

https://software.broadinstitute.org/gatk/best-practices


GATK : Filtering variants

● Calling algorithms are very permissive

● Calling sets contain many false positives

● Two filtering approaches :

○ Hard filtering : using thresholds on annotations

○ Variant recalibration using machine learning

● Sensitivity vs Specificity

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : Hard filtering

● Suitable for all experiments (targeted gene, WES, small sample size, etc.)

● Goal : define annotations and thresholds to filter bad variants

● Pros :

○ Easy to perform

● Cons :

○ Hard to define annotations to use

○ Hard to define thresholds

○ May filter good variants, may keep bad variants

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : Annotations

● GATK adds annotations to each variant

● Represent properties/statistics describing each variant :

○ Sequence context

○ Depth of coverage

○ Number of reads covering each allele

○ Proportion of reads in forward/reverse orientation

○ ...

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : Hard filtering example (QualByDepth)

● QUAL divided by the unfiltered depth of the non hom-ref samples

● Avoids the inflation caused by deep coverage

● Two peaks :

○ 12 : Heterozygous

○ 32 : Homozygous alternate

● QD > 2 :

○ Eliminates most of the false positives

○ Keeps some bad variants

○ Filters some good variants

https://software.broadinstitute.org/gatk/documentation/article.php?id=6925


GATK : Hard filtering example (FisherStrand)

● Phred-scaled probability that there is a strand bias

● Identifies whether the alternate allele is seen more or less often on the forward or reverse strand than the reference allele

● Large intersection between bad and good variants

● FS < 60 :

○ Removes many bad variants

○ Keeps many bad variants

https://software.broadinstitute.org/gatk/documentation/article.php?id=6925


GATK : Hard filtering examples

https://software.broadinstitute.org/gatk/documentation/article.php?id=6925


GATK : Hard filtering recommendations

● Filtering SNPs where any :

○ QD < 2.0

○ MQ < 40.0

○ FS > 60.0

○ SOR > 3.0

○ MQRankSum < -12.5

○ ReadPosRankSum < -8.0

● Filtering Indels where any :

○ QD < 2.0

○ ReadPosRankSum < -20.0

○ InbreedingCoeff < -0.8 [1]

○ FS > 200.0

○ SOR > 10.0

[1] When sample size > 10

Warning : Threshold on maximum depth should not be used for WES data

https://software.broadinstitute.org/gatk/documentation/article.php?id=6925
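The recommended thresholds translate directly into a rule table. The sketch below is plain Python for illustration, not a GATK command: `failed_filters` and the two rule lists are names invented here, annotations missing from a variant are skipped, and the InbreedingCoeff rule should only be used when the sample size condition above holds.

```python
import operator

# (annotation, comparison that means "fails", threshold), from the slide above.
SNP_RULES = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("SOR", operator.gt, 3.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_RULES = [
    ("QD", operator.lt, 2.0),
    ("ReadPosRankSum", operator.lt, -20.0),
    ("InbreedingCoeff", operator.lt, -0.8),
    ("FS", operator.gt, 200.0),
    ("SOR", operator.gt, 10.0),
]


def failed_filters(annotations, rules):
    """Return the annotations for which a variant fails a hard filter."""
    return [
        name
        for name, fails, threshold in rules
        if name in annotations and fails(annotations[name], threshold)
    ]


snp = {"QD": 1.5, "MQ": 60.0, "FS": 10.0, "SOR": 0.7}
print(failed_filters(snp, SNP_RULES))  # a non-empty list means "filter this SNP"
```

A variant is filtered as soon as any rule fails, which matches the "where any" wording of the recommendations.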


GATK : Hard filtering tutorial


GATK : Variant Quality Score Recalibration (VQSR)

● Preferred method

● Requires :

○ DNA-seq data (not working on RNA-seq data)

○ Well curated training/truth resources (usually not available for non human organisms)

○ Large amount of variants (no targeted gene panels, etc.)

○ > 30 samples for WES data (1000G WES samples can be added if needed but not optimal)

● Based on machine learning

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : VQSR

Goal : Train on high confidence known sites to determine the probability that other sites are true or false

● Assume annotations tend to form Gaussian clusters

● Build a “Gaussian mixture model” from annotations of known variants

● Score all variants by where their annotations lie relative to the clusters

● Filter based on sensitivity to truth set

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : VQSR

Positive model (good variants) : p, fitted on true positives

Negative model (bad variants) : q, fitted on false positives

VQSLOD(x) = Log(p(x)/q(x))

Done for each annotation and then integrated into a single VQSLOD

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops
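The log-odds formula can be tried on a toy case. The sketch below is a strong simplification and not GATK's implementation: it uses a single annotation and one Gaussian per model instead of a Gaussian mixture, the means and standard deviations are invented, and the natural logarithm is assumed.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def vqslod(x, positive, negative):
    """Log-odds that x comes from the positive rather than the negative model.

    positive and negative are (mean, sd) pairs; toy one-dimensional version.
    """
    p = gaussian_pdf(x, *positive)
    q = gaussian_pdf(x, *negative)
    return math.log(p / q)


# Invented models for one annotation (QD-like values).
good = (20.0, 5.0)
bad = (5.0, 5.0)
for x in (3.0, 12.5, 22.0):
    print(x, round(vqslod(x, good, bad), 3))
```

Values above 0 mean the annotation looks more like the good variants, values below 0 more like the bad ones.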


GATK : VQSR

Figure panels : Model trained on Hapmap ; Model applied to new SNPs

DePristo et al. Nat. Genet. 2011


GATK : VQSR workflow

Original SNPs + Original Indels

SNP MODE : VariantRecalibrator, then ApplyRecalibration

Recalibrated SNPs + Original Indels

INDEL MODE : VariantRecalibrator, then ApplyRecalibration

Recalibrated SNPs + Recalibrated Indels

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : VQSR training and truth resources

● Training : input variants that overlap with these training sites to build the model

● Truth : determine where to set the cutoff

● Known : only for reporting purposes

● Prior : Phred-scaled estimate of data accuracy

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : VQSR SNP human resources

● Hapmap
○ Training
○ Truth
○ Prior = 15

● Omni
○ Training
○ Truth
○ Prior = 12

● 1000G SNPs High confidence
○ Training
○ Prior = 10

● dbSNP
○ Known
○ Prior = 2

https://software.broadinstitute.org/gatk/documentation/article?id=1259

Annotations : QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR, DP [1], InbreedingCoeff

[1] Not to be used for WES
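Because the priors are Phred-scaled, they can be converted back to a probability that a site from the resource is correct. A short sketch, assuming the usual Phred relation P(correct) = 1 - 10^(-Q/10); the function name is chosen for this example.

```python
def phred_to_probability(q):
    """Probability of being correct for a Phred-scaled value Q."""
    return 1.0 - 10.0 ** (-q / 10.0)


# Priors of the SNP resources listed above.
for name, prior in [("Hapmap", 15), ("Omni", 12), ("1000G", 10), ("dbSNP", 2)]:
    print(name, prior, round(phred_to_probability(prior), 4))
```

A higher prior therefore expresses more trust in the resource: Q10 corresponds to 90 % and Q2 to roughly 37 %.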


GATK : VQSR Indel human resources

● Mills indels
○ Training
○ Truth
○ Prior = 12

● dbSNP
○ Known
○ Prior = 2

https://software.broadinstitute.org/gatk/documentation/article?id=1259

Annotations : QD, DP [1], FS, SOR, ReadPosRankSum, MQRankSum, InbreedingCoeff

[1] Not to be used for WES


GATK : VQSR plots

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : VQSR tranches plots

Figure: tranches at truth sensitivity (%) 90, 99, 99.9 and 100

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


GATK : VQSR output example

#CHROM POS FILTER INFO

chr2 121456 VQSRTrancheSNP99.9to100.0 AC=2;..;NEGATIVE_TRAIN_SITE;VQSLOD=-4.532;culprit=MQ

chr2 121782 PASS AC=24;..;VQSLOD=3.278;culprit=FS

chr2 121987 VQSRTrancheINDEL99.0to99.9 AC=1;..;POSITIVE_TRAIN_SITE;VQSLOD=-2.312;culprit=SOR

From GATK Best Practices for Variant Discovery Presentation, https://software.broadinstitute.org/gatk/download/workshops


Variant normalization


Why is variant normalization necessary?

Every variant in the human genome has various representations!

When merging variants from multiple variant callers for the same sample

⇒ which variants are common between callers ?

When comparing variants from the same variant caller but from different samples

⇒ which variants are shared between samples ?

A normalized variant is parsimonious and left-aligned


Variant normalization - Parsimony

A parsimonious variant is represented in as few nucleotides as possible without an allele of length 0.

If every allele of a variant starts with the same nucleotide, and removing that nucleotide from each allele does not result in an empty allele ⇒ the variant has a superfluous nucleotide on its left side!

https://genome.sph.umich.edu/wiki/File:Normalization_mnp.png

Parsimonious !


A variant is left aligned if and only if it is no longer possible to shift its position to the left while keeping the length of all its alleles constant.

Variant normalization - Left Alignment
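Parsimony and left alignment can be combined in one short routine. The sketch below follows the trimming and extending procedure commonly described for variant normalization; it is an illustration under stated assumptions, not the code of any particular tool. The function `normalize_variant` is hypothetical, the reference is passed as a plain string, positions are 1-based, the alleles (REF first) must not all be identical, and a variant that would need to move left of position 1 raises an error.

```python
def normalize_variant(reference, pos, alleles):
    """Return (pos, alleles) made parsimonious and left-aligned.

    reference: reference sequence as a string (position 1 = reference[0])
    pos:       1-based position of the first base of the alleles
    alleles:   [REF, ALT1, ...] as strings
    """
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # Trim the rightmost base while every allele ends with the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # An empty allele is not allowed: extend every allele to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend left of position 1")
            pos -= 1
            base = reference[pos - 1]
            alleles = [base + a for a in alleles]
            changed = True
    # Trim superfluous bases on the left, keeping at least one base per allele.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


reference = "GGGCACACACAGGG"
# Deletion of one CA unit, written at the right end of the repeat.
print(normalize_variant(reference, 8, ["CAC", "C"]))
```

Two callers that report the same deletion at different places in the repeat end up with the same normalized record, which is what makes their call sets comparable.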

https://genome.sph.umich.edu/wiki/File:Normalization_str.png

Empty allele is not allowed !

Ready for annotating variants!!