Functionally annotate genomic variants
DESCRIPTION
This seminar addresses what to make of the identified variants: specifically, how to evaluate their quality, prioritize them, and functionally annotate them.
TRANSCRIPT
Functionally annotate variants
The answer is not always 42!
The Queensland Brain Institute | April 11, 2023
[by Swamibu]
Quick recap: DNA sequence read mapping
• Alignment -> Improving -> Variant calling -> Filtering
• Resulting file type: vcf
• “What are the differences to the reference genome?”
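As a rough illustration of what the vcf file holds, the sketch below reads the first five tab-separated columns (CHROM, POS, ID, REF, ALT) of each data line. The `Variant` type, the `read_variants` helper, and the two toy records are assumptions made for this example, not part of the seminar material; a real project would normally use an established VCF library.

```python
from typing import Iterator, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int      # 1-based position on the reference
    ref: str      # reference allele
    alts: tuple   # one or more alternate alleles


def read_variants(lines) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping '#' header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        yield Variant(chrom, int(pos), ref, tuple(alt.split(",")))


# Toy records, made up for illustration (not real calls).
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.",
    "chr1\t22222\t.\tC\tA,T\t30\tPASS\t.",
]

variants = list(read_variants(EXAMPLE_VCF))
```

Each record answers the question on the slide: at this position the sample carries `alts` where the reference has `ref`.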
Searching the haystack: 3.5 million SNPs
[by Darwin Bell]
Finding the causal variant in ideal situations*
• Spot the variant that is common amongst all affected but absent in all unaffected
• This variant is in a gene with known function and causes the protein to be disrupted
* e.g. some rare autosomal disease
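In the ideal case this search is plain set arithmetic. The sketch below assumes each individual is represented as a set of variant keys; the function name and the toy keys are hypothetical.

```python
def candidate_variants(affected, unaffected):
    """Variants shared by all affected and absent from all unaffected.

    Each argument is a list of sets, one set of variant keys per individual.
    """
    if not affected:
        return set()
    shared = set.intersection(*affected)            # present in every affected
    seen_in_unaffected = set().union(*unaffected)   # present in any unaffected
    return shared - seen_in_unaffected


# Toy data, made up for illustration.
affected = [{"v1", "v2", "v3"}, {"v1", "v3", "v4"}]
unaffected = [{"v3", "v5"}, {"v5"}]
candidates = candidate_variants(affected, unaffected)
```

Here only `v1` survives: `v3` is shared by all affected but is also carried by an unaffected individual.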
In reality
• You can’t spot the difference
– You deal with ~3.5 million SNPs
– You need to employ methods that systematically identify variants that stand out: GWAS
• GWAS taught us that it is unlikely to find a causal common variant for complex diseases
– Rare variant?
– A bunch of rare and common variants?
– An even more complex model?
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73. PubMed PMID: 20981092
Production Informatics and Bioinformatics
Product and time (per one-flowcell project)
• fastq: 5 days
• bam, vcf,…: 3 weeks
• paper: >6 months
Stages
• Basic Production Informatics: produce raw sequence reads
• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs)
• Bioinformatics Research: analyze the data; uncover the biological meaning
• Statistical genetics
Discount erroneous SNPs ?
• Maybe most of my SNPs are not real and by excluding them I can find the causal variant?
• Biological verification
– Re-sequencing with a *different* method (e.g. Sanger)
• “Yes the individual has a variant at location X”
– But you can’t do that for > 3 million SNPs
• Bioinformatics verification
– All quality measures are just proxies because we do not know which variants are real
Quality control for variants
• Transition (A<->G; C<->T) to transversion (purine<->pyrimidine) ratio
• Concordance with known variants: dbSNP, HapMap, 1000 Genomes
• Mendelian errors
“… of de novo germline base substitution mutations to be approx. 10^-8 per base pair per generation”
1000 Genomes Project; Illumina
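Two of these proxies are simple enough to sketch. The code below assumes SNPs are given as (ref, alt) pairs of single uppercase bases and genotypes as pairs of alleles; all function names are hypothetical, and the toy inputs are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snps
             if ref != alt and not is_transition(ref, alt))
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv


def is_mendelian_error(child, mother, father):
    """True if the child's two alleles cannot come one from each parent."""
    a, b = child
    consistent = (a in mother and b in father) or (a in father and b in mother)
    return not consistent


# Toy inputs, made up for illustration.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(snps)
trio_error = is_mendelian_error(("A", "G"), ("A", "A"), ("A", "A"))
trio_ok = is_mendelian_error(("A", "G"), ("A", "A"), ("G", "G"))
```

A genuine de novo mutation also shows up as a Mendelian error, but at the quoted mutation rate such events are rare, so most Mendelian errors in a call set point to genotyping problems.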
Just look at exons ?
• We know that there is a reduction of genetic variation in the neighborhood of genes, due to selection at linked sites (1000 genomes project).
• We could focus on them to get started
– A variant in a protein coding region is likely to be functional
– We are more likely to find the meaning of a variant in a protein coding region
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73. PubMed PMID: 20981092
Influence of a variant in protein coding region
• Nonsynonymous SNPs
– Introduce stop codon
– Disrupt structure
• Disrupt domain
• Indels
– Cause frame shift
• Synonymous SNPs
– Alter translation efficiency
• But, on average, each “normal” person is found to carry
– 250 to 300 loss-of-function variants in annotated genes
– 50 to 100 variants previously implicated in inherited disorders.
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73. PubMed PMID: 20981092
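The categories above can be told apart mechanically once the affected codon is known. The following is a minimal sketch using the standard genetic code; `classify_snp` and `indel_causes_frameshift` are hypothetical helper names, and the sketch ignores strand, splicing, and transcript choice, which real annotation tools have to handle.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(codon, offset, alt_base):
    """Classify a single-base change at position `offset` (0-2) of `codon`."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "stop gained"
    if before == "*":
        return "stop lost"
    return "nonsynonymous (missense)"


def indel_causes_frameshift(ref, alt):
    """An indel shifts the frame unless its net length is a multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0
```

For example, `classify_snp("TGG", 2, "A")` turns tryptophan (TGG) into the stop codon TGA, while `classify_snp("CTT", 2, "C")` leaves leucine unchanged.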
Intergenic variants are also important
• Disrupt regulatory elements
– Transcription factor binding sites
– Splicer
– ncRNA transcripts
– mRNA editing
• Causing changes in the expression of proteins that have a downstream effect on their regulatory targets
[Figure: genes Blue and Green with their exons, plus promoter, enhancer, silencer, ncRNA, and splicing]
Catching a villain does not bring down the mob
• An autosomal translocation disrupting the function of the DISC gene causes schizophrenia (SZ) in one family
• However, this is a rare event and cannot explain the heritability of SZ in the larger population.
Millar JK, Wilson-Annan JC, Anderson S, Christie S, Taylor MS, Semple CA, Devon RS, Clair DM, Muir WJ, Blackwood DH, Porteous DJ (May 2000). "Disruption of two novel genes by a translocation co-segregating with schizophrenia". Hum. Mol. Genet. 9 (9): 1415–23. doi:10.1093/hmg/9.9.1415. PMID 10814723.
[Figure: translocation between chr1 and chr11 disrupting DISC]
Isolating SNPs that collectively explain liability
• Different populations may have their own “version” of a change that has the same downstream effects.
– Unlikely a “one-variant one-phenotype” case for many diseases
• Prioritize variants or sets of variants to focus analysis on
– Variants likely to be functional
– Involved in the same pathway
• Model disease liability on this “subset” -> Statistical genetics: find variants with relatively large effect sizes that are able to explain a proportion of disease heritability in the population.
1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73. PubMed PMID: 20981092
Functional variants
• SIFT
– Assigns a pre-computed score that says how likely this substitution is to be tolerated, given the sequences of homologous proteins.
• PolyPhen
– Machine learning method predicting the impact of an amino acid substitution on the protein’s structure and function.
• ANNOVAR
– Annotates SNPs if they overlap functional elements, e.g. domains, transcription factor binding sites, splice variants,…
Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790
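The overlap-based annotation idea can be sketched in a few lines. This is not ANNOVAR's actual interface; the `Feature` type, the `annotate` function, and the toy features are assumptions for illustration, and the linear scan would be replaced by an indexed lookup for millions of variants.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    label: str


def annotate(chrom, pos, features):
    """Return the labels of all features that contain the position."""
    return [f.label for f in features
            if f.chrom == chrom and f.start <= pos <= f.end]


# Toy features, made up for illustration.
features = [
    Feature("chr1", 100, 200, "exon"),
    Feature("chr1", 150, 160, "TF binding site"),
    Feature("chr2", 100, 200, "exon"),
]
labels = annotate("chr1", 155, features)
```

A variant at chr1:155 picks up both labels, while a position outside every interval returns an empty list and stays unannotated.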
Custom filter approach with Excel
• Filter annotated variants against your requirements using Excel to quickly identify a manageable list of “interesting” variants
• Approach taken by the Diamantina (Paul Leo)
exonic
Carried by 90% of affected
Carried by 10% of un-affected
Loss of function
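The same filter can be expressed in code when the spreadsheet gets too large. The sketch below assumes each annotated variant is a dictionary with the hypothetical fields shown, and it reads the slide's thresholds as "at least 90% of affected" and "at most 10% of unaffected", which is an interpretation rather than something the slide states.

```python
def is_interesting(v, min_affected=0.9, max_unaffected=0.1):
    """Apply the four example criteria to one annotated variant."""
    return (
        v["region"] == "exonic"
        and v["loss_of_function"]
        and v["affected_carrier_fraction"] >= min_affected
        and v["unaffected_carrier_fraction"] <= max_unaffected
    )


# Toy annotated variants, made up for illustration.
annotated = [
    {"id": "var1", "region": "exonic", "loss_of_function": True,
     "affected_carrier_fraction": 0.95, "unaffected_carrier_fraction": 0.05},
    {"id": "var2", "region": "intronic", "loss_of_function": True,
     "affected_carrier_fraction": 0.95, "unaffected_carrier_fraction": 0.05},
    {"id": "var3", "region": "exonic", "loss_of_function": True,
     "affected_carrier_fraction": 0.50, "unaffected_carrier_fraction": 0.05},
    {"id": "var4", "region": "exonic", "loss_of_function": False,
     "affected_carrier_fraction": 1.00, "unaffected_carrier_fraction": 0.00},
]
shortlist = [v["id"] for v in annotated if is_interesting(v)]
```

Keeping the thresholds as parameters makes it easy to loosen or tighten the filter until the list is short enough to inspect by hand.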
Three things to remember
1. A “one-variant one-phenotype” model is rather unlikely
2. Variants in non-protein-coding regions are also important
3. New methods (bioinformatics and statistical genetics) need to be developed to address this problem
Addressed in upcoming discussion session run by Dr. Jake Gratten
Next week:
Abstract: This session will focus on the differences between standard DNA mapping and RNAseq-specific transcript mapping: identifying splice variants and isoforms. Transcript quantification and the genomic variants that can be identified from RNAseq data will also be discussed.