Saurabh Sinha
Mayo-Illinois Computational Genomics Workshop
June 14, 2019
Acknowledgment for some slides to Arián Avalos
§ Molecular Markers
§ Genome Wide Association Studies (GWAS)
§ Functional Effects
§ What is a SNP and a SNV?
  § SNP: Single Nucleotide Polymorphism
  § SNV: Single Nucleotide Variant
I1: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I2: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I3: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I4: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I5: AACGAGCTAGCGATCGATCGACAACGACTACGAGGT
I6: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I7: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
I8: AACGAGCTAGCGATCGATCGACTACGACTACGAGGT
§ A SNV is any single-nucleotide change (e.g. a somatic mutation, even an artifact).
§ A SNP has defining criteria
  § A polymorphic SNV, with “major” and “minor” alleles
  § Sometimes defined by a frequency threshold (e.g. a minor allele frequency of at least 5%)
§ For reference, the 1000 Genomes project identified ~41 Million SNPs across ~1000 Individuals.
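A minimal sketch of how the SNP criteria could be checked on the eight example sequences I1 to I8 from the earlier slide. The function names and the 5% cutoff are illustrative assumptions, not part of the workshop material, and each individual is treated as a single haploid sequence.

```python
from collections import Counter

# Example sequences I1..I8 from the slides (one sequence per individual).
REF = "AACGAGCTAGCGATCGATCGACTACGACTACGAGGT"
ALT = "AACGAGCTAGCGATCGATCGACAACGACTACGAGGT"
SEQUENCES = [REF, ALT, REF, REF, ALT, REF, REF, REF]


def site_allele_frequencies(sequences):
    """Return {position: {allele: frequency}} for every polymorphic column."""
    n = len(sequences)
    polymorphic = {}
    for pos, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        if len(counts) > 1:  # more than one allele observed at this site
            polymorphic[pos] = {allele: c / n for allele, c in counts.items()}
    return polymorphic


def minor_allele_frequency(freqs):
    """Frequency of the least common allele at one site."""
    return min(freqs.values())


sites = site_allele_frequencies(SEQUENCES)
for pos, freqs in sites.items():
    maf = minor_allele_frequency(freqs)
    # Hypothetical SNP criterion: minor allele frequency of at least 5%.
    print("site", pos + 1, freqs, "MAF =", maf, "passes cutoff:", maf >= 0.05)
```

With only eight sequences the smallest observable non-zero frequency is 1/8, so a 5% cutoff only becomes meaningful in much larger samples.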
§ Both types of variants are relevant depending on the field
  § Population geneticists conducting association tests will focus on SNPs
  § Cancer geneticists will instead be interested in SNVs
§ The terminology is further complicated in non-human biology (e.g. polyploidy, horizontal gene transfer, etc.)
§ Example:
  § Cystic Fibrosis and the CFTR gene mutations
§ Approach: Genetic Linkage Analysis
  § Genotype family members (some individuals carrying the disease)
  § Find a marker that correlates with the disease
  § Disease gene lies close to this marker
§ Limitations of Genetic Linkage Analysis
§ Requires data from entire families, preferably large ones, where the trait is segregating
§ Linkage analysis is less successful with common diseases, e.g., heart disease or cancers.
§ Requires single, large effect loci
§ Hypothesize that common diseases are influenced by common genetic variation in the population
§ Implications:
  § Any individual variation (SNP) will have relatively small correlation with the disease
  § Multiple common alleles together influence the disease phenotype
§ This argues for population- rather than family-based studies.
Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822
§ Zhang X. et al. (2012). PLoS Comput Biol 8(12): e1002828.
§ Bush W. S. & Moore J. H. (2012) PLoS Comput Biol 8(12): e1002822.
§ Microarray – can assay 0.5 – 1.0 Million or more SNPs
§ Whole-genome sequencing (WGS) – assays (near) complete SNP profile
§ In non-human genetics, reduced-representation methods provide a middle-ground.
§ Case / Control – qualitative, usually binary measure (e.g. disease vs. no disease)
§ Quantitative – continuous measure, usually of complex phenotypes (e.g. blood pressure, LDL levels)
§ Possible to look at more than one phenotype?
§ Case / Control

| Individual | Sequence | Disease? |
|---|---|---|
| I1 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I2 | AACGAGCTAGCGATCGATCGACAACGACTACGAGGT | + |
| I3 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I4 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I5 | AACGAGCTAGCGATCGATCGACAACGACTACGAGGT | + |
| I6 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I7 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I8 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
§ Before analysis and interpretation, a few considerations:
  § Correlation is not causation
  § Linkage disequilibrium (see later)
  § Population structure (see later)
  § Phenotyping
§ Further consider that even if the analysis is successful, findings can be hard to interpret
§ Example:
  § SNP correlates well with heart disease
  § Biochemical link? Behavioral link (you particularly like bacon…)?
§ Case vs. Control
| Individual | Sequence | Disease? |
|---|---|---|
| I1 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I2 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I3 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I4 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I5 | AACGAGCTAGCGATCGATCGACAACGACTACGAGGT | + |
| I6 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I7 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I8 | AACGAGCTAGCGATCGATCGACAACGACTACGAGGT | + |
| I9 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I10 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | + |
| I11 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I12 | AACGAGCTAGCGATCGATCGACTACGACTACGAGGT | - |
| I13 | AACGAGCTAGCGATCGATCGACAACGACTACGAGGT | - |
| I14 | AACGAGCTAGCGATCGATCGACAACGACTACGAGGT | + |

| | A | T |
|---|---|---|
| Case | 3 | 1 |
| Control | 1 | 9 |
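The contingency table can be rebuilt by counting. This sketch encodes, for individuals I1 to I14, the allele at the polymorphic site and the disease status as transcribed from the slide; the string encoding and the function name are assumptions made for illustration.

```python
from collections import Counter

# Allele at the polymorphic site and disease status (+ = case, - = control)
# for individuals I1..I14, in order.
ALLELES = "TTTTATTATTTTAA"
STATUS = "----+--+-+---+"


def contingency_table(alleles, status):
    """Count individuals by (disease status, allele)."""
    counts = Counter(zip(status, alleles))
    return {
        "Case": {"A": counts[("+", "A")], "T": counts[("+", "T")]},
        "Control": {"A": counts[("-", "A")], "T": counts[("-", "T")]},
    }


table = contingency_table(ALLELES, STATUS)
print(table)
```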
§ Case vs. Control
§ The Fisher’s Exact Test
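[Figure: Venn diagram of All (14 individuals), Case (4) and A (4); 3 individuals are both Case and A]

| | A | T |
|---|---|---|
| Case | 3 | 1 |
| Control | 1 | 9 |

p-value < 0.05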
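As a rough check of the p-value quoted for this table, the sketch below implements a two-sided Fisher's Exact Test using only the standard library. The function name is hypothetical, and the two-sided rule used (summing all tables no more probable than the observed one) is one common convention, assumed here.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Case: A = 3, T = 1; Control: A = 1, T = 9
p_value = fisher_exact_2x2(3, 1, 1, 9)
print(round(p_value, 4))
```

For this table the result is 41/1001, about 0.041, consistent with the p-value < 0.05 shown on the slide.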
§ A favored alternative to Fisher’s Exact Test is the Chi-Squared Test.
§ We conduct this test on EACH SNP separately, and get a corresponding p-value.
§ The smallest p-values point to the SNPs most associated with the disease.
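The per-SNP procedure can be sketched as follows: compute a chi-squared statistic and p-value for each SNP's 2x2 allele table, then sort by p-value. The SNP names and counts are made up for illustration, and the p-value uses the closed form for one degree of freedom rather than a statistics library.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical allele counts per SNP: (case_A, case_T, control_A, control_T).
snp_tables = {
    "snp1": (60, 40, 45, 55),
    "snp2": (50, 50, 52, 48),
    "snp3": (80, 20, 40, 60),
}

results = {name: chi2_2x2(*counts) for name, counts in snp_tables.items()}
ranked = sorted(results, key=lambda name: results[name][1])
for name in ranked:
    stat, p_value = results[name]
    print(name, round(stat, 2), p_value)
```

The sketch assumes every expected count is non-zero; a site where one allele or one phenotype is absent would need to be filtered out first.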
§ Fisher’s Exact Test and the Chi-Squared Test, as used above, are both allelic association tests, i.e. we test whether A instead of T at the polymorphic site correlates with the disease.
§ In a genotypic association test, each position is a combination of two alleles, e.g. AA, TT, AT
§ We therefore correlate genotype with phenotype of the individual
§ There are various options for a Case vs. Control genotypic association test
§ Example:
  § Dominant Model

| | AA or AT | TT |
|---|---|---|
| Case | ? | ? |
| Control | ? | ? |
§ Example:
  § Dominant Model
  § Recessive Model

| | AA | AT or TT |
|---|---|---|
| Case | ? | ? |
| Control | ? | ? |
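The dominant and recessive groupings amount to collapsing per-genotype counts into two columns. The sketch below shows that step; the genotype counts are invented (the slides leave the cells as "?"), and the function name is hypothetical.

```python
# Hypothetical genotype counts; the slides leave these cells as "?".
GENOTYPE_COUNTS = {
    "Case": {"AA": 20, "AT": 50, "TT": 30},
    "Control": {"AA": 10, "AT": 40, "TT": 50},
}


def collapse(counts, model):
    """Collapse a 2x3 genotype table to 2x2 under a model for allele A."""
    if model == "dominant":      # at least one copy of A vs. none
        groups = {"AA or AT": ("AA", "AT"), "TT": ("TT",)}
    elif model == "recessive":   # two copies of A vs. everything else
        groups = {"AA": ("AA",), "AT or TT": ("AT", "TT")}
    else:
        raise ValueError(f"unknown model: {model}")
    return {
        status: {label: sum(row[g] for g in members)
                 for label, members in groups.items()}
        for status, row in counts.items()
    }


dominant = collapse(GENOTYPE_COUNTS, "dominant")
recessive = collapse(GENOTYPE_COUNTS, "recessive")
print(dominant)
print(recessive)
```

Either collapsed table can then be tested like any other 2x2 table.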
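§ Example:
  § Dominant Model
  § Recessive Model
  § 2x3 Table

| | AA | AT | TT |
|---|---|---|---|
| Case | $O_{11}$ | $O_{12}$ | $O_{13}$ |
| Control | $O_{21}$ | $O_{22}$ | $O_{23}$ |

Chi-Squared Test:

$$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$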
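A small sketch of the Chi-Squared Test on a 2x3 genotype table. The observed counts are hypothetical, and the expected counts are assumed to come from the usual independence model, (row total x column total) / grand total. With 2 degrees of freedom the p-value has the closed form exp(-x/2), so no statistics library is needed.

```python
from math import exp


def chi2_statistic(observed):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected count if genotype and phenotype were independent.
            e_ij = row_totals[i] * col_totals[j] / n
            stat += (o_ij - e_ij) ** 2 / e_ij
    return stat


# Hypothetical counts: rows are Case / Control, columns are AA / AT / TT.
observed = [[20, 50, 30],
            [10, 40, 50]]

stat = chi2_statistic(observed)
# A 2x3 table has (2 - 1) * (3 - 1) = 2 degrees of freedom.
p_value = exp(-stat / 2)
print(round(stat, 3), round(p_value, 5))
```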
§ Quantitative Phenotypes

[Figure: scatter plot of Y = Phenotype against X = Genotype, with fitted line y = 1.683x + 5.834, R² = 0.8644]

• $Y = a + bX$
• If no association, $b \approx 0$
• The more $b$ differs from 0, the stronger the association
• This is called linear regression
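The fit can be sketched with ordinary least squares. The data are invented for illustration, and the genotype is assumed to be coded as the number of copies of one allele (0, 1 or 2), a common convention that the slide does not state explicitly.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + b*X; returns (a, b, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope: effect of one extra allele copy
    a = mean_y - b * mean_x       # intercept
    r_squared = sxy ** 2 / (sxx * syy)
    return a, b, r_squared


# Hypothetical data: allele copies per individual and a measured phenotype.
genotype = [0, 0, 1, 1, 2, 2]
phenotype = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

a, b, r_squared = fit_line(genotype, phenotype)
print(a, b, round(r_squared, 4))
```

In practice the slope would be accompanied by a significance test, which this sketch leaves out.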
§ Quantitative Phenotypes
§ Another statistical test commonly used on GWAS matrices is Analysis of Variance (ANOVA)
§ Statistical models for GWAS can get quite involved (can give references on request)
Lambert et al., 2013: Nature Genetics 45, 1452
§ Multiple Hypothesis Correction
§ What does p-value = 0.01 mean?
§ It means that, if there were no true association, a Genotype x Phenotype correlation at least this strong would arise just by chance only 1% of the time.
§ What if we repeat the test for 1 Million SNPs? Of those tests, about 1% (10,000 SNPs) will show this level of correlation just by chance (and by definition)
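A quick simulation makes the point concrete. It assumes that every SNP is unassociated, so that each p-value is uniformly distributed between 0 and 1, and it uses 100,000 tests rather than 1 Million to keep the run short. The function name and seed are arbitrary.

```python
import random


def count_null_hits(n_tests, alpha, seed=0):
    """Count tests with p < alpha when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under the null hypothesis, p-values are uniform on [0, 1).
    return sum(rng.random() < alpha for _ in range(n_tests))


n_tests = 100_000   # scaled down from the 1 Million SNPs on the slide
alpha = 0.01
hits = count_null_hits(n_tests, alpha)
expected = n_tests * alpha
print(hits, "false positives observed; about", round(expected), "expected")
```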
§ Multiple Hypothesis Correction
§ Bonferroni (seen in Statistics lecture)
  § Multiply the p-value by the number of tests
  § So if the original SNP had p-value $p$, the new p-value is defined as $p' = p \times N$
  § With $N = 10^6$, a p-value of $10^{-9}$ is downgraded to:

$$p' = 10^{-9} \times 10^{6} = 10^{-3}$$

This is quite good!

§ False Discovery Rate (seen in Statistics lecture)
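Both corrections can be sketched in a few lines. The Bonferroni function follows the rule on the slide, capped at 1. The slide does not say which False Discovery Rate procedure is meant; the Benjamini-Hochberg adjustment is assumed here as a common choice. The p-values are invented.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from five tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

Bonferroni controls the chance of even one false positive, while the FDR adjustment controls the expected fraction of false positives among the reported hits, so it is less strict.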
§ So far we have tested each SNP separately. However, recall our hypothesis that common diseases are influenced by common variants
§ Maybe considering two SNPs together will identify a stronger correlation with phenotype
§ Main problem: Number of pairs ~ $N^2$
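To get a feel for the scale, the following sketch counts unordered SNP pairs for a panel of 1 Million SNPs and compares Bonferroni-style significance thresholds for single-SNP and pairwise testing. The panel size and the 0.05 level are assumptions chosen for illustration.

```python
from math import comb

n_snps = 1_000_000            # assumed panel size
n_pairs = comb(n_snps, 2)     # unordered pairs: N * (N - 1) / 2

alpha = 0.05                  # assumed family-wise significance level
single_threshold = alpha / n_snps
pair_threshold = alpha / n_pairs

print(n_pairs)
print(single_threshold, pair_threshold)
```

Roughly 5 x 10^11 pairs means both far more computation and a far stricter per-test threshold than single-SNP testing.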
§ Further consider, in genotyping we may be using a Microarray (e.g. 0.5 – 1 Million SNPs)
§ But there are many more sites in the human genome where variation may exist. Will we then miss any causal variant outside the panel of ~1 Million?
§ Not necessarily
§ Two sites close to each other may vary in a highly correlated manner; this is Linkage Disequilibrium (LD)
§ In this situation, a lack of recombination events has made the inheritance of those two sites dependent
§ If two such sites have high LD, then one site can serve as proxy for the other
§ So if sites X & Y have high LD, and X is in the Microarray, then knowing the allelic form of X informs the allelic form at Y
§ In this way a reduced panel can represent a larger number (all?) of the common SNPs
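One common way to quantify how well site X stands in for site Y is the squared correlation r² between their alleles. The slides do not name a specific LD measure, so r² is an assumption here, and the haplotypes are invented.

```python
def r_squared(haplotypes, allele_x, allele_y):
    """LD measure r^2 between two sites, from a list of (x, y) haplotypes."""
    n = len(haplotypes)
    p_x = sum(1 for x, _ in haplotypes if x == allele_x) / n
    p_y = sum(1 for _, y in haplotypes if y == allele_y) / n
    p_xy = sum(1 for x, y in haplotypes if x == allele_x and y == allele_y) / n
    d = p_xy - p_x * p_y          # deviation from independent inheritance
    return d * d / (p_x * (1 - p_x) * p_y * (1 - p_y))


# Hypothetical haplotypes: (allele at site X, allele at site Y).
perfect = [("A", "G")] * 4 + [("T", "C")] * 4
partial = [("A", "G")] * 3 + [("A", "C")] + [("T", "C")] * 3 + [("T", "G")]

print(r_squared(perfect, "A", "G"))   # X fully predicts Y
print(r_squared(partial, "A", "G"))   # X is only a partial proxy for Y
```

Both sites must be polymorphic in the sample, otherwise the denominator is zero.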
§ A problem is that if X correlates with a disease, the causal variant may be either X or Y
§ In many cases, GWAS is able to find SNPs that have a significant association with disease.
§ GWAS Catalog: http://www.genome.gov/26525384
§ Yet, final predictive power (ability to predict disease from genotype) is limited for complex diseases.
§ “Finding the Missing Heritability of Complex Diseases” http://www.genome.gov/27534229
§ Increasingly, whole-exome and even whole-genome sequencing used for variant detection
§ Taking on the non-coding variants. Use functional genomics data as template
§ Network-based analysis rather than single-site or site-pairs analysis
§ Complement GWAS with family-based studies
§ How do we predict whether a variant is likely to affect protein function?
§ Case:
  § I found a SNP inside the coding sequence. Knowing how to translate the gene sequence to a protein sequence, I discovered that this is a non-synonymous change, i.e., the encoded amino acid changes. This is an nsSNP.
§ Will that impact the protein’s function?
§ (And I don’t quite know how the protein functions in the first place ...)
§ Two popular approaches:
§ PolyPhen 2.0
  § Adzhubei, I. A. et al. (2010). Nat Methods 7(4):248-249
§ SIFT
  § Kumar P. et al. (2009). Nat Protoc 4(7):1073-1081
§ PolyPhen 2.0
§ The PolyPhen 2.0 pipeline uses existing data sets for training and later evaluation of target data.
§ Specifically the HumDiv database, which is
  § A compilation of all the damaging mutations with known effects on molecular function
  § A collection of non-damaging differences between human proteins and those of closely related mammalian homologs
§ A look at the Multiple Sequence Alignment (MSA) part of the PolyPhen 2.0 pipeline:
§ Of interest is the Position-Specific Independent Counts (PSIC) score.
§ This score reflects the amino acid’s frequency at the specific position in the sequence given an MSA
§ Example:
§ To derive the PSIC score we first calculate the frequency of each amino acid:
$$p(a, i) = \frac{n(a, i)_{\mathrm{eff}}}{\sum_b n(b, i)_{\mathrm{eff}}}$$
§ The idea:
  § $n(a, i)_{\mathrm{eff}}$ is not the raw count of amino acid “$a$” at position $i$; rather, it is adjusted for the many closely related sequences in the MSA

$$p(a, i) = \frac{n(a, i)_{\mathrm{eff}}}{\sum_b n(b, i)_{\mathrm{eff}}}$$

§ The PSIC score of a SNP $a \to b$ at position $i$ is given by:

$$\mathrm{PSIC}(a \to b, i) \propto \ln \frac{p(b, i)}{p(a, i)}$$
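A toy sketch of the two formulas. The effective counts are taken as given inputs with invented values, since the weighting that turns an MSA into effective counts is not described in the slides. The proportionality constant is assumed to be 1, and the function names are hypothetical; this is not the PolyPhen 2.0 implementation.

```python
from math import log


def psic_frequencies(n_eff):
    """p(a, i): effective count of each amino acid over the column total."""
    total = sum(n_eff.values())
    return {aa: count / total for aa, count in n_eff.items()}


def psic_score(n_eff, a, b):
    """ln(p(b, i) / p(a, i)) for the substitution a -> b at one position."""
    p = psic_frequencies(n_eff)
    return log(p[b] / p[a])


# Hypothetical effective counts at one alignment position i.
n_eff_at_i = {"L": 8.0, "I": 1.5, "V": 0.5}

print(psic_frequencies(n_eff_at_i))
print(psic_score(n_eff_at_i, "L", "V"))   # common residue -> rare residue
print(psic_score(n_eff_at_i, "V", "L"))   # rare residue -> common residue
```

A change from a frequently observed residue to a rarely observed one gives a strongly negative value under this formula. An amino acid that never appears in the column would need a pseudocount, which is omitted here.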
§ Ultimately your derived score can be compared with the existing scores from HumDiv
§ Classification
§ Naive Bayes method
  § A type of classifier. Other classification algorithms include “Support Vector Machine”, “Decision Tree”, “Neural Net”, “Random Forest” etc.
  § Sometimes called “Machine Learning”
§ What is a classification algorithm?
§ What is a Naive Bayes method/classifier?
Training Data (“Supervised Learning”)

Positive examples
§ $x_{11}, x_{12}, \dots, x_{1n}$ +
§ $x_{21}, x_{22}, \dots, x_{2n}$ +
§ …

Negative examples
§ $x_{i+1,1}, x_{i+1,2}, \dots, x_{i+1,n}$ -
§ $x_{i+2,1}, x_{i+2,2}, \dots, x_{i+2,n}$ -
§ …

Training Data → MODEL
Data Vector $x_1, x_2, \dots, x_n$ → MODEL → Yes or No
Training Data → $\Pr(x_1 \mid +), \Pr(x_1 \mid -), \Pr(x_2 \mid +), \Pr(x_2 \mid -), \dots, \Pr(x_n \mid +), \Pr(x_n \mid -)$

$$\Pr(+ \mid x_1, x_2, \dots, x_n) \propto \Pr(x_1 \mid +)\,\Pr(x_2 \mid +) \cdots \Pr(x_n \mid +)\,\Pr(+)$$

$$\Pr(- \mid x_1, x_2, \dots, x_n) \propto \Pr(x_1 \mid -)\,\Pr(x_2 \mid -) \cdots \Pr(x_n \mid -)\,\Pr(-)$$

→ + or −

• Bayesian Inference:
  • Expresses how a subjective assessment of likelihood should rationally change to account for evidence
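The two proportionality formulas translate into a short classifier. This sketch assumes binary (0/1) features and adds add-one (Laplace) smoothing so that no estimated probability is exactly 0 or 1; neither detail comes from the slides. The training data are invented, and log-probabilities are summed instead of multiplying raw probabilities.

```python
from math import log


def train_naive_bayes(examples, labels):
    """Estimate Pr(label) and Pr(x_j = 1 | label) from binary feature vectors."""
    n_features = len(examples[0])
    model = {}
    for label in set(labels):
        rows = [x for x, y in zip(examples, labels) if y == label]
        prior = len(rows) / len(examples)
        # Add-one smoothing (an assumption, not from the slides).
        likelihoods = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[label] = (prior, likelihoods)
    return model


def predict(model, x):
    """Return the label with the larger (log) posterior score."""
    scores = {}
    for label, (prior, likelihoods) in model.items():
        score = log(prior)
        for x_j, p_j in zip(x, likelihoods):
            score += log(p_j if x_j == 1 else 1 - p_j)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: three positive and three negative examples.
X = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]
y = ["+", "+", "+", "-", "-", "-"]

model = train_naive_bayes(X, y)
print(predict(model, (1, 1, 0)), predict(model, (0, 0, 1)))
```

The "naive" part is the assumption that features are independent given the label, which is what lets the per-feature probabilities be multiplied.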
§ Evaluating a classifier: Cross-validation
[Figure: the data set is divided into parts. In FOLD 1, FOLD 2, …, FOLD k, a different part is held out to PREDICT AND EVALUATE ON, and the remaining parts are used to TRAIN ON.]

§ Collect all evaluation results (from k “FOLD”s)
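The fold-by-fold procedure can be sketched generically. The splitting scheme (interleaved indices, no shuffling), the majority-class "classifier" and the data are all placeholders chosen to keep the example short; any training and prediction functions with the same shape could be plugged in.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    folds = [indices[i::k] for i in range(k)]   # interleaved, no shuffling
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def cross_validate(examples, labels, k, train_fn, predict_fn):
    """Train on k-1 folds, evaluate on the held-out fold, pool the results."""
    correct = 0
    for train, test in k_fold_splits(len(examples), k):
        model = train_fn([examples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_fn(model, examples[i]) == labels[i] for i in test)
    return correct / len(examples)


# Placeholder classifier: always predict the most common training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


examples = list(range(10))
labels = ["+"] * 7 + ["-"] * 3
accuracy = cross_validate(examples, labels, 5, train_majority, predict_majority)
print(accuracy)
```

Every example is used for evaluation exactly once and never by a model that saw it during training. In practice the data would usually be shuffled before splitting.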
§ Evaluating Classification Performance
[Figure source: Wikipedia]
§ The Receiver Operating Characteristic (ROC) curve
  § True positive rate vs. false positive rate
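A minimal sketch of how an ROC curve could be traced from classifier scores: lower the decision threshold one example at a time and record the false positive rate and true positive rate at each step. The scores and labels are invented, scores are assumed to be distinct (ties are not handled), and the area under the curve is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:       # lower the threshold past one example at a time
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(round(auc, 4))
```

An area of 1.0 would mean every positive is scored above every negative, while 0.5 corresponds to random ranking.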
§ What about those SNPs outside the coding regions?
§ Generally hard enough to predict within coding regions; regulatory sequences are notoriously hard to pin down
§ Interesting new approaches are emerging that use machine learning to predict the impact on TF binding strength or DNA accessibility