association mapping for mendelian, and complex disorders january 16bafna, bfb
TRANSCRIPT
Association mapping for mendelian, and complex disorders
Apr 21, 2023 Bafna, BfB
UG Bioinformatics specialization at UCSD
Apr 21, 2023 Bafna, BfB
Abstraction of a causal mutation
Apr 21, 2023 Bafna, BfB
Looking for the mutation in populations
Apr 21, 2023 Bafna, BfB
A possible strategy is to collect cases (affected) and control individuals, and look for a mutation that consistently separates the two classes. Next, identify the gene.
Looking for the causal mutation in populations
Apr 21, 2023 Bafna, BfB
Case
Control
Problem 1: many unrelated common mutations, around one every 1000bp
Case
Control
Apr 21, 2023 Bafna, BfB
Looking for the causal mutation in populations
Apr 21, 2023 Bafna, BfB
Case
Control
Problem 2: We may not sample the causal mutation.
How to hunt for disease genes
• We are guided by two simple facts governing these mutations1. Nearby mutations are correlated2. Distal mutations are not
Apr 21, 2023 Bafna, BfB
Case
Control
This lecture
1. The bottom line: How do these facts help in finding disease genes?
2. The genetics: why should this happen?3. The computation4. Challenge of complex diseases.
Apr 21, 2023 Bafna, BfB
Case
Control
1. Nearby mutations are correlated2. Distal mutations are not
The basics of association mapping
• Sample a population of individuals at variant locations across the genome. Typically, these variants are single nucleotide polymorphisms (SNPs).
• Create a new bi-allelic variant corresponding to cases and controls, and test for correlations.
• By our assumptions, only the proximal variants will be correlated.
• Investigate genes near the correlated variants.
Apr 21, 2023 Bafna, BfB
Case
Control
00001
11
1
So, why should the proximal SNPs be correlated, and distal SNPs not?
Apr 21, 2023 Bafna, BfB
A bit of evolution
• Consider a fixed population (of chromosomes) evolving in time.
• Each individual arises from a unique, randomly chosen parent from the previous generation
Apr 21, 2023 Bafna, BfB
Time
Genealogy of a chromosomal population
Current (extant) population
Time
Apr 21, 2023 Bafna, BfB
Adding mutations
Apr 21, 2023 Bafna, BfB
Infinite sites assumption: A mutation occurs at most once at a site.
SNPs
Apr 21, 2023 Bafna, BfB
The collection of acquired mutations in the extant population describe the SNPs
Fixation and elimination
• Not all mutations survive.• Some mutations get fixed, and are no
longer polymorphic
Apr 21, 2023 Bafna, BfB
Removing extinct genealogies
Apr 21, 2023 Bafna, BfB
Removing fixed mutations
Apr 21, 2023 Bafna, BfB
The coalescent
Apr 21, 2023 Bafna, BfB
Disease mutation
Apr 21, 2023 Bafna, BfB
• We drop the ancestral chromosomes, and place the mutations on the internal branches.
Disease mutation
• A causal mutation creates a clade of affected descendants.
Apr 21, 2023 Bafna, BfB
Disease mutation
• Note that the tree (genealogy) is hidden. • However, the underlying tree topology
introduces a correlation between each pair of SNPs
Apr 21, 2023 Bafna, BfB
What have we learnt?
• The underlying genealogy creates a correlation between SNPs.
• By itself, this is not sufficient, because distal SNPs might also be correlated.
• Fortunately, for us the correlation between distal SNPs is quickly destroyed.
Apr 21, 2023 Bafna, BfB
Recombination
Apr 21, 2023 Bafna, BfB
Recombination
• In our idealized model, we assume that each individual chromosome chooses two parental chromosomes from the previous generation
Apr 21, 2023 Bafna, BfB
Multiple recombination change the local genealogy
Apr 21, 2023 Bafna, BfB
A bit of evolution
• Proximal SNPs are correlated, distal SNPs are not.• The correlation (Linkage disequilibirium) decays
rapidly after 20-50kb
Apr 21, 2023 Bafna, BfB
BASIC STATISTICS
Apr 21, 2023 Bafna, BfB
Testing for correlation
• In the absence of correlation
Apr 21, 2023 Bafna, BfB
Testing for correlation
• When correlated
Apr 21, 2023 Bafna, BfB
Assigning confidence
2 2
2 2
Apr 21, 2023 Bafna, BfB
X
X 4 0
0 4
X
X
Expected Observed
Assigning confidence
2 2
2 2
Apr 21, 2023 Bafna, BfB
X
X 4 0
0 4
X
X
Expected Observed
Assigning confidence
2.5 2.5
1.5 1.5
Apr 21, 2023 Bafna, BfB
X
3 2
1 2
XExpected Observed
X X
STATISTICAL TESTS OF ASSOCIATION
Apr 21, 2023 Bafna, BfB
Tests for association: Pearson
• Case-control phenotype:– Build a 3X2 contingency table– Pearson test (2df)=
Cases Controls
mm
Mm
MM O1 O2
O3 O4
O6O5
Apr 21, 2023 Bafna, BfB
The χ2 test
Cases Controls
mm
Mm
MM O1
O5
O3 O4
O2
O6
• The statistic behaves like a χ2 distribution.
• A p-value can be computed directly
Apr 21, 2023 Bafna, BfB
Χ2 distribution properties
A related distribution is the F-distribution
Apr 21, 2023 Bafna, BfB
Likelihood ratio
• Another way to check the extremeness of the distribution is by computing a (log) likelihood ratio.
• We have two competing hypothesis. Let N be the total number of observations
Apr 21, 2023 Bafna, BfB
LLR
• An LLR value close to 0, implies that the null hypothesis is true. Asymptotically, the LLR statistic also follows the chi-square distribution.
Apr 21, 2023 Bafna, BfB
Exact test
• The chi-square test does not work so well when the numbers are small.
• How can we compute an exact probability of seeing a specific distribution of values in the cells?
• Remember: we know the marginals (# cases, # controls,
Apr 21, 2023 Bafna, BfB
Fischer exact test
Cases Controls
mm
Mm
MM a
e
c d
b
f
• Num: #ways of getting configuration (a,b,c,d,e,f)
• Den: #ways of ensuring that the row sums and column sums are fixed
Apr 21, 2023 Bafna, BfB
Fischer exact test
• Remember that the probability of seeing any specific values in the cells is going to be small.
• To get a p-value, we must sum over all similarly extreme values. How?
Apr 21, 2023 Bafna, BfB
Test for association: Fisher exact test
• Here P is the probability of seeing the exact count.• The actual significance is computed by summing over
all such tables that are at least this extreme.
Cases Controls
mm
Mm
MM a
e
c d
b
f
Apr 21, 2023 Bafna, BfB
Test for association: Fisher exact test
Cases Controls
mm
Mm
MM a
e
c d
b
f
Apr 21, 2023 Bafna, BfB
Continuous outcomes
• Instead of discrete (Case/control) data, we have real-valued phenotypes– Ex: Diastolic Blood Pressure
• In this case, how do we test for association
Apr 21, 2023 Bafna, BfB
Continuous outcome ANOVA
• Often, the phenotypes are not offered as case-controls but like a continuous variable– Ex: blood-pressure measurements
• Question: Are the mean values of the two groups significantly different?
MM mm
Apr 21, 2023 Bafna, BfB
Two-sided t-test
• For two categories, ANOVA is also known as the t-test
• Assume that the variables from the two sets are drawn from Normal distributions– Different means, equal variances
• Null hypothesis is that they are both from the same distribution
Apr 21, 2023 Bafna, BfB
t-test continued
Apr 21, 2023 Bafna, BfB
Two-sample t-test
• As the variance is not known, we use an estimate S, defined by
• The T-statistic is given by
• Significant deviations from 0 are used to reject the Null hypothesisApr 21, 2023 Bafna, BfB
Two-sample t-test (unequal variances)
• If the variances cannot be assumed to be equal, we use
• The T-statistic is given by
mS
nS
XX22
21
21 =T
• Significant deviations from 0 are used to reject the Null hypothesisApr 21, 2023 Bafna, BfB
CONFOUNDING ASSOCIATION
Apr 21, 2023 Bafna, BfB
Confounding association
• Association tests can be confounded in many ways.
• We will explore a few of these, at a high level, and point to a few algorithmic problems.
Apr 21, 2023 Bafna, BfB
Confounding association with population substructure
Apr 21, 2023 Bafna, BfB
If the cases and controls are from different subpopulations, then sites with differing allele frequencies will confound association
The algorithmic problem
• Given a collection of individual genotypes, separate them into sub-populations.
• Idea: take markers that are very far apart so that no LD is possible.
• LD indicates structure.• Problem: Partition individuals into sub-populations so
that all correlation across pairs of distant markers is minimized. Penalty for increasing sub-populations?
Apr 21, 2023 Bafna, BfB
Confounding associations with genotypes
Apr 21, 2023 Bafna, BfB
A recombination event
Distinct haplotypes can create identical genotypes confounding association
Confounding association with interactions
• Individually, the markers do not correlate.• Together, they perfectly predict genes.• Find interacting partners that associate
with genes
Apr 21, 2023 Bafna, BfB
Confounding association with rare variants
• Not only can we have multiple interacting SNPs, each SNP individually occurs with very low frequency (< 1%).
• Can you detect associations with rare variants?
Apr 21, 2023 Bafna, BfB
Other problems
Apr 21, 2023 Bafna, BfB
• Can we reconstruct the phylogeny?• Useful for computing recombination
bounds.
Conclusion
• As individual genomes are sequenced, the association of variations with phenotypes presents many confounding challenges.
• Some of these challenges can be modeled as algorithmic problems.
• Population genetics should be part of a bioinformatics undergraduate curriculum.
Apr 21, 2023 Bafna, BfB
Thank you
• Homework (due Monday, March 15)– Describe an algorithm to detect associations
of interacting, rare-variants with a complex disease phenotype, in the presence of population substructure in the case-control population.
Apr 21, 2023 Bafna, BfB