Lecture 5 – Oct 12, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20, Johnson Hall (JHN) 022

Statistical Methods for Quantitative Trait Loci (QTL) Mapping II

Course Announcements
- HW #1 is out
- Project proposal
  - Due next Wed
  - 1 paragraph describing what you’d like to work on for the class project.

Why are we so different?
Human genetic diversity
- Different “phenotype”: any observable characteristic or trait
  - Appearance
  - Disease susceptibility
  - Drug responses
  - …
- Different “genotype”: individual-specific DNA, a 3-billion-long string
  ……ACTGTTAGGCTGAGCTAGCCCAAAATTTATAGCGTCGACTGCAGGGTCCACCAAAGCTCGACTGCAGTCGACGACC……
- Example sequences that differ at a few positions:
  TGATCGAAGCTAAATGCATCAGCTGATGATCCTAGC…
  TGATCGTAGCTAAATGCATCAGCTGATGATCGTAGC…
  TGATCGCAGCTAAATGCAGCAGCTGATGATCGTAGC…

Motivation
Which sequence variation affects a trait?
- Better understanding of disease mechanisms
- Personalized medicine
[Figure: a person's DNA (ACTTCGGAACATATCAAATCCAACGC……, 3 billion long) serves as an instruction to the cell. Sequence variations (marked X in the figure, with fragments GTC and AG) give a different person a different instruction, which affects appearance, personality, disease susceptibility, drug responses, … Example profile for one person: Obese? 15%; Bold? 30%; Diabetes? 6.2%; Parkinson’s disease? 0.3%; Heart disease? 20.1%; Colon cancer? 6.5%; …]

QTL mapping
Data
- Phenotypes: $y_i$ = trait value for mouse $i$
- Genotypes: $x_{ik}$ = 1/0 (i.e. AB/AA) of mouse $i$ at marker $k$
- Genetic map: locations of genetic markers
Goals
- Identify the genomic regions (QTLs) contributing to variation in the phenotype.
[Figure: genotype data as a 0/1 matrix with one row per mouse individual and one column per marker (markers 1, 2, 3, 4, 5, …, 3,000; 3000 markers in total), e.g. a row 0101100100…011, shown next to the phenotype data with one value per mouse.]

Outline
Statistical methods for mapping QTL
- What is QTL?
- Experimental animals
- Analysis of variance (marker regression)
- Interval mapping (EM)
[Figure: the genotype matrix (markers 1, 2, 3, 4, 5, …, 3,000 by mouse individuals) and the phenotype data, with the question "QTL?" over the markers.]

Interval mapping [Lander and Botstein, 1989]
- Consider any one position in the genome as the location for a putative QTL.
- For a particular mouse, let $z$ = 1/0 if the (unobserved) genotype at the QTL is AB/AA.
- Calculate $P(z = 1 \mid \text{marker data})$.
  - Need only consider nearby genotyped markers.
  - May allow for the presence of genotypic errors.
- Given the genotype at the QTL, the phenotype is distributed as $N(\mu + \Delta z, \sigma^2)$.
- Given the marker data, the phenotype follows a mixture of normal distributions.
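
IM: the mixture model
- Let’s say that the mice with QTL genotype AA have average phenotype $\mu_A$ while the mice with QTL genotype AB have average phenotype $\mu_B$.
- The QTL has effect $\Delta = \mu_B - \mu_A$.
- What are the unknowns?
  - $\mu_A$ and $\mu_B$
  - Genotype of the QTL
[Figure: a chromosome segment with the nearest flanking markers M1 at position 0 and M2 at position 20, and a putative QTL at position 7. Depending on the flanking marker genotypes, the QTL genotype probabilities are 99% AB, 65% AB / 35% AA, 35% AB / 65% AA, or 99% AA.]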
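
The QTL genotype probabilities for markers at 0 and 20 and a putative QTL at 7 can be reproduced with a short calculation. This is a minimal sketch, not the lecture's own code. It assumes a backcross, positions measured in cM, the Haldane map function (no crossover interference), and no genotyping errors. The function names are illustrative.

```python
import math

def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def prob_qtl_ab(m1, m2, pos_m1, pos_qtl, pos_m2):
    """P(z = 1 | flanking markers) in a backcross; m1, m2 are 1 for AB and 0 for AA."""
    r1 = haldane_r(pos_qtl - pos_m1)  # marker 1 to QTL
    r2 = haldane_r(pos_m2 - pos_qtl)  # QTL to marker 2

    def joint(z):
        # P(QTL genotype = z, M2 genotype = m2 | M1 genotype = m1)
        p_z = (1.0 - r1) if z == m1 else r1
        p_m2 = (1.0 - r2) if m2 == z else r2
        return p_z * p_m2

    return joint(1) / (joint(0) + joint(1))

for m1, m2 in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(m1, m2, round(prob_qtl_ab(m1, m2, 0, 7, 20), 2))
```

Under these assumptions the four cases give about 0.99, 0.65, 0.35 and 0.01 for P(AB), which agrees with the percentages shown for the flanking-marker configurations.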

IM: estimation and LOD scores
- Use a version of the EM algorithm (an iterative algorithm) to obtain estimates of $\mu_A$, $\mu_B$, $\sigma$ and the expectation of $z$.
- Calculate the LOD score.
- Repeat for all other genomic positions (in practice, at 0.5 cM steps along the genome).
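
The EM step at a single putative QTL position can be sketched as follows. This is an illustrative implementation under stated assumptions, not the algorithm from any particular QTL package. It takes each mouse's prior $P(z_i = 1 \mid \text{marker data})$ as given, assumes a common $\sigma$ for both genotype groups, and starts EM from the no-QTL fit, so the likelihood can only increase from the null value. The simulated data and all names are made up for the example.

```python
import math
import random

def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mixture_loglik(ys, ps, mu_a, mu_b, sd):
    """Log-likelihood of the two-component normal mixture with known weights ps."""
    return sum(math.log(p * normal_pdf(y, mu_b, sd) + (1.0 - p) * normal_pdf(y, mu_a, sd))
               for y, p in zip(ys, ps))

def interval_mapping_lod(ys, ps, n_iter=500, tol=1e-10):
    """EM at one position. ys: phenotypes; ps: P(z_i = 1 | marker data)."""
    n = len(ys)
    mean = sum(ys) / n
    mu_a = mu_b = mean  # start at the no-QTL (null) fit
    sd = math.sqrt(sum((y - mean) ** 2 for y in ys) / n)
    ll_null = ll = mixture_loglik(ys, ps, mu_a, mu_b, sd)
    for _ in range(n_iter):
        # E-step: posterior probability that each mouse has QTL genotype AB
        ws = []
        for y, p in zip(ys, ps):
            fb = p * normal_pdf(y, mu_b, sd)
            fa = (1.0 - p) * normal_pdf(y, mu_a, sd)
            ws.append(fb / (fa + fb))
        # M-step: weighted means and pooled standard deviation
        sw = sum(ws)
        mu_b = sum(w * y for w, y in zip(ws, ys)) / sw
        mu_a = sum((1.0 - w) * y for w, y in zip(ws, ys)) / (n - sw)
        sd = math.sqrt(sum(w * (y - mu_b) ** 2 + (1.0 - w) * (y - mu_a) ** 2
                           for w, y in zip(ws, ys)) / n)
        new_ll = mixture_loglik(ys, ps, mu_a, mu_b, sd)
        converged = new_ll - ll < tol
        ll = new_ll
        if converged:
            break
    lod = (ll - ll_null) / math.log(10.0)  # log10 likelihood ratio
    return lod, mu_a, mu_b, sd

# Simulated backcross data (all values here are made up for illustration).
random.seed(1)
probs = [random.choice([0.99, 0.65, 0.35, 0.01]) for _ in range(300)]
geno = [1 if random.random() < p else 0 for p in probs]
pheno = [10.0 + 2.0 * z + random.gauss(0.0, 1.0) for z in geno]
lod, mu_a, mu_b, sd = interval_mapping_lod(pheno, probs)
print(round(lod, 1), round(mu_b - mu_a, 2))
```

A genome scan would call `interval_mapping_lod` once per position, each time with the priors computed from that position's flanking markers.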

A simulated example
[Figure: LOD score curves along the genome, with the genetic markers indicated.]

Interval mapping
Advantages
- Takes proper account of missing data
- Can allow for the presence of genotypic errors
- Pretty pictures
- High power in low-density scans
- Improved estimate of QTL location
Disadvantages
- Greater computational effort (doing EM for each position)
- Requires specialized software
- More difficult to include covariates
- Only considers one QTL at a time

Statistical significance
- Large LOD score → evidence for QTL
- Question: How large is large?
- Answer 1: Consider the distribution of the LOD score if there were no QTL.
- Answer 2: Consider the distribution of the maximum LOD score.
LOD score: $\mathrm{LOD} = \log_{10} \frac{P(D \mid \text{QTL at the position})}{P(D \mid \text{no QTL})}$
Null hypothesis: assuming that there are no QTLs segregating in the population.
[Figure: null distribution of the LOD scores at a particular genomic position (solid curve) and of the maximum LOD score from a genome scan (dashed curve). At a particular genomic position there is only a ~3% chance of getting a LOD score ≥ 1.]

LOD thresholds
- To account for the genome-wide search, compare the observed LOD scores to the null distribution of the maximum LOD score, genome-wide, that would be obtained if there were no QTL anywhere.
- LOD threshold = 95th percentile of the distribution of the genome-wide maximum LOD score when there are no QTL anywhere.
- Methods for obtaining thresholds
  - Analytical calculations (assuming a dense map of markers) (Lander & Botstein, 1989)
  - Computer simulations
  - Permutation/randomization test (Churchill & Doerge, 1994)
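
The permutation approach to a genome-wide threshold can be sketched as below. To keep the example short it uses single-marker regression LOD scores, $\mathrm{LOD} = \frac{n}{2}\log_{10}(\mathrm{RSS}_0/\mathrm{RSS}_1)$, in place of interval-mapping LOD scores. That is a simplification for illustration. The function names and the layout of `markers` (one list of 0/1 genotypes per marker) are assumptions of this sketch.

```python
import math
import random

def marker_lod(ys, xs):
    """LOD score from regressing phenotypes ys on a single 0/1 marker xs."""
    n = len(ys)
    my = sum(ys) / n
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    rss0 = sum((y - my) ** 2 for y in ys)   # no-QTL model
    rss1 = rss0 - sxy * sxy / sxx           # assumes the marker is polymorphic (sxx > 0)
    return (n / 2.0) * math.log10(rss0 / rss1)

def max_lod(ys, markers):
    """Genome-wide maximum LOD over all markers."""
    return max(marker_lod(ys, xs) for xs in markers)

def permutation_threshold(ys, markers, n_perm=1000, alpha=0.05, seed=0):
    """(1 - alpha) quantile of the maximum LOD when phenotypes are shuffled."""
    rng = random.Random(seed)
    shuffled = list(ys)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        null_max.append(max_lod(shuffled, markers))
    null_max.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return null_max[index]
```

Shuffling the phenotypes keeps the marker data, including its correlation structure and missing-data pattern, exactly as observed. This is why the resulting threshold adapts to the number of markers and the genome size.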

More on LOD thresholds
Appropriate threshold depends on:
- Size of genome
- Number of typed markers
- Pattern of missing data
- Stringency of significance threshold
- Type of cross (e.g. F2 intercross vs backcross)
- Etc.

An example
[Figure: permutation distribution for a trait.]

Modeling multiple QTLs
Advantages
- Reduce the residual variation (trait variation that is not explained by a detected putative QTL) and obtain greater power to detect additional QTLs.
- Identification of (epistatic) interactions between QTLs requires the joint modeling of multiple QTLs.
Interactions between two loci
[Figure: two cases.]
- No interaction: the effect of QTL1 is the same, irrespective of the genotype of QTL2, and vice versa.
- Interaction: the effect of QTL1 depends on the genotype of QTL2, and vice versa.

Multiple marker model
- Let $y$ = phenotype, $x$ = genotype data.
- Imagine a small number of QTL with genotypes $x_1, \dots, x_p$
  - $2^p$ or $3^p$ distinct genotypes for backcross and intercross, respectively
- We assume that $E(y \mid x) = \mu(x_1, \dots, x_p)$ and $\mathrm{var}(y \mid x) = \sigma^2(x_1, \dots, x_p)$
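
Multiple marker model
- Constant variance: $\sigma^2(x_1, \dots, x_p) = \sigma^2$
- Assuming normality: $y \mid x \sim N(\mu_g, \sigma^2)$, with $\mu_g = \mu(x_1, \dots, x_p)$ for joint genotype $g$
- Additivity: $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j$
- Epistasis: $\mu(x_1, \dots, x_p) = \mu + \sum_j \Delta_j x_j + \sum_{j,k} w_{j,k} x_j x_k$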
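
A small numerical sketch may help contrast the additive and epistatic mean functions. The parameter values are made up. The interaction sum is taken over each pair of loci once, which is an assumption of this sketch about how the double sum is indexed.

```python
def mean_phenotype(x, mu, delta, w=None):
    """mu(x_1..x_p) = mu + sum_j delta_j x_j + sum over pairs (j, k) of w[(j, k)] x_j x_k."""
    total = mu + sum(d * xj for d, xj in zip(delta, x))
    for (j, k), w_jk in (w or {}).items():
        total += w_jk * x[j] * x[k]
    return total

def effect_of_qtl1(x2, mu, delta, w=None):
    """Change in mean phenotype when QTL1 goes from 0 to 1, at a fixed QTL2 genotype."""
    return mean_phenotype((1, x2), mu, delta, w) - mean_phenotype((0, x2), mu, delta, w)

mu, delta = 10.0, (2.0, 1.0)       # hypothetical baseline and main effects
interaction = {(0, 1): 1.5}        # hypothetical epistatic term w_{1,2}

additive_effects = [effect_of_qtl1(x2, mu, delta) for x2 in (0, 1)]
epistatic_effects = [effect_of_qtl1(x2, mu, delta, interaction) for x2 in (0, 1)]
print(additive_effects)    # same effect of QTL1 at both QTL2 genotypes
print(epistatic_effects)   # effect of QTL1 depends on the QTL2 genotype
```

With these values the additive model gives an effect of 2.0 for QTL1 at either QTL2 genotype. With the interaction term the effect is 2.0 when QTL2 is 0 and 3.5 when QTL2 is 1.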

Computational problem
- $N$ backcross individuals, $M$ markers in all, with at most a handful expected to be near QTL
- $x_{ij}$ = genotype (0/1) of mouse $i$ at marker $j$
- $y_i$ = phenotype (trait value) of mouse $i$
- Assuming additivity, $y_i = \mu + \sum_j \Delta_j x_{ij} + e_i$. Which $\Delta_j \neq 0$?
- Variable selection in linear regression models
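
Mapping QTL as model selection
Select the class of models
- Additive models
- Additive with pairwise interactions
- Regression trees
[Figure: linear model in which inputs $x_1, x_2, \dots, x_N$ are connected to the phenotype ($y$) through weights $w_1, w_2, \dots, w_N$.]
$y = w_1 x_1 + \dots + w_N x_N + \varepsilon$
$\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2$ ?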
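
Linear Regression
$\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + \text{model complexity}$
Search model space
- Forward selection (FS)
- Backward deletion (BE)
- FS followed by BE
[Figure: the linear model $y = w_1 x_1 + \dots + w_N x_N + \varepsilon$, with inputs $x_1, x_2, \dots, x_N$, phenotype ($y$) and parameters $w_1, w_2, \dots, w_N$.]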
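
Forward selection with a complexity penalty can be sketched as below. The slide leaves the "model complexity" term unspecified, so this sketch assumes a BIC-style penalty, $n \log(\mathrm{RSS}/n) + k \log n$. That choice, the helper names, and the layout of `markers` (one list of genotypes per marker) are assumptions of the example. The least-squares fit solves the normal equations directly and assumes the selected markers are not collinear.

```python
import math
import random

def rss_of_fit(ys, columns):
    """Residual sum of squares of least squares on an intercept plus the given columns."""
    n = len(ys)
    cols = [[1.0] * n] + [list(c) for c in columns]
    k = len(cols)
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(u * y for u, y in zip(cols[i], ys))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [ar - f * ai for ar, ai in zip(a[r], a[i])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b * col[t] for b, col in zip(beta, cols)) for t in range(n)]
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))

def bic(ys, markers, selected):
    """Fit term plus complexity penalty (BIC-style; an assumed choice)."""
    n = len(ys)
    rss = rss_of_fit(ys, [markers[j] for j in selected])
    return n * math.log(rss / n) + (len(selected) + 1) * math.log(n)

def forward_selection(ys, markers, max_terms=10):
    """Greedily add the marker that lowers the criterion most; stop when none does."""
    selected = []
    best = bic(ys, markers, selected)
    while len(selected) < max_terms:
        scores = {j: bic(ys, markers, selected + [j])
                  for j in range(len(markers)) if j not in selected}
        if not scores:
            break
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        best = scores[j_best]
        selected.append(j_best)
    return selected
```

Backward deletion would run the same criterion in the opposite direction, starting from a larger model and dropping the marker whose removal lowers the criterion most.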
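
Lasso* (L1) Regression
$\text{minimize}_w \; (w_1 x_1 + \dots + w_N x_N - y)^2 + C \sum_i |w_i|$
- Induces sparsity in the solution $w$ (many $w_i$'s set to zero)
  - Provably selects “right” features when many features are irrelevant
- Convex optimization problem
  - No combinatorial search
  - Unique global optimum
  - Efficient optimization
[Figure: the linear model with inputs $x_1, x_2, \dots, x_N$, phenotype ($y$) and parameters $w_1, w_2, \dots, w_N$; the L1 term of the objective is highlighted, and the L2 and L1 penalties are compared in a two-dimensional plot.]
* Tibshirani, 1996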
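
The lasso objective above can be minimized by cyclic coordinate descent, one of several possible solvers and not necessarily the one intended in the lecture. The sketch below assumes the objective $\sum_i (y_i - \sum_j w_j x_{ij})^2 + C \sum_j |w_j|$ with no intercept, so phenotypes and marker columns should be centered first. All names are illustrative.

```python
def center(values):
    """Subtract the mean so that no intercept term is needed."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def soft_threshold(a, t):
    """Shrink a toward zero by t; values within [-t, t] become exactly zero."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def lasso_coordinate_descent(columns, ys, C, n_sweeps=100):
    """Minimize sum_i (y_i - sum_j w_j x_ij)^2 + C * sum_j |w_j| over w."""
    w = [0.0] * len(columns)
    resid = list(ys)  # residual y - Xw for the current w
    col_sq = [sum(x * x for x in col) for col in columns]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            if col_sq[j] == 0.0:
                continue  # constant (centered) column carries no information
            # Inner product of column j with the residual that excludes feature j
            rho = sum(x * (r + x * w[j]) for x, r in zip(col, resid))
            new_w = soft_threshold(rho, C / 2.0) / col_sq[j]
            if new_w != w[j]:
                diff = new_w - w[j]
                resid = [r - x * diff for x, r in zip(col, resid)]
                w[j] = new_w
    return w
```

The soft-threshold step is what sets many weights exactly to zero. Larger values of `C` zero out more markers, at the cost of shrinking the surviving weights toward zero.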

Model selection
Compare models
- Likelihood function + model complexity (e.g. # QTLs)
- Cross-validation test
- Sequential permutation tests
Assess performance
- Maximize the number of QTL found
- Control the false positive rate

Outline
Basic concepts
- Haplotype, haplotype frequency
- Recombination rate
- Linkage disequilibrium
Haplotype reconstruction
- Parsimony-based approach
- EM-based approach

Review: genetic variation
- Single nucleotide polymorphism (SNP)
  - Each variant is called an allele; each allele has a frequency
- Hardy-Weinberg equilibrium (HWE)
  - Relationship between allele and genotype frequencies
- How about the relationship between alleles of neighboring SNPs?
  - We need to know about linkage (dis)equilibrium

Let’s consider the history of two neighboring alleles…
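
History of two neighboring alleles
Alleles that exist today arose through ancient mutation events…
[Figure: a chromosome before and after a mutation. Before the mutation: allele A. After the mutation: alleles A and C.]

History of two neighboring alleles
One allele arose first, and then the other…
[Figure: chromosomes before and after a second mutation at the neighboring site. The first site carries A or C; the neighboring site carries G, and the mutation introduces C there.]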

Recombination can create more haplotypes
Haplotype: combination of alleles present in a chromosome
[Figure: a pair of chromosomes with no recombination (or $2n$ recombination events) versus with recombination.]
[Figure: haplotypes present without recombination and with recombination; recombination adds a recombinant haplotype.]

Haplotype
- A combination of alleles present in a chromosome
- Each haplotype has a frequency, which is the proportion of chromosomes of that type in the population
- Consider $N$ binary SNPs in a genomic region
  - There are $2^N$ possible haplotypes
  - But in fact, far fewer are seen in the human population

More on haplotype
- What determines haplotype frequencies?
  - Recombination rate ($r$) between neighboring alleles
    - Depends on the population
    - $r$ is different for different regions in the genome
- Linkage disequilibrium (LD)
  - Non-random association of alleles at two or more loci, not necessarily on the same chromosome.
- Why do we care about haplotypes or LD?
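
To make "non-random association" concrete, two standard LD summaries for a pair of biallelic loci are $D = p_{12} - p_1 p_2$ and $r^2 = D^2 / \big(p_1 (1 - p_1)\, p_2 (1 - p_2)\big)$, where $p_{12}$ is the frequency of the haplotype carrying both chosen alleles. The sketch below computes them from a list of two-letter haplotypes. The haplotype counts are made up for illustration, and the function name is an assumption.

```python
def ld_from_haplotypes(haplotypes, allele1, allele2):
    """D and r^2 for allele1 at the first site and allele2 at the second site."""
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n
    p2 = sum(h[1] == allele2 for h in haplotypes) / n
    p12 = sum(h[0] == allele1 and h[1] == allele2 for h in haplotypes) / n
    d = p12 - p1 * p2  # zero when the two alleles are independent
    r2 = d * d / (p1 * (1.0 - p1) * p2 * (1.0 - p2))  # assumes both sites are polymorphic
    return d, r2

# Hypothetical sample of 10 chromosomes with three haplotypes and no recombinant.
sample = ["AG"] * 5 + ["CG"] * 3 + ["CC"] * 2
d, r2 = ld_from_haplotypes(sample, "C", "C")
print(d, r2)

# All four haplotypes at equal frequency: no association between the sites.
d_eq, r2_eq = ld_from_haplotypes(["AG", "AC", "CG", "CC"], "C", "C")
print(d_eq, r2_eq)
```

In the first sample the C allele at the second site is found only on chromosomes carrying C at the first site, so $D$ is positive (about 0.1, with $r^2$ about 0.25). In the balanced sample both measures are zero.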

References
- Prof. Goncalo Abecasis (Univ. of Michigan)’s lecture note
- Broman, K.W. Review of statistical methods for QTL mapping in experimental crosses.
- Doerge, R.W., et al. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Stat. Sci. 12:195-219, 1997.
- Lynch, M. and Walsh, B. Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA, pp. 431-89, 1998.
- Broman, K.W., Speed, T.P. A review of methods for identifying QTLs in experimental crosses, 1999.