Computational Genetics Winter 2013 Lecture 9
Eleazar Eskin University of California, Los Angeles
Ancestry Inference
Lecture 9. February 13th, 2013
(Slides from Eran Halperin)
Ancestry and ALL relapse
- Native American ancestry correlates with relapse of acute lymphoblastic leukemia (ALL).
- Children with more than 10% Native American ancestry have a considerably higher chance of relapse.
Yang et al., Nature Genetics, 2011
Novembre et al., Nature, 2008
Learn about your ancestry
What is a population?
- Defined by frequencies of mutations.

            SNP A   SNP B   SNP C
European    .3      .1      .4
African     .2      .2      .3
Asian       .2      .1      .5
Modeling Mutation Frequency
[Figure: mutation frequency (axis ticks .2, .5, .8) plotted for Spain, Germany, and Russia; unknown locations are marked "?"]
Likelihood from Population
- SNP has alleles A, G.
- Population frequencies: Spain: P(A) = 0.2, Germany: P(A) = 0.5, Russia: P(A) = 0.8
- Genotype likelihoods:

        Spain          Germany      Russia
AA      (0.2)^2        (0.5)^2      (0.8)^2
AG      2(0.2)(0.8)    2(0.5)^2     2(0.8)(0.2)
GG      (0.8)^2        (0.5)^2      (0.2)^2

With g = number of A alleles in the genotype and p = P(A):
Likelihood: $p^{g} (1-p)^{2-g}$ (up to the constant factor $\binom{2}{g}$, which does not depend on p)
Log likelihood: $g \ln(p) + (2-g) \ln(1-p)$
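The table and log likelihood above can be checked numerically. The sketch below assumes Hardy-Weinberg proportions (which the table implicitly uses); the function names are illustrative and not from the lecture.

```python
import math

# Frequencies of allele A, taken from the slide.
FREQ_A = {"Spain": 0.2, "Germany": 0.5, "Russia": 0.8}

def genotype_likelihood(g, p):
    """P(genotype with g copies of A | allele frequency p), assuming Hardy-Weinberg."""
    return math.comb(2, g) * p ** g * (1 - p) ** (2 - g)

def log_likelihood(g, p):
    """g ln(p) + (2 - g) ln(1 - p); the constant binomial factor is dropped."""
    return g * math.log(p) + (2 - g) * math.log(1 - p)

def most_likely_population(g):
    """Population whose frequency gives genotype g the highest log likelihood."""
    return max(FREQ_A, key=lambda pop: log_likelihood(g, FREQ_A[pop]))
```

For a single SNP, an AA individual is assigned to Russia, an AG individual to Germany, and a GG individual to Spain; with many SNPs the log likelihoods are summed.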
Spatial Mutation Frequency
[Figure: mutation frequency (axis ticks .2, .5, .8) plotted for Spain, Germany, and Russia, with France added as a location]
2D Mutation Frequency Functions
- Mutation frequency function over a map:
  $f(x) = \frac{1}{1 + \exp(-a^{T} x - b)}$
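A minimal sketch of this logistic frequency function for a 2D position. The slope `a` and offset `b` below are made-up values for illustration only.

```python
import math

def frequency(x, a, b):
    """f(x) = 1 / (1 + exp(-a^T x - b)) for a position x on the map."""
    z = sum(ai * xi for ai, xi in zip(a, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: the frequency rises from west to east
# and does not depend on the north-south coordinate.
a = (2.0, 0.0)
b = -1.0
```

With these values the frequency is 0.5 along the line where the first coordinate equals 0.5, and it increases toward the east.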
Fitting mutation frequency functions
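- Fitting frequency functions
  - Log likelihood, with $g_{ij}$ the genotype of individual i at SNP j and $x_i$ the position of individual i:
    $\sum_{i} \sum_{j} \left[ g_{ij} \ln f_j(x_i) + (2 - g_{ij}) \ln(1 - f_j(x_i)) \right]$
  - Fitting the mutation frequency function of SNP j:
    $\max_{a_j, b_j} \sum_{i} \left[ g_{ij} \ln f_j(x_i) + (2 - g_{ij}) \ln(1 - f_j(x_i)) \right]$, where $f_j(x) = \frac{1}{1 + \exp(-a_j^{T} x - b_j)}$
  - Can be solved with convex optimization (equivalently, minimize the negative log likelihood, which is convex in $a_j, b_j$)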
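As a rough illustration of the fitting step, the sketch below maximizes the per-SNP log likelihood with plain gradient ascent. This is a stand-in for whatever convex solver is actually used; the step size and iteration count are arbitrary assumptions. The gradient uses the fact that the derivative of $g \ln f + (2-g) \ln(1-f)$ with respect to $a^{T}x + b$ is $g - 2f$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_snp(positions, genotypes, steps=2000, lr=0.5):
    """Fit slope a and offset b of one SNP's frequency function by gradient ascent.

    positions: list of coordinate tuples x_i; genotypes: list of g_i in {0, 1, 2}.
    """
    dim = len(positions[0])
    n = len(positions)
    a, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_a, grad_b = [0.0] * dim, 0.0
        for x, g in zip(positions, genotypes):
            f = sigmoid(sum(ak * xk for ak, xk in zip(a, x)) + b)
            resid = g - 2.0 * f  # derivative of the log likelihood w.r.t. a^T x + b
            for k in range(dim):
                grad_a[k] += resid * x[k]
            grad_b += resid
        # Averaged gradient step (ascent, since we maximize the log likelihood).
        a = [ak + lr * gk / n for ak, gk in zip(a, grad_a)]
        b += lr * grad_b / n
    return a, b
```

On a toy data set with individuals at only two locations, the fitted frequencies at those locations match the observed allele frequencies there.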
Placing Individuals
- Given known frequency functions
- Given individual genotypes
- Maximum likelihood placement:
  $\max_{x_i} \sum_{j} \left[ g_{ij} \ln f_j(x_i) + (2 - g_{ij}) \ln(1 - f_j(x_i)) \right]$
  - Convex optimization (equivalently, minimize the negative log likelihood)
[Figure: frequency surfaces over the unit square for SNP A, SNP B, and SNP C]
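The placement step can be sketched the same way: hold the frequency functions fixed and move the individual's position uphill on the log likelihood. Gradient ascent, the starting point, and the step size are assumptions of this sketch, not the lecture's solver.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def place_individual(genotype, slopes, offsets, steps=2000, lr=0.1):
    """Maximum likelihood position for one individual, given per-SNP (a_j, b_j).

    genotype: list of g_j in {0, 1, 2}; slopes: list of a_j tuples; offsets: list of b_j.
    """
    dim = len(slopes[0])
    x = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for g, a, b in zip(genotype, slopes, offsets):
            f = sigmoid(sum(ak * xk for ak, xk in zip(a, x)) + b)
            for k in range(dim):
                grad[k] += (g - 2.0 * f) * a[k]  # d/dx of g ln f + (2 - g) ln(1 - f)
        x = [xk + lr * gk for xk, gk in zip(x, grad)]
    return x
```

With two hypothetical SNPs whose frequencies vary along different axes, a heterozygote at both is placed where both frequencies equal 0.5. In practice many SNPs are combined, which keeps the estimate from running off to infinity for homozygous genotypes.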
Chicken or egg….
[Figure: frequency surfaces over the unit square for SNP A, SNP B, and SNP C]
Localization through a probabilistic model
- The allele frequency at SNP j for an individual at position (x,y) is given by f_j(x,y).
- To find (x,y) for each individual and f_j for each SNP j, one can optimize the likelihood by alternating between two steps:
  - Maximize the likelihood of the positions given the slope functions.
  - Maximize the likelihood of the slope functions given the positions.
- Can be used to localize mixed individuals.
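The two alternating steps can be sketched in one dimension as follows. This is an illustrative block-wise gradient ascent under assumed step sizes and starting values, not the SPA implementation; each round nudges the positions with the slope functions fixed, then nudges the slope functions with the positions fixed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(G, x, a, b):
    """Sum over individuals i and SNPs j of g ln f + (2 - g) ln(1 - f)."""
    total = 0.0
    for i, row in enumerate(G):
        for j, g in enumerate(row):
            f = sigmoid(a[j] * x[i] + b[j])
            total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total

def alternate(G, x, a, b, rounds=100, lr=0.02):
    """G[i][j]: genotype of individual i at SNP j; x: 1D positions; a, b: per-SNP parameters."""
    x, a, b = list(x), list(a), list(b)  # work on copies
    n, m = len(G), len(G[0])
    for _ in range(rounds):
        # Step 1: improve the positions given the slope functions.
        for i in range(n):
            grad = sum((G[i][j] - 2.0 * sigmoid(a[j] * x[i] + b[j])) * a[j]
                       for j in range(m))
            x[i] += lr * grad
        # Step 2: improve the slope functions given the positions.
        for j in range(m):
            resid = [G[i][j] - 2.0 * sigmoid(a[j] * x[i] + b[j]) for i in range(n)]
            a[j] += lr * sum(r * x[i] for i, r in enumerate(resid))
            b[j] += lr * sum(resid)
    return x, a, b
```

The recovered coordinates are only determined up to transformations that leave all $a_j x_i + b_j$ unchanged (for example rescaling x and a together), so results from such a procedure are usually aligned to a map afterwards.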
POPRES Data
- 3,192 individuals
- 500,568 SNPs
- Each individual has 4 grandparents from the same country
- [Novembre et al., 2008]
  - PCA method
SPatial Ancestry Analysis (SPA)
- 3,192 individuals
- Each with ancestry from a single country in Europe
Globe Mapping
- Frequency functions defined on a sphere
Genomes and World Geography
Human Genome Diversity Panel: Africa, Europe, Middle East, Central South Asia, East Asia, Oceania, America
Detecting Selection
- Sharp allele frequency changes may indicate selection:
  $f(x) = \frac{1}{1 + \exp(-a^{T} x - b)}$
Possible Selection
[Figure: LCT gene region on chromosome 2]
Genes under selection
[Figure: a typical gene compared with the LCT gene]
Most extreme frequency changes
Principal Component Analysis
- Each individual can be thought of as a vector in {0,1,2}^n, where n is the number of SNPs.
  - We get a matrix G of genotypes.
- PCA searches for the n-dimensional direction v such that the projection of the genotypes onto v has maximum variance.
PCA Overview
Principal Component Analysis
Plotting the data on a one dimensional line for which the variance is maximized.
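A small sketch of finding that maximum-variance direction. It uses power iteration on the centered genotype matrix, which is one simple way to get the first principal component; real analyses typically use an SVD routine from a numerical library and often normalize each SNP. The starting vector and iteration count are arbitrary choices.

```python
import math

def top_principal_component(G, iters=200):
    """Return (v, scores): the first PC direction and each individual's projection.

    G: list of genotype rows, one per individual, with entries in {0, 1, 2}.
    """
    n_ind, n_snp = len(G), len(G[0])
    means = [sum(row[j] for row in G) / n_ind for j in range(n_snp)]
    X = [[row[j] - means[j] for j in range(n_snp)] for row in G]  # center each SNP
    v = [float(j + 1) for j in range(n_snp)]  # arbitrary non-uniform start
    for _ in range(iters):
        proj = [sum(xj * vj for xj, vj in zip(row, v)) for row in X]          # X v
        w = [sum(X[i][j] * proj[i] for i in range(n_ind)) for j in range(n_snp)]  # X^T X v
        norm = math.sqrt(sum(wj * wj for wj in w))
        v = [wj / norm for wj in w]
    scores = [sum(xj * vj for xj, vj in zip(row, v)) for row in X]
    return v, scores
```

On a toy matrix with two groups of individuals that differ at most SNPs, the scores of the two groups fall on opposite sides of zero. The sign of v is arbitrary.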
PCA Mapping
HapMap PCA 1-2
[Figure: HapMap samples plotted on principal components 1 and 2. Legend:
- African ancestry in Southwest USA
- Utah residents with Northern and Western European ancestry
- Han Chinese in Beijing, China
- Chinese in Metropolitan Denver, Colorado
- Gujarati Indians in Houston, Texas
- Japanese in Tokyo, Japan
- Luhya in Webuye, Kenya
- Mexican ancestry in Los Angeles, California
- Maasai in Kinyawa, Kenya
- Toscans in Italy
- Yoruba in Ibadan, Nigeria (West Africa)]
HapMap PCA 1-3
[Figure: HapMap samples plotted on principal components 1 and 3. Legend: the eleven HapMap populations, from African ancestry in Southwest USA to Yoruba in Ibadan, Nigeria (West Africa)]
HapMap PCA 1,2,4
[Figure: HapMap samples plotted on principal components 1, 2, and 4. Legend: the eleven HapMap populations, from African ancestry in Southwest USA to Yoruba in Ibadan, Nigeria (West Africa)]
PCA localization of mixed individuals
[Figure: percent racial admixture (0% to 100%) for individual subjects 1-90 in the Puerto Rican population (GALA study, E. Burchard); components: European, Native American, African]
Recently Admixed Populations
Ancestral diversity
[Figure: ancestry proportions (0.0 to 1.0; Native Am, YRI, CEU) in two panels: GALA Mexicans (Founders) and GALA Puerto Ricans (Founders)]
Differences in ancestry across populations can reveal historical patterns.
Recently Admixed Populations
After generation 1
Recently Admixed Populations
After generation 10
Admixture Mapping
Admixture mapping: finding regions with disproportionate local ancestry.
Example (Reich et al., Nature Genetics, 2005):
- Multiple sclerosis (MS) is more prevalent in northern Europeans than in Africans.
- Consider African-Americans with MS.
- Look for regions that are significantly enriched for European ancestry.
The incorporation of locus-specific ancestry into the statistic adds 10%-50% power (Pasaniuc et al., PLoS Genetics, 2011).
Admixture mapping has been widely performed (mostly on African-Americans) for BMI, hypertension, end-stage renal disease, prostate cancer, and others.
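To make "disproportionate local ancestry" concrete, here is one simple case-only score: compare the average local ancestry at each locus with the genome-wide average among cases. This is an illustrative statistic under simplifying assumptions (loci treated as exchangeable, no controls), not the test used in the cited papers.

```python
import math

def case_only_z_scores(local_ancestry):
    """local_ancestry[i][l]: copies (0, 1, 2) of European ancestry for case i at locus l.

    Returns one z-score per locus: deviation of the locus mean from the
    genome-wide mean, in units of the standard deviation across loci.
    """
    n = len(local_ancestry)
    n_loci = len(local_ancestry[0])
    locus_mean = [sum(row[l] for row in local_ancestry) / (2.0 * n)
                  for l in range(n_loci)]
    genome_mean = sum(locus_mean) / n_loci
    var = sum((m - genome_mean) ** 2 for m in locus_mean) / (n_loci - 1)
    sd = math.sqrt(var)
    return [(m - genome_mean) / sd for m in locus_mean]
```

A locus where cases carry far more European ancestry than their genome-wide average gets a large positive score and becomes a candidate region.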
Inference of locus-specific ancestry
- For each individual, the ancestry can be described as a vector over {0,1,2}.
- Breakpoints are Poisson distributed: λ ~ number of generations × recombination rate
Sankararaman et al., American J. Human Genet., 2008
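Under this Poisson model, the chance that a window contains no breakpoint is exp(-λ × window length). A minimal sketch follows; the example numbers (7 generations, 1e-8 recombinations per base pair per generation, a 100 kb window) are assumptions for illustration.

```python
import math

def prob_no_breakpoint(generations, recomb_rate, window_length):
    """P(no ancestry breakpoint in a window) under a Poisson breakpoint model.

    recomb_rate is per unit length per generation; window_length is in the same units.
    """
    lam = generations * recomb_rate  # breakpoint rate per unit length
    return math.exp(-lam * window_length)

# Hypothetical example: 7 generations since admixture, 1e-8 per bp, 100 kb window.
p = prob_no_breakpoint(7, 1e-8, 100_000)
```

Here the window is much shorter than 1/λ, so the probability of no breakpoint is above 0.99.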
Inferring Ancestries in Windows
Local predictions
- With high likelihood, there is no breakpoint in the window (length << 1/λ).
- The ancestry is decided by majority vote across overlapping windows.
Sankararaman et al., American J. Human Genet., 2008
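The majority vote across overlapping windows can be sketched as below. The representation of a window call as a (start, end, ancestry) triple is an assumption of this sketch.

```python
from collections import Counter

def majority_vote(window_calls, n_snps):
    """Combine per-window ancestry calls into one call per SNP.

    window_calls: list of (start, end, ancestry) with end exclusive.
    SNPs covered by no window get None.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, ancestry in window_calls:
        for pos in range(start, end):
            votes[pos][ancestry] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Windows of 6 SNPs sliding by 2; ancestry switches from 0 to 1 along the way.
calls = [(0, 6, 0), (2, 8, 0), (4, 10, 1), (6, 12, 1), (8, 14, 1)]
per_snp = majority_vote(calls, 14)
```

Each SNP takes the ancestry called by most of the windows covering it, so the switch point lands where the votes flip.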
Supervised Inference of Local Ancestry
- For each individual, the ancestry can be described as a vector over {0,1,2}.
- The allele frequencies of the ancestral populations are known.
Supervised Inference of Local Ancestry
- For each individual, the ancestry can be described as a vector over {0,1,2}.
- Breakpoints are Poisson distributed: λ ~ number of generations × recombination rate
Pasaniuc et al., Bioinformatics, 2009
Inferring Ancestries in Windows
Local predictions
- With high likelihood, there is at most one breakpoint in the window.
Pasaniuc et al., Bioinformatics, 2009
Breakpoint calculation
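[Figure: a window with breakpoint position R and ancestry labels As1, At1, At2, As2]
- F(s,t,r): the probability of having ancestries As, At in the first r SNPs
- B(s,t,r): the probability of having ancestries As, At in SNPs r, r+1, ...
- F and B can be computed in linear time using dynamic programming
[Figure: example haplotype columns (A C, C G, A G, A G, A C, G G, ...)]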
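A simplified sketch of the F and B tables: for each ancestry pair (s, t), running prefix sums of per-SNP log probabilities give F in linear time, and B follows by subtraction. The sketch assumes independent SNPs, known allele frequencies strictly between 0 and 1, and no prior on the breakpoint position; the function names are illustrative.

```python
import math
from itertools import combinations_with_replacement

def genotype_prob(g, ps, pt):
    """P(genotype g) when one chromosome comes from population s and the other from t."""
    if g == 0:
        return (1 - ps) * (1 - pt)
    if g == 1:
        return ps * (1 - pt) + pt * (1 - ps)
    return ps * pt

def best_breakpoint(genotypes, freqs):
    """freqs[k][j]: frequency of the counted allele at SNP j in population k.

    Returns (r, pair_before, pair_after): SNPs before index r have ancestry
    pair_before, SNPs from r on have pair_after.
    """
    m = len(genotypes)
    pairs = list(combinations_with_replacement(range(len(freqs)), 2))
    F, B = {}, {}
    for s, t in pairs:
        prefix = [0.0]
        for j, g in enumerate(genotypes):
            prefix.append(prefix[-1] + math.log(genotype_prob(g, freqs[s][j], freqs[t][j])))
        F[(s, t)] = prefix                                          # log prob of first r SNPs
        B[(s, t)] = [prefix[m] - prefix[r] for r in range(m + 1)]   # log prob of SNPs r..m-1
    best = max((F[p][r] + B[q][r], r, p, q)
               for r in range(m + 1) for p in pairs for q in pairs)
    return best[1], best[2], best[3]
```

On a window whose first half looks like population 0 on both chromosomes and whose second half looks like population 1, the best split falls in the middle.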
Ancestry using Hidden Markov Models
A description of the haplotype generation as a Markov model.
- Structure (Pritchard et al., 2000)
- Admixmap (Hoggart et al., 2003)
- Ancestrymap (Patterson et al., 2004)
- SABER (Tang et al., 2006)
[Figure: example haplotype columns]
Ancestry using Hidden Markov Models
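- A set of states per SNP
- Transition probabilities from SNP to SNP
[Figure: states S_i and S_{i+1} with example transition probabilities 0.7 and 0.3 and example emission probabilities A: 0.9, G: 0.1]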
Ancestry using Hidden Markov Models
- Emission probabilities
[Figure: state S_i emitting alleles, shown next to example haplotype columns]
Ancestry using Hidden Markov Models
- Hidden Markov Model per population
[Figure: example haplotype columns for each population]
Ancestry using Hidden Markov Models
[Figure: haplotype columns from several per-population models joined into one model]
- Transitions between populations are based on recombination rates and the number of generations.
- Transition and emission probabilities are estimated using the Baum-Welch algorithm.
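A stripped-down sketch of decoding ancestry with such a model: one hidden state per population, emissions from known allele frequencies, and a switch probability derived from the number of generations and the genetic distance between adjacent SNPs. It runs Viterbi decoding with given parameters on a single haplotype; it does not perform Baum-Welch estimation, and the uniform prior and symmetric switching are simplifying assumptions.

```python
import math

def switch_probability(generations, morgans_between_snps):
    """P(at least one breakpoint between adjacent SNPs) under a Poisson model."""
    return 1.0 - math.exp(-generations * morgans_between_snps)

def viterbi_ancestry(alleles, freqs, switch_prob):
    """Most likely ancestry path for one haplotype.

    alleles: list of 0/1; freqs[k][j] = P(allele 1 | population k) at SNP j.
    """
    K, m = len(freqs), len(alleles)

    def emit(k, j):
        p = freqs[k][j]
        return math.log(p if alleles[j] == 1 else 1.0 - p)

    stay = math.log(1.0 - switch_prob)
    move = math.log(switch_prob / (K - 1))  # switch to any other population
    score = [math.log(1.0 / K) + emit(k, 0) for k in range(K)]
    back = []
    for j in range(1, m):
        new_score, pointers = [], []
        for k in range(K):
            cands = [score[q] + (stay if q == k else move) for q in range(K)]
            q_best = max(range(K), key=lambda q: cands[q])
            new_score.append(cands[q_best] + emit(k, j))
            pointers.append(q_best)
        score = new_score
        back.append(pointers)
    state = max(range(K), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

On a haplotype whose first half matches population 0 and second half matches population 1, the decoded path switches once, at the boundary.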
Estimating the model parameters
[Figure: reference haplotypes written as binary strings (for example 010010001010100110...) are used to estimate the per-population models, shown as haplotype columns]
Genotype: 010010222001020…..
Ancestry: 00000000000000000111111111111111122222222222222……..
Compressed vs. uncompressed model
- The number of states can be small (5-10).
- In some cases the number of states is simply the number of haplotypes in the reference population, and the emission probabilities are based on these haplotypes.
Accuracy of the methods
- Current methods reach around r^2 = 0.99 for African American populations, and 0.94 for Latinos.
- Accurate methods can detect recombination rates (Hinch et al., 2011; Wegmann et al., 2011) and selection forces (Tang et al., 2007).
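Assuming r^2 here denotes the squared Pearson correlation between the true and the inferred local ancestry (the usual convention, though the slide does not define it), it can be computed as follows.

```python
def r_squared(true_ancestry, inferred_ancestry):
    """Squared Pearson correlation between two equal-length ancestry vectors."""
    n = len(true_ancestry)
    mx = sum(true_ancestry) / n
    my = sum(inferred_ancestry) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_ancestry, inferred_ancestry))
    sxx = sum((x - mx) ** 2 for x in true_ancestry)
    syy = sum((y - my) ** 2 for y in inferred_ancestry)
    return sxy * sxy / (sxx * syy)
```

A perfect inference gives r^2 = 1, and every miscalled position pulls the value below 1.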