computational genetics winter 2013 lecture 9genetics.cs.ucla.edu/cs124/lecture/lecture9.pdf ·...


Computational Genetics Winter 2013 Lecture 9

Eleazar Eskin University of California, Los Angeles

Ancestry Inference

Lecture 9. February 13th, 2013

(Slides from Eran Halperin)

Ancestry and ALL relapse

- Native American ancestry correlates with relapse of acute lymphoblastic leukemia (ALL).
- Children with more than 10% Native American ancestry have a considerably higher chance of relapse.

Yang et al., Nature Genetics, 2011

Novembre et al., Nature, 2008

Learn about your ancestry

What is a population?

- Defined by frequencies of mutations.

            SNP A   SNP B   SNP C
European    0.3     0.1     0.4
African     0.2     0.2     0.3
Asian       0.2     0.1     0.5

Modeling Mutation Frequency

[Figure: mutation frequency (y-axis ticks .2, .5, .8) for Spain, Russia and Germany; two locations are marked "?".]

Likelihood from Population

- SNP has alleles A, G.
- Population frequencies
  - Spain: P(A) = 0.2
  - Germany: P(A) = 0.5
  - Russia: P(A) = 0.8
- Genotype likelihoods

        Spain          Germany       Russia
AA      (0.2)^2        (0.5)^2       (0.8)^2
AG      2(0.2)(0.8)    2(0.5)^2      2(0.8)(0.2)
GG      (0.8)^2        (0.5)^2       (0.2)^2

With g = number of A alleles in the genotype and p = P(A):

Likelihood: $p^{g}(1-p)^{2-g}$ (up to the constant factor $\binom{2}{g}$, which equals 2 for AG)

Log likelihood: $g\ln p + (2-g)\ln(1-p)$
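The genotype likelihoods for the three populations can be checked with a few lines of Python. This is a minimal sketch: the function names and the `freqs` dictionary are illustrative, and Hardy-Weinberg proportions are assumed, as in the table of genotype likelihoods.

```python
import math

def genotype_likelihood(g, p):
    """P(genotype with g copies of A | allele frequency p), for g in {0, 1, 2}."""
    return math.comb(2, g) * p**g * (1 - p) ** (2 - g)

def genotype_log_likelihood(g, p):
    """g ln p + (2 - g) ln(1 - p); the constant binomial factor is dropped."""
    return g * math.log(p) + (2 - g) * math.log(1 - p)

# Allele frequencies P(A) from the slide.
freqs = {"Spain": 0.2, "Germany": 0.5, "Russia": 0.8}

def most_likely_population(g):
    """Population whose frequency gives the genotype the highest log likelihood."""
    return max(freqs, key=lambda pop: genotype_log_likelihood(g, freqs[pop]))

for genotype, g in (("AA", 2), ("AG", 1), ("GG", 0)):
    print(genotype, most_likely_population(g))
```

Dropping the binomial factor does not change which population maximizes the likelihood, because the factor depends only on g and not on p.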

Spatial Mutation Frequency

[Figure: mutation frequency (y-axis ticks .2, .5, .8) as a function of location, with Spain, Russia, Germany and France marked.]

2D Mutation Frequency Functions

- Mutation frequency function over a map

$$f(x) = \frac{1}{1 + \exp(-a^{T}x - b)}$$
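As a small illustration of this logistic frequency function, the sketch below evaluates f at three map locations. The slope vector `a_example` and offset `b_example` are made-up values chosen so that the frequency rises from west to east.

```python
import math

def allele_frequency(x, a, b):
    """f(x) = 1 / (1 + exp(-a^T x - b)) for a location x given as a tuple."""
    z = sum(a_k * x_k for a_k, x_k in zip(a, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters: the frequency changes only along the first coordinate.
a_example, b_example = (2.0, 0.0), -1.0

west = allele_frequency((0.0, 0.5), a_example, b_example)
centre = allele_frequency((0.5, 0.5), a_example, b_example)
east = allele_frequency((1.0, 0.5), a_example, b_example)
print(west, centre, east)
```

The direction of a gives the direction of steepest frequency change on the map, and its length controls how sharp the change is.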

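Fitting mutation frequency functions

- Fitting frequency functions, where g_ij in {0,1,2} is the genotype of individual i at SNP j and x_i is the location of individual i
  - Log likelihood:

    $$\sum_{i}\sum_{j}\left[g_{ij}\ln f_j(x_i) + (2-g_{ij})\ln\bigl(1-f_j(x_i)\bigr)\right]$$

  - Fitting the mutation frequency function of SNP j:

    $$\max_{a_j,\,b_j}\;\sum_{i}\left[g_{ij}\ln f_j(x_i) + (2-g_{ij})\ln\bigl(1-f_j(x_i)\bigr)\right],\qquad f_j(x) = \frac{1}{1+\exp(-a_j^{T}x - b_j)}$$

  - Equivalent to minimizing the negative log likelihood, so it can be solved with convex optimization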
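One way to carry out this fit is plain gradient ascent on the log likelihood of a single SNP. The sketch below is an assumption-laden toy, not the solver used in the lecture: the step size, iteration count, and the seven positions and genotypes are invented for illustration. The gradient follows from the logistic form: each individual contributes (g - 2f(x)) times (x, 1).

```python
import math

def frequency(x, a, b):
    """Logistic frequency function f(x) = 1 / (1 + exp(-a^T x - b))."""
    z = sum(a_k * x_k for a_k, x_k in zip(a, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(positions, genotypes, a, b):
    """Sum over individuals of g ln f(x) + (2 - g) ln(1 - f(x)) for one SNP."""
    total = 0.0
    for x, g in zip(positions, genotypes):
        f = frequency(x, a, b)
        total += g * math.log(f) + (2 - g) * math.log(1.0 - f)
    return total

def fit_frequency_function(positions, genotypes, step=0.05, iterations=2000):
    """Gradient ascent on the concave log likelihood, starting from a = 0, b = 0."""
    dim = len(positions[0])
    a, b = [0.0] * dim, 0.0
    for _ in range(iterations):
        grad_a, grad_b = [0.0] * dim, 0.0
        for x, g in zip(positions, genotypes):
            residual = g - 2.0 * frequency(x, a, b)  # d(log lik)/d(a^T x + b)
            for k in range(dim):
                grad_a[k] += residual * x[k]
            grad_b += residual
        a = [a_k + step * g_k for a_k, g_k in zip(a, grad_a)]
        b += step * grad_b
    return a, b

# Toy data: the A allele becomes more common from west (x = 0) to east (x = 1).
positions = [(0.0, 0.5), (0.2, 0.1), (0.4, 0.9), (0.5, 0.5),
             (0.6, 0.3), (0.8, 0.7), (1.0, 0.2)]
genotypes = [0, 0, 1, 1, 1, 2, 2]

a_hat, b_hat = fit_frequency_function(positions, genotypes)
print(a_hat, b_hat, log_likelihood(positions, genotypes, a_hat, b_hat))
```

Because the objective is concave in (a, b), any standard convex solver could replace the fixed-step loop; gradient ascent is used here only to keep the example dependency-free.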

Placing Individuals

- Given known frequency functions
- Given individual genotypes
- Maximum likelihood placement:

  $$\max_{x_i}\;\sum_{j}\left[g_{ij}\ln f_j(x_i) + (2-g_{ij})\ln\bigl(1-f_j(x_i)\bigr)\right]$$

  - Convex optimization (minimize the negative log likelihood)

[Figure: three 3D plots titled SNP A, SNP B and SNP C, all axes from 0 to 1.]
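The placement step can be sketched the same way, now holding the frequency functions fixed and moving the individual's position. Everything concrete here is hypothetical: the three frequency functions, the all-heterozygous genotype, the starting point and the step size were chosen so that the maximum sits at a known location, (0.5, 0.25), where all three functions equal 0.5.

```python
import math

def frequency(x, a, b):
    """Logistic frequency function f(x) = 1 / (1 + exp(-a^T x - b))."""
    z = sum(a_k * x_k for a_k, x_k in zip(a, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def place_individual(genotype, functions, start=(0.0, 0.0), step=0.02, iterations=500):
    """Gradient ascent over the position x with the (a_j, b_j) held fixed."""
    x = list(start)
    for _ in range(iterations):
        grad = [0.0] * len(x)
        for g, (a, b) in zip(genotype, functions):
            residual = g - 2.0 * frequency(x, a, b)
            for k in range(len(x)):
                grad[k] += residual * a[k]
        x = [x_k + step * g_k for x_k, g_k in zip(x, grad)]
    return x

# Hypothetical frequency functions (a_j, b_j) for three SNPs.
functions = [((4.0, 0.0), -2.0), ((0.0, 4.0), -1.0), ((2.0, 2.0), -1.5)]
# A genotype that is heterozygous at every SNP is best explained where f_j = 0.5.
genotype = [1, 1, 1]

estimated_position = place_individual(genotype, functions)
print(estimated_position)
```

With real data an individual is placed using many SNPs, and homozygous genotypes at only a handful of SNPs can push the maximum toward the edge of the map, which is one reason to use many markers.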

Chicken or egg...

[Figure: three 3D plots titled SNP A, SNP B and SNP C, all axes from 0 to 1.]

Localization through a probabilistic model

- The allele frequency at SNP j for an individual at position (x, y) is given by f_j(x, y).
- To find (x, y) for each individual and f_j for each SNP j, one can optimize the likelihood by alternating two steps:
  - Maximize the likelihood of the positions given the slope functions.
  - Maximize the likelihood of the slope functions given the positions.
- Can be used to localize mixed individuals.
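The two alternating steps can be sketched as block coordinate ascent. This is a simplified illustration rather than the SPA implementation: each block takes a single gradient step with step halving instead of a full maximization, the data are simulated with made-up parameters, and the helper names (`frequency`, `uphill`, `alternate`) are hypothetical. The step-halving rule guarantees that the log likelihood never decreases.

```python
import math
import random

def frequency(w, x):
    """f_j(x, y) with w = (a1, a2, b); clipped away from 0 and 1 for safe logs."""
    z = w[0] * x[0] + w[1] * x[1] + w[2]
    f = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return min(max(f, 1e-9), 1.0 - 1e-9)

def log_lik(G, X, W):
    """Log likelihood of genotypes G given positions X and frequency functions W."""
    total = 0.0
    for i, x in enumerate(X):
        for j, w in enumerate(W):
            f = frequency(w, x)
            total += G[i][j] * math.log(f) + (2 - G[i][j]) * math.log(1.0 - f)
    return total

def uphill(objective, current, gradient, step=0.1):
    """Step along the gradient, halving the step until the objective does not drop."""
    base = objective(current)
    while step > 1e-8:
        candidate = [[c + step * g for c, g in zip(row, g_row)]
                     for row, g_row in zip(current, gradient)]
        if objective(candidate) >= base:
            return candidate
        step /= 2.0
    return current

def alternate(G, X, W, rounds=15):
    n, m = len(X), len(W)
    history = [log_lik(G, X, W)]
    for _ in range(rounds):
        # Step 1: update positions with the frequency functions held fixed.
        R = [[G[i][j] - 2.0 * frequency(W[j], X[i]) for j in range(m)] for i in range(n)]
        grad_X = [[sum(R[i][j] * W[j][k] for j in range(m)) for k in range(2)]
                  for i in range(n)]
        X = uphill(lambda X_new: log_lik(G, X_new, W), X, grad_X)
        # Step 2: update frequency functions with the positions held fixed.
        R = [[G[i][j] - 2.0 * frequency(W[j], X[i]) for j in range(m)] for i in range(n)]
        grad_W = [[sum(R[i][j] * X[i][0] for i in range(n)),
                   sum(R[i][j] * X[i][1] for i in range(n)),
                   sum(R[i][j] for i in range(n))] for j in range(m)]
        W = uphill(lambda W_new: log_lik(G, X, W_new), W, grad_W)
        history.append(log_lik(G, X, W))
    return X, W, history

# Simulated toy data: 20 individuals, 30 SNPs (all values are illustrative).
random.seed(0)
true_X = [[random.random(), random.random()] for _ in range(20)]
true_W = [[random.uniform(-3, 3), random.uniform(-3, 3), random.uniform(-1, 1)]
          for _ in range(30)]
G = [[sum(random.random() < frequency(w, x) for _ in range(2)) for w in true_W]
     for x in true_X]

X0 = [[random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)] for _ in range(20)]
W0 = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(30)]
X_hat, W_hat, history = alternate(G, X0, W0)
print(history[0], history[-1])
```

Positions recovered this way are only defined up to transformations such as rotation and scaling of the map, since the likelihood depends on positions and slopes only through the products inside each f_j.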

POPRES Data

- 3,192 individuals
- 500,568 SNPs
- Each individual has 4 grandparents from the same country
- [Novembre et al., 2008]
  - PCA method

SPatial Ancestry Analysis (SPA)

- 3,192 individuals
- Each with ancestry from a single country in Europe

Globe Mapping

- Frequency functions defined on a sphere

Genomes and World Geography

Human Genome Diversity Panel: Africa, Europe, Middle East, Central South Asia, East Asia, Oceania, America

Detecting Selection

- Sharp allele frequency changes may indicate selection

$$f(x) = \frac{1}{1 + \exp(-a^{T}x - b)}$$

Possible Selection

LCT Gene Region (Chromosome 2)

Genes under selection

[Figure panels: Typical Gene, LCT Gene]

Most extreme frequency changes

Principal Component Analysis

- Each individual can be thought of as a vector in {0,1,2}^n, where n is the number of SNPs.
  - We get a matrix G of genotypes.
- PCA searches for the n-dimensional direction v such that the projection of the genotypes onto v has maximum variance.
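The variance-maximizing direction v is the top eigenvector of the covariance of the centered genotype matrix, and it can be approximated with power iteration. The sketch below uses only the standard library; the six-by-four genotype matrix and the function name are invented for illustration, with the first three individuals drawn from one hypothetical group and the last three from another.

```python
import math

def top_principal_component(G, iterations=200):
    """Power iteration for the direction of maximum variance of the rows of G."""
    n_ind, n_snp = len(G), len(G[0])
    means = [sum(row[j] for row in G) / n_ind for j in range(n_snp)]
    centered = [[row[j] - means[j] for j in range(n_snp)] for row in G]
    v = [1.0] + [0.0] * (n_snp - 1)  # arbitrary starting direction
    for _ in range(iterations):
        # Multiply v by X^T X without forming the matrix: first X v, then X^T (X v).
        proj = [sum(r[j] * v[j] for j in range(n_snp)) for r in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n_ind)) for j in range(n_snp)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    projections = [sum(r[j] * v[j] for j in range(n_snp)) for r in centered]
    return v, projections

# Toy genotype matrix: rows are individuals, columns are SNPs.
G = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 1, 2],
]

v, projections = top_principal_component(G)
print(v)
print(projections)
```

In this toy example the sign of each individual's projection separates the two groups. On real data one would typically normalize each SNP and use a library eigensolver or SVD rather than hand-written power iteration.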

PCA Overview

Principal Component Analysis

Plotting the data on a one-dimensional line for which the variance is maximized.

PCA Mapping

HapMap PCA 1-2

[Figure: HapMap samples plotted on principal components 1 and 2. Legend:
- African ancestry in Southwest USA
- Utah residents with Northern and Western European ancestry
- Han Chinese in Beijing, China
- Chinese in Metropolitan Denver, Colorado
- Gujarati Indians in Houston, Texas
- Japanese in Tokyo, Japan
- Luhya in Webuye, Kenya
- Mexican ancestry in Los Angeles, California
- Maasai in Kinyawa, Kenya
- Toscans in Italy
- Yoruba in Ibadan, Nigeria (West Africa)]

HapMap PCA 1-3

[Figure: HapMap samples plotted on principal components 1 to 3, with the same eleven-population legend as the PCA 1-2 plot.]

HapMap PCA 1,2,4

[Figure: HapMap samples plotted on principal components 1, 2 and 4, with the same eleven-population legend as the PCA 1-2 plot.]

PCA localization of mixed individuals

[Figure: percent racial admixture (0% to 100%) for individual subjects 1-90 of the Puerto Rican population (GALA study, E. Burchard); ancestry components: European, Native American, African.]

Recently Admixed Populations

Ancestral diversity

[Figure: ancestry proportions (0.0 to 1.0) for GALA Mexicans (founders); components: Native American, YRI, CEU.]

[Figure: ancestry proportions (0.0 to 1.0) for GALA Puerto Ricans (founders); components: Native American, YRI, CEU.]

Differences in ancestry across populations can reveal historical patterns.

Recently Admixed Populations

After generation 1

Recently Admixed Populations

After generation 10

Admixture Mapping

Admixture mapping: finding regions with disproportionate local ancestry.

Example (Reich et al., Nature Genetics, 2005):
- Multiple sclerosis (MS) is more prevalent in northern Europeans than in Africans.
- Consider African-American individuals with MS (cases).
- Look for regions that are significantly enriched for European ancestry.

Incorporating locus-specific ancestry into the test statistic adds 10%-50% power (Pasaniuc et al., PLoS Genetics, 2011).

Admixture Mapping

Admixture mapping: finding regions with disproportionate local ancestry.

Admixture Mapping has been widely performed (mostly on African-Americans) for BMI, hypertension, end-stage renal disease, prostate cancer, and others.

Inference of locus-specific ancestry

- For each individual, the ancestry can be described as a vector over {0,1,2}.
- Breakpoints are Poisson distributed: λ ~ (number of generations) × (recombination rate).

Sankararaman et al., American J. Human Genet., 2008
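Under the Poisson model, the chance that a window contains no breakpoint, or at most one, follows directly from λ times the window length. The sketch below is illustrative only: the 7 generations and the recombination rate of 1e-8 per base pair per generation are assumed round numbers, not values from the lecture, and the function names are hypothetical.

```python
import math

def prob_no_breakpoint(generations, recomb_rate, window_length):
    """P(0 breakpoints) = exp(-lambda * L), with lambda = generations * recomb_rate."""
    expected = generations * recomb_rate * window_length
    return math.exp(-expected)

def prob_at_most_one_breakpoint(generations, recomb_rate, window_length):
    """P(0 or 1 breakpoints) = exp(-mu) * (1 + mu), with mu = lambda * L."""
    expected = generations * recomb_rate * window_length
    return math.exp(-expected) * (1.0 + expected)

# Assumed values: 7 generations since admixture, 1e-8 recombinations per bp per generation.
generations, recomb_rate = 7, 1e-8

short_window = prob_no_breakpoint(generations, recomb_rate, 100_000)
long_window = prob_no_breakpoint(generations, recomb_rate, 10_000_000)
short_window_one = prob_at_most_one_breakpoint(generations, recomb_rate, 100_000)
print(short_window, long_window, short_window_one)
```

Short windows are very unlikely to span a breakpoint, which is what justifies treating ancestry as constant inside a window whose length is much smaller than 1/λ.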

Inferring Ancestries in Windows

- Local predictions: with high likelihood, there is no breakpoint in the window (length << 1/λ).
- Decide on the ancestry by majority vote across overlapping windows.

Sankararaman et al., American J. Human Genet., 2008
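The majority vote across overlapping windows can be sketched as follows. The input format, a list of `(start, end, ancestry)` calls with `end` exclusive, is an assumption made for this example; how each window's ancestry call is produced is outside the scope of the sketch.

```python
from collections import Counter

def majority_vote(num_snps, window_calls):
    """Combine per-window ancestry calls into one call per SNP.

    window_calls holds (start, end, ancestry) tuples; end is exclusive.
    SNPs not covered by any window get None.
    """
    votes = [Counter() for _ in range(num_snps)]
    for start, end, ancestry in window_calls:
        for snp in range(start, min(end, num_snps)):
            votes[snp][ancestry] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Hypothetical calls from five overlapping windows over eight SNPs.
window_calls = [(0, 4, 0), (2, 6, 0), (4, 8, 1), (3, 7, 1), (1, 4, 0)]
per_snp_ancestry = majority_vote(8, window_calls)
print(per_snp_ancestry)
```

Windows that straddle the true breakpoint may be called either way, but they are outvoted by the windows that lie fully on one side of it.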

Supervised Inference of Local Ancestry

- For each individual, the ancestry can be described as a vector over {0,1,2}.
- The allele frequencies of the ancestral populations are known.
- Breakpoints are Poisson distributed: λ ~ (number of generations) × (recombination rate).

Pasaniuc et al., Bioinformatics, 2009

Inferring Ancestries in Windows

- Local predictions: with high likelihood, there is at most one breakpoint in the window.

Pasaniuc et al., Bioinformatics, 2009

Breakpoint calculation

[Figure: a window with a breakpoint R and ancestry labels A_{s1}, A_{t1}, A_{t2}, A_{s2}.]

- F(s,t,r): the probability of having ancestries A_s, A_t in the first r SNPs.
- B(s,t,r): the probability of having ancestries A_s, A_t in SNPs r, r+1, ….
- F and B can be computed in linear time using dynamic programming.
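One way to read the F and B tables is as prefix and suffix log probabilities, which makes the search for a single breakpoint linear in the number of SNPs. The sketch below is an interpretation under stated assumptions, not the published method: SNPs are treated as independent, the genotype probability for an ancestry pair (s, t) is built from the two populations' allele frequencies, and the function names and toy data are invented.

```python
import math
from itertools import combinations_with_replacement

def genotype_prob(g, p_s, p_t):
    """P(genotype g) when one allele comes from population s and one from t."""
    if g == 0:
        return (1 - p_s) * (1 - p_t)
    if g == 1:
        return p_s * (1 - p_t) + p_t * (1 - p_s)
    return p_s * p_t

def prefix_log_prob(genotypes, freqs, s, t):
    """F[r] = log probability of the first r SNPs under ancestry pair (s, t)."""
    F = [0.0]
    for r, g in enumerate(genotypes):
        F.append(F[-1] + math.log(genotype_prob(g, freqs[s][r], freqs[t][r])))
    return F

def best_breakpoint(genotypes, freqs):
    """Best (pair before, pair after, breakpoint r) with at most one breakpoint."""
    n = len(genotypes)
    pairs = list(combinations_with_replacement(range(len(freqs)), 2))
    F = {pair: prefix_log_prob(genotypes, freqs, *pair) for pair in pairs}
    best = None
    for before in pairs:
        for after in pairs:
            for r in range(n + 1):
                suffix = F[after][n] - F[after][r]  # B for SNPs r, r+1, ...
                score = F[before][r] + suffix
                if best is None or score > best[0]:
                    best = (score, before, after, r)
    return best[1:]

# Toy window of six SNPs: allele frequency 0.9 in population 0, 0.1 in population 1.
freqs = [[0.9] * 6, [0.1] * 6]
genotypes = [2, 2, 2, 0, 0, 0]
result = best_breakpoint(genotypes, freqs)
print(result)
```

Each prefix table is filled in one pass over the window, so the cost grows linearly with the number of SNPs for a fixed number of ancestral populations.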

[Figure: example alleles (A, C, G) at successive SNPs along haplotypes.]

Ancestry using Hidden Markov Models

A description of haplotype generation as a Markov model. Methods:
- Structure (Pritchard et al., 2000)
- Admixmap (Hoggart et al., 2003)
- Ancestrymap (Patterson et al., 2004)
- SABER (Tang et al., 2006)

Ancestry using Hidden Markov Models

- A set of states per SNP
- Transition probabilities from SNP to SNP
- Emission probabilities

[Figure: states S_i and S_{i+1} with example transition probabilities 0.7 and 0.3, and example emission probabilities A: 0.9, G: 0.1.]

Ancestry using Hidden Markov Models

- Hidden Markov Model per population
- Transitions between populations are based on recombination rates and the number of generations.
- Transition and emission probabilities are estimated with the Baum-Welch algorithm.

[Figure: one haplotype HMM per population, with example alleles at each SNP.]
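To make the HMM view concrete, the sketch below decodes the most likely ancestry path along one haplotype with the Viterbi algorithm. It is deliberately simplified relative to the models named in the slides: there is a single hidden state per population rather than a haplotype HMM per population, the switch probability is a fixed made-up constant instead of being derived from recombination rates and generations, and the parameters are given rather than estimated with Baum-Welch.

```python
import math

def viterbi_ancestry(haplotype, freqs, switch_prob, prior):
    """Most likely population label per SNP for a 0/1-coded haplotype.

    freqs[k][i] is the frequency of allele 1 at SNP i in population k.
    Assumes at least two populations.
    """
    K, n = len(freqs), len(haplotype)

    def emit(k, i):
        p = freqs[k][i]
        return math.log(p if haplotype[i] == 1 else 1.0 - p)

    def trans(j, k):
        return math.log(1.0 - switch_prob) if j == k else math.log(switch_prob / (K - 1))

    score = [math.log(prior[k]) + emit(k, 0) for k in range(K)]
    back = []
    for i in range(1, n):
        new_score, pointers = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: score[j] + trans(j, k))
            new_score.append(score[best_j] + trans(best_j, k) + emit(k, i))
            pointers.append(best_j)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(K), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy example: allele 1 is common in population 0 and rare in population 1.
freqs = [[0.9] * 10, [0.1] * 10]
haplotype = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
path = viterbi_ancestry(haplotype, freqs, switch_prob=0.05, prior=[0.5, 0.5])
print(path)
```

The decoded path switches population once, at the point where the alleles stop matching population 0. Posterior decoding with the forward-backward algorithm would give per-SNP ancestry probabilities instead of a single path.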

Estimating the model parameters

[Figure: binary haplotype strings (e.g. 010010001010100110….) shown next to the per-population HMMs.]

Genotype: 010010222001020…..

Ancestry: 00000000000000000111111111111111122222222222222……..

Compressed vs. uncompressed model

- The number of states can be small (5-10).
- In some cases the number of states is simply the number of haplotypes in the reference population, and the emission probabilities are based on these haplotypes.

Accuracy of the methods

- Current methods reach around r^2 = 0.99 for African American populations and 0.94 for Latinos.
- Accurate local-ancestry calls can in turn be used to estimate recombination rates (Hinch et al., 2011; Wegmann et al., 2011) and to detect selection (Tang et al., 2007).
