computational genetics winter 2013 lecture 9genetics.cs.ucla.edu/cs124/lecture/lecture9.pdf ·...

Computational Genetics Winter 2013 Lecture 9

Eleazar Eskin University of California, Los Angeles

Ancestry Inference

Lecture 9. February 13th, 2013

(Slides from Eran Halperin)

Ancestry and ALL relapse

- Native American ancestry correlates with relapse of acute lymphoblastic leukemia (ALL).
- Children with more than 10% Native American ancestry have a considerably higher chance of relapse.

Yang et al., Nature Genetics, 2011

Novembre et al., Nature, 2008

Learn about your ancestry

What is a population?

- Defined by frequencies of mutations.

            SNP A   SNP B   SNP C
European    .3      .1      .4
African     .2      .2      .3
Asian       .2      .1      .5

Modeling Mutation Frequency

[Figure: mutation frequency (Freq axis) across Spain, Germany, and Russia]

Likelihood from Population

- SNP has alleles A and G.
- Population frequencies: Spain: P(A) = 0.2, Germany: P(A) = 0.5, Russia: P(A) = 0.8
- Genotype likelihoods:

        Spain          Germany     Russia
  AA    (0.2)^2        (0.5)^2     (0.8)^2
  AG    2(0.2)(0.8)    2(0.5)^2    2(0.2)(0.8)
  GG    (0.8)^2        (0.5)^2     (0.2)^2

With $g$ = number of A alleles in the genotype and $p = P(A)$:

Likelihood: $p^{g}(1-p)^{2-g}$ (up to the constant factor $\binom{2}{g}$, which does not depend on $p$)

Log likelihood: $g \ln p + (2-g)\ln(1-p)$
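As a minimal sketch of the likelihood computation on this slide, the snippet below evaluates the genotype likelihoods for the three example populations and picks the most likely one. The function names are illustrative assumptions, not part of the lecture.

```python
import math

# Frequencies of allele A from the slide's example.
FREQ_A = {"Spain": 0.2, "Germany": 0.5, "Russia": 0.8}

def genotype_likelihood(g, p):
    """P(genotype with g copies of A | P(A) = p), including the binomial factor."""
    return math.comb(2, g) * p**g * (1 - p) ** (2 - g)

def log_likelihood(g, p):
    """g ln(p) + (2 - g) ln(1 - p); the constant binomial factor is dropped."""
    return g * math.log(p) + (2 - g) * math.log(1 - p)

def most_likely_population(g, freqs=FREQ_A):
    """Population whose frequency maximizes the log likelihood of genotype g."""
    return max(freqs, key=lambda pop: log_likelihood(g, freqs[pop]))

for g, name in [(2, "AA"), (1, "AG"), (0, "GG")]:
    print(name, most_likely_population(g))
```

Dropping the binomial factor does not change which population wins, because the factor is the same for every population.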

Spatial Mutation Frequency

[Figure: mutation frequency (Freq axis) as a function of location across Spain, France, Germany, and Russia]

2D Mutation Frequency Functions

- Mutation frequency function over a map:

$$f(x) = \frac{1}{1 + \exp(-a^{T}x - b)}$$

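Fitting mutation frequency functions

- Fitting frequency functions
  - Log likelihood of individual $i$ at position $x_i$, summed over SNPs $j$:

    $$\sum_j \left[ g_{ij} \ln f_j(x_i) + (2 - g_{ij}) \ln\bigl(1 - f_j(x_i)\bigr) \right]$$

  - Fitting the mutation frequency function of SNP $j$, summed over individuals $i$ (minimizing the negative log likelihood):

    $$\min_{a_j, b_j} \; -\sum_i \left[ g_{ij} \ln f_j(x_i) + (2 - g_{ij}) \ln\bigl(1 - f_j(x_i)\bigr) \right], \qquad f_j(x) = \frac{1}{1 + \exp(-a_j^{T}x - b_j)}$$

  - Can be solved with convex optimization.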
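The fit for a single SNP can be sketched with plain gradient ascent on the log likelihood. This is only one possible solver, assuming NumPy; the function name, step size, and iteration count are illustrative choices, not the lecture's method. The gradient uses the identity that the derivative of $g \ln f + (2-g)\ln(1-f)$ with respect to $a^{T}x + b$ is $g - 2f$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_frequency_function(X, g, steps=5000, lr=0.5):
    """Fit f(x) = sigmoid(a.x + b) for one SNP.

    X: (n, d) positions of n individuals; g: (n,) genotypes in {0, 1, 2}.
    Returns the slope vector a and the intercept b.
    """
    n, d = X.shape
    a = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        f = sigmoid(X @ a + b)
        resid = g - 2.0 * f              # per-individual gradient w.r.t. a.x + b
        a += lr * (X.T @ resid) / n      # ascent step on the mean log likelihood
        b += lr * resid.mean()
    return a, b
```

Because the log likelihood is concave in $(a, b)$, any reasonable convex solver could replace the fixed-step loop.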

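Placing Individuals

- Given known frequency functions $f_j$
- Given individual genotypes $g_{ij}$
- Maximum likelihood placement:

  $$\hat{x}_i = \arg\max_{x_i} \sum_j \left[ g_{ij} \ln f_j(x_i) + (2 - g_{ij}) \ln\bigl(1 - f_j(x_i)\bigr) \right]$$

  - Convex optimization

[Figure: map annotated with allele frequencies between 0.2 and 0.8]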
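Placement of one individual can be sketched the same way, now holding the frequency functions fixed and moving the position. This assumes NumPy and logistic frequency functions with known slopes and intercepts; the function name and step settings are illustrative assumptions.

```python
import numpy as np

def place_individual(A, b, g, steps=2000, lr=0.1):
    """Maximum likelihood position for one individual.

    A: (m, d) slope vectors a_j for m SNPs; b: (m,) intercepts b_j;
    g: (m,) genotypes of the individual in {0, 1, 2}.
    """
    m, d = A.shape
    x = np.zeros(d)                       # start at the origin of the map
    for _ in range(steps):
        f = 1.0 / (1.0 + np.exp(-(A @ x + b)))
        # Gradient of sum_j [g_j ln f_j + (2 - g_j) ln(1 - f_j)] w.r.t. x
        x += lr * (A.T @ (g - 2.0 * f)) / m
    return x
```

The objective is concave in the position, so the ascent does not depend on the starting point.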

Chicken or egg…

[Figure: map annotated with allele frequencies between 0.2 and 0.8]

Localization through a probabilistic model

- The allele frequency at SNP $j$ for an individual at position $(x, y)$ is given by $f_j(x, y)$.
- To find $(x, y)$ for each individual and $f_j$ for each SNP $j$, one can optimize the likelihood by alternating two steps:
  - Maximize the likelihood of the positions given the slope functions.
  - Maximize the likelihood of the slope functions given the positions.
- Can be used to localize mixed individuals.
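The alternation between positions and slope functions can be sketched as block-wise gradient ascent on the joint log likelihood. This is an assumption-laden illustration using NumPy: the initialization, step size, and number of rounds are arbitrary, and the recovered map is only determined up to an invertible linear change of coordinates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(G, X, A, b):
    """Joint log likelihood of genotypes G (n, m) given positions and slopes."""
    F = np.clip(sigmoid(X @ A.T + b), 1e-9, 1 - 1e-9)
    return float(np.sum(G * np.log(F) + (2 - G) * np.log(1 - F)))

def alternate(G, d=2, rounds=20, inner=20, lr=0.05, seed=0):
    """Alternate between updating slope functions and updating positions."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    X = 0.1 * rng.standard_normal((n, d))   # random initial positions
    A = np.zeros((m, d))
    b = np.zeros(m)
    for _ in range(rounds):
        for _ in range(inner):               # slopes given positions
            R = G - 2.0 * sigmoid(X @ A.T + b)
            A += lr * (R.T @ X) / n
            b += lr * R.mean(axis=0)
        for _ in range(inner):               # positions given slopes
            R = G - 2.0 * sigmoid(X @ A.T + b)
            X += lr * (R @ A) / m
    return X, A, b
```

Each half-step is a concave problem on its own, which is what makes the alternation practical even though the joint problem is not convex.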

POPRES Data

- 3,192 individuals
- 500,568 SNPs
- Each individual has 4 grandparents from the same country
- [Novembre et al., 2008]
  - PCA method

SPatial Ancestry Analysis (SPA)

- 3,192 individuals
- Each with ancestry from a single country in Europe

Globe Mapping

- Frequency functions defined on a sphere

Genomes and World Geography

Human Genome Diversity Panel: Africa, Europe, Middle East, Central South Asia, East Asia, Oceania, America

Detecting Selection

- Sharp allele frequency changes (a steep slope $a$ in the frequency function) may indicate selection

$$f(x) = \frac{1}{1 + \exp(-a^{T}x - b)}$$

Possible Selection

[Figure: LCT gene region on chromosome 2]

Genes under selection

[Figure: typical gene vs. LCT gene]

Most extreme frequency changes

Principal Component Analysis

- Each individual can be thought of as a vector in $\{0,1,2\}^n$, where $n$ is the number of SNPs.
  - We get a matrix $G$ of genotypes.
- PCA searches for the $n$-dimensional direction $v$ such that the projection of the genotypes onto $v$ has maximal variance.

PCA Overview

Principal Component Analysis

PCA plots the data on the one-dimensional line along which the variance is maximized.
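A minimal sketch of PCA on a genotype matrix, assuming NumPy and using the singular value decomposition of the column-centered matrix. The function name is an illustrative assumption, and the per-SNP variance normalization often applied in genetics is left out here for simplicity.

```python
import numpy as np

def pca_project(G, k=2):
    """Project individuals onto the top-k principal directions.

    G: (individuals, SNPs) genotype matrix with entries in {0, 1, 2}.
    Returns the projections, the directions, and the variance along each one.
    """
    Gc = G - G.mean(axis=0)                      # center each SNP column
    U, S, Vt = np.linalg.svd(Gc, full_matrices=False)
    directions = Vt[:k]                          # each row is a direction v
    projections = Gc @ directions.T              # coordinates of each individual
    variances = S[:k] ** 2 / (G.shape[0] - 1)    # sample variance along each v
    return projections, directions, variances
```

The first row of `directions` is the direction $v$ of maximal variance; plotting the first two columns of `projections` gives a PCA 1-2 scatter plot.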

PCA Mapping

HapMap PCA 1-2

[Figure: HapMap samples projected onto principal components 1 and 2, colored by population]

Populations in the legend:

- African ancestry in Southwest USA
- Utah residents with Northern and Western European ancestry
- Han Chinese in Beijing, China
- Chinese in Metropolitan Denver, Colorado
- Gujarati Indians in Houston, Texas
- Japanese in Tokyo, Japan
- Luhya in Webuye, Kenya
- Mexican ancestry in Los Angeles, California
- Maasai in Kinyawa, Kenya
- Toscans in Italy
- Yoruba in Ibadan, Nigeria (West Africa)

HapMap PCA 1-3

[Figure: HapMap samples projected onto principal components 1 and 3, colored by population]

HapMap PCA 1,2,4

[Figure: HapMap samples projected onto principal components 1, 2, and 4, colored by population]

PCA localization of mixed individuals

[Figure: percent racial admixture (0% to 80%) of European, Native American, and African ancestry for individual subjects 1-90, Puerto Rican population (GALA study, E. Burchard)]

Recently Admixed Populations

Ancestral diversity

[Figure: ancestry proportions (Native American, YRI, CEU) for GALA Mexican founders and GALA Puerto Rican founders]

Differences in ancestry across populations can reveal historical patterns.

[Figure panels: after generation 1; after generation 10]

Admixture Mapping

Admixture mapping: finding regions with disproportionate local ancestry.

Example (Reich et al., Nature Genetics, 2005):

- Multiple sclerosis (MS) is more prevalent in northern Europeans than in Africans.
- Consider African-American cases with MS.
- Look for regions that are significantly enriched for European ancestry.

Incorporating locus-specific ancestry into the test statistic adds 10%-50% power (Pasaniuc et al., PLoS Genetics, 2011).


Admixture Mapping has been widely performed (mostly on African-Americans) for BMI, hypertension, end-stage renal disease, prostate cancer, and others.

Inference of locus-specific ancestry

- For each individual, the ancestry can be described as a vector over {0,1,2}, one entry per SNP.
- Breakpoints are Poisson distributed, with rate λ ≈ (number of generations) × (recombination rate).

Sankararaman et al., American J. Human Genet., 2008
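To make the Poisson breakpoint model concrete, the sketch below computes the probability that a window contains no breakpoint, or at most one. The function names and the example values (7 generations, a recombination rate of 1e-8 per base pair per generation, a 100 kb window) are illustrative assumptions only.

```python
import math

def breakpoint_rate(generations, recomb_rate_per_bp):
    """Poisson rate of ancestry breakpoints per base pair."""
    return generations * recomb_rate_per_bp

def prob_no_breakpoint(window_bp, lam):
    """P(0 breakpoints) in a window of window_bp base pairs."""
    return math.exp(-lam * window_bp)

def prob_at_most_one_breakpoint(window_bp, lam):
    """P(0 or 1 breakpoints) in the window."""
    mu = lam * window_bp
    return math.exp(-mu) * (1.0 + mu)

lam = breakpoint_rate(generations=7, recomb_rate_per_bp=1e-8)
print(prob_no_breakpoint(100_000, lam))
print(prob_at_most_one_breakpoint(100_000, lam))
```

Windows much shorter than 1/λ make the expected number of breakpoints small, so both probabilities are close to 1.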

Inferring Ancestries in Windows

- Local predictions: with high probability, there is no breakpoint in the window (window length << 1/λ).
- The ancestry at each position is decided by majority vote across overlapping windows.

Sankararaman et al., American J. Human Genet., 2008
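The voting step can be sketched as follows, assuming each window has already been assigned a single ancestry label. The `(start, end, ancestry)` input format and the function name are hypothetical choices made for illustration.

```python
from collections import Counter

def majority_vote(n_snps, window_calls):
    """Combine per-window ancestry calls into one call per SNP.

    window_calls: list of (start, end, ancestry) with end exclusive.
    SNPs covered by no window get None.
    """
    votes = [Counter() for _ in range(n_snps)]
    for start, end, ancestry in window_calls:
        for i in range(start, min(end, n_snps)):
            votes[i][ancestry] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Windows of 6 SNPs sliding by 2 SNPs over 14 SNPs.
calls = [(0, 6, 0), (2, 8, 0), (4, 10, 1), (6, 12, 1), (8, 14, 1)]
print(majority_vote(14, calls))
```

Using an odd number of windows per SNP, as in the interior of this example, avoids ties.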

Supervised Inference of Local Ancestry

- For each individual, the ancestry can be described as a vector over {0,1,2}.
- The allele frequencies of the ancestral populations are known.

Supervised Inference of Local Ancestry

- For each individual, the ancestry can be described as a vector over {0,1,2}.
- Breakpoints are Poisson distributed, with rate λ ≈ (number of generations) × (recombination rate).

Pasaniuc et al., Bioinformatics, 2009

Inferring Ancestries in Windows

- Local predictions: with high probability, there is at most one breakpoint in the window.

Pasaniuc et al., Bioinformatics, 2009

Breakpoint calculation

At1 At2

- $F(s,t,r)$: the probability of having ancestries $A_s, A_t$ in the first $r$ SNPs
- $B(s,t,r)$: the probability of having ancestries $A_s, A_t$ in SNPs $r, r+1, \ldots$
- $F$ and $B$ can be computed in linear time using dynamic programming
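One way to read F and B is as prefix and suffix products of per-SNP genotype probabilities, which running sums in log space give in linear time. The sketch below makes several assumptions not stated on the slide: SNPs are treated as independent within the window, allele frequencies are known and strictly between 0 and 1, and only one of the two chromosomes changes ancestry at the breakpoint. All function names are illustrative.

```python
import numpy as np

def genotype_logprob(g, ps, pt):
    """Per-SNP log P(g | one chromosome from ancestry s, the other from t)."""
    probs = np.stack([(1 - ps) * (1 - pt),
                      ps * (1 - pt) + (1 - ps) * pt,
                      ps * pt])                       # shape (3, m)
    return np.log(probs[g, np.arange(len(g))])

def forward_backward_tables(g, P):
    """logF[s, t, r]: first r SNPs; logB[s, t, r]: SNPs r, r+1, ..."""
    K, m = P.shape
    logF = np.zeros((K, K, m + 1))
    logB = np.zeros((K, K, m + 1))
    for s in range(K):
        for t in range(K):
            lp = genotype_logprob(g, P[s], P[t])
            logF[s, t, 1:] = np.cumsum(lp)               # prefix sums
            logB[s, t, :m] = np.cumsum(lp[::-1])[::-1]   # suffix sums
    return logF, logB

def best_single_breakpoint(g, P):
    """Best (score, r, left pair, right pair) with one chromosome switching at r."""
    g = np.asarray(g)
    logF, logB = forward_backward_tables(g, P)
    K, m = P.shape
    best = None
    for r in range(m + 1):
        for s in range(K):
            for t in range(K):
                for s2 in range(K):
                    score = logF[s, t, r] + logB[s2, t, r]
                    if best is None or score > best[0]:
                        best = (score, r, (s, t), (s2, t))
    return best
```

Setting `s2 == s`, `r == 0`, or `r == m` corresponds to a window with no breakpoint, so the search covers the "at most one breakpoint" case.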


Ancestry using Hidden Markov Models

A description of the haplotype generation as a Markov model.

- Structure (Pritchard et al., 2000)
- Admixmap (Hoggart et al., 2003)
- Ancestrymap (Patterson et al., 2004)
- SABER (Tang et al., 2006)

Ancestry using Hidden Markov Models

- A set of states per SNP
- Transition probabilities from SNP to SNP (from state $S_i$ to state $S_{i+1}$)
- Emission probabilities (e.g. A: 0.9, G: 0.1)

Ancestry using Hidden Markov Models

- One hidden Markov model per population
- Transitions between populations are based on recombination rates and the number of generations
- Transition and emission probabilities are estimated using the Baum-Welch algorithm
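As a deliberately simplified sketch of HMM-based ancestry decoding, the code below uses a single hidden state per population (rather than a set of states per SNP) on one haplotype, and assumes the emission frequencies and a constant switch probability are already known instead of estimating them with Baum-Welch. It decodes the most likely ancestry path with the Viterbi algorithm; the function name and the default `switch_prob` are assumptions.

```python
import numpy as np

def viterbi_ancestry(hap, P, switch_prob=0.01):
    """Most likely ancestry path for one haplotype.

    hap: (m,) alleles in {0, 1}; P: (K, m) frequency of allele 1 in each
    of K >= 2 populations; switch_prob: per-SNP chance of changing ancestry.
    """
    K, m = P.shape
    log_emit = np.where(hap == 1, np.log(P), np.log(1 - P))    # (K, m)
    log_trans = np.full((K, K), np.log(switch_prob / (K - 1)))
    np.fill_diagonal(log_trans, np.log(1 - switch_prob))

    score = np.log(1.0 / K) + log_emit[:, 0]   # uniform prior over ancestries
    back = np.zeros((K, m), dtype=int)
    for j in range(1, m):
        cand = score[:, None] + log_trans       # cand[prev, cur]
        back[:, j] = np.argmax(cand, axis=0)    # best previous state per cur
        score = cand[back[:, j], np.arange(K)] + log_emit[:, j]

    path = np.empty(m, dtype=int)
    path[-1] = int(np.argmax(score))
    for j in range(m - 1, 0, -1):               # trace back the best path
        path[j - 1] = back[path[j], j]
    return path
```

In a fuller model the switch probability would vary between SNPs according to the recombination rate and the number of generations since admixture.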

Estimating the model parameters

[Figure: example haplotypes (binary strings such as 010010001010100110…) and the HMM state diagrams estimated from them]

Genotype: 010010222001020…

Ancestry: 00000000000000000111111111111111122222222222222…

Compressed vs. uncompressed model

- The number of states can be small (5-10): the compressed model.
- In some cases the number of states is simply the number of haplotypes in the reference population, and the emission probabilities are based on these haplotypes: the uncompressed model.

Accuracy of the methods

- Current methods reach around $r^2 = 0.99$ for African American populations and 0.94 for Latinos.
- Accurate local-ancestry inference can be used to estimate recombination rates (Hinch et al., 2011; Wegmann et al., 2011) and to detect selection (Tang et al., 2007).
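Assuming that $r^2$ here denotes the squared Pearson correlation between true and inferred local ancestry (the slide does not define it), the accuracy measure can be sketched with NumPy as below. The function name and the toy vectors are illustrative.

```python
import numpy as np

def ancestry_r2(true_ancestry, inferred_ancestry):
    """Squared Pearson correlation between two ancestry vectors over {0, 1, 2}."""
    x = np.asarray(true_ancestry, dtype=float)
    y = np.asarray(inferred_ancestry, dtype=float)
    return float(np.corrcoef(x, y)[0, 1] ** 2)

print(ancestry_r2([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 1]))
```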
