wi’08structure population sub-structure. wi’08structure projects harish/nitin gaurav (tuesday)...

Post on 17-Jan-2018

226 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Wi’08Structure Population sub-structure confounds association Consider an association test in a recently admixed population Suppose location A increases susceptibility to a disease common in Africans. Let locus B associate with skin-color. As expected, locus A is associated with the disease Unexpectedly, locus B also shows association The problem arises because the assumption of a random mating population is no longer true – The population has structure! D N D N D N D D N A B African European

TRANSCRIPT

Wi’08 Structure

Population sub-structure

Wi’08 Structure

Projects• Harish/Nitin• Gaurav (Tuesday)• Stefano/Hossein (Tuesday)• Nisha/Yu• David• Jian/Josue (Tuesday)

Wi’08 Structure

Population sub-structure confounds association

• Consider an association test in a recently admixed population

• Suppose location A increases susceptibility to a disease common in Africans.

• Let locus B associate with skin-color.• As expected, locus A is associated

with the disease• Unexpectedly, locus B also shows

association• The problem arises because the

assumption of a random mating population is no longer true– The population has structure!

0 .. 1 D0 .. 1 N0 .. 0 D1 .. 1 N0 .. 1 D0 .. 1 D0 .. 1 D0 .. 1 D0 .. 1 D

1 .. 0 N1 .. 0 N0 .. 0 D1 .. 1 D1 .. 0 N1 .. 0 N1 .. 0 N1 .. 0 N1 .. 0 N

A B

Afr

ican

Euro

pean

Wi’08 Structure

Population sub-structure can increase LD

• Consider two populations that were isolated and evolving independently.

• They might have different allele frequencies in some regions.

• Pick two regions that are far apart (LD is very low, close to 0)

0 .. 10 .. 10 .. 01 .. 10 .. 10 .. 10 .. 10 .. 10 .. 1

1 .. 01 .. 00 .. 01 .. 11 .. 01 .. 01 .. 01 .. 01 .. 0

Pop. A

Pop. B

p1=0.1q1=0.9P11=0.1D=0.01

p1=0.9q1=0.1P11=0.1D=0.01

Wi’08 Structure

Recent ad-mixing of population

• If the populations came together recently (Ex: African and European population), artificial LD might be created.

• D = 0.15 (instead of 0.01), increases 10-fold

• This spurious LD might lead to false associations

• Other genetic events can cause LD to increase, and one needs to be careful

0 .. 10 .. 10 .. 01 .. 10 .. 10 .. 10 .. 10 .. 10 .. 1

1 .. 01 .. 00 .. 01 .. 11 .. 01 .. 01 .. 01 .. 01 .. 0

Pop. A+B

p1=0.5q1=0.5P11=0.1D=0.1-0.25=0.15

Wi’08 Structure

Determining population sub-structure

• Given a mix of people, can you sub-divide them into ethnic populations.

• Turn the ‘problem’ of spurious LD into a clue. – Find markers that are too far apart to show LD– If they do show LD (correlation), that shows the

existence of multiple populations.– Sub-divide them into populations so that LD

disappears.

Wi’08 Structure

Determining Population sub-structure• Same example as before:• The two markers are too

similar to show any LD, yet they do show LD.

• However, if you split them so that all 0..1 are in one population and all 1..0 are in another, LD disappears

0 .. 10 .. 10 .. 01 .. 10 .. 10 .. 10 .. 10 .. 10 .. 1

1 .. 01 .. 00 .. 01 .. 11 .. 01 .. 01 .. 01 .. 01 .. 0

Wi’08 Structure

Iterative algorithm for population sub-structure

• Define• N = number of individuals (each has a single

chromosome)• k = number of sub-populations. • Z {1..k}N is a vector giving the sub-

population. – Zi=k’ => individual i is assigned to population k’

• Xi,j = allelic value for individual i in position j• Pk,j,l = frequency of allele l at position j in

population k

Wi’08 Structure

Example• Ex: consider the

following assignment• P1,1,0 = 0.9• P2,1,0 = 0.1

0 .. 10 .. 10 .. 01 .. 10 .. 10 .. 10 .. 10 .. 10 .. 10 .. 1

1 .. 01 .. 00 .. 01 .. 11 .. 01 .. 01 .. 01 .. 01 .. 01 .. 0

1111111111

2222222222

P1,1,0

Wi’08 Structure

Goal• X is known.• P, Z are unknown. • The goal is to estimate Pr(P,Z|X)• Various learning techniques can be employed.

– maxP,Z Pr(X|P,Z) (Max likelihood estimate)– maxP,Z Pr(X|P,Z) Pr(P,Z) (MAP)– Sample P,Z from Pr(P,Z|X)

• Here a Bayesian (MCMC) scheme is employed to sample from Pr(P,Z|X). We will only consider a simplified version

Wi’08 Structure

Algorithm:Structure

• Iteratively estimate – (Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m))

• After ‘convergence’, Z(m) is the answer.

• Iteration– Guess Z(0)

– For m = 1,2,..• Sample P(m) from Pr(P | X, Z(m-1))• Sample Z(m) from Pr(Z | X, P(m))

• How is this sampling done?

(Z1,P1)

(Z3,P3)

(Z4,P4)

(Z2,P2)

Wi’08 Structure

Example

• Choose Z at random, so each individual is assigned to be in one of 2 populations. See example.

• Now, we need to sample P(1) from Pr(P | X, Z(0))

• Simply count• Nk,j,l = number of people in

population k which have allele l in position j

• pk,j,l = Nk,j,l / N

0 .. 10 .. 10 .. 01 .. 10 .. 10 .. 10 .. 10 .. 10 .. 10 .. 1

1 .. 01 .. 00 .. 01 .. 11 .. 01 .. 01 .. 01 .. 01 .. 01 .. 0

1221121212

1221121221

Wi’08 Structure

Example• Nk,j,l = number of people in

population k which have allele l in position j

• pk,j,l = Nk,j,l / Nk,j,*• N1,1,0 = 4• N1,1,1 = 6• p1,1,0 = 4/10• p1,2,0 = 4/10 • Thus, we can sample P(m)

0 .. 10 .. 10 .. 01 .. 10 .. 10 .. 10 .. 10 .. 10 .. 10 .. 1

1 .. 01 .. 00 .. 01 .. 11 .. 01 .. 01 .. 01 .. 01 .. 01 .. 0

1221121212

1221121221

Wi’08 Structure

Sampling Z• Pr[Z1 = 1] = Pr[”01” belongs to population 1]?• We know that each position should be in

linkage equilibrium and independent.• Pr[”01” |Population 1] = p1,1,0 * p1,2,1

=(4/10)*(6/10)=(0.24) • Pr[”01” |Population 2] = p2,1,0 * p2,2,1 =

(6/10)*(4/10)=0.24• Pr [Z1 = 1] = 0.24/(0.24+0.24) = 0.5

Assuming, HWE, and LE

Wi’08 Structure

Sampling

• Suppose, during the iteration, there is a bias.

• Then, in the next step of sampling Z, we will do the right thing

• Pr[“01”| pop. 1] = p1,1,0 * p1,2,1 = 0.7*0.7 = 0.49

• Pr[“01”| pop. 2] = p2,1,0 * p2,2,1 =0.3*0.3 = 0.09

• Pr[Z1 = 1] = 0.49/(0.49+0.09) = 0.85• Pr[Z6 = 1] = 0.49/(0.49+0.09) = 0.85• Eventually all “01” will become 1

population, and all “10” will become a second population

0 .. 10 .. 10 .. 01 .. 10 .. 10 .. 10 .. 10 .. 10 .. 10 .. 1

1 .. 01 .. 00 .. 01 .. 11 .. 01 .. 01 .. 01 .. 01 .. 01 .. 0

1112121211

2221221221

Wi’08 Structure

Allowing for admixture• Define qi,k as the fraction of individual i

that originated from population k.• Iteration

– Guess Z(0)

– For m = 1,2,..• Sample P(m),Q(m) from Pr(P,Q | X, Z(m-1))• Sample Z(m) from Pr(Z | X, P(m),Q(m))

Wi’08 Structure

Estimating Z (admixture case)• Instead of estimating Pr(Z(i)=k|X,P,Q), (origin

of individual i is k), we estimate Pr(Z(i,j,l)=k|X,P,Q)

i,1i,2

j

Pr(Z i, j,l = k | X,P,Q) =qi,k Pr(X i, j,l | Z i, j,l = k,P)

qi,k' Pr(X i, j ,l | Z i, j ,l = k',P)k'

Wi’08 Structure

Results: Thrush data• For each

individual, q(i) is plotted as the distance to the opposite side of the triangle.

• The assignment is reliable, and there is evidence of admixture.

Wi’08 Structure

NJ versus Structure:thrush data

• Objective function is different in standard clustering algorithms!

Wi’08 Structure

Population Structure in isolated human populations

• 377 locations (loci) were sampled in 1000 people from 52 populations.

• 6 genetic clusters were obtained, which corresponded to 5 geographic regions (Rosenberg et al. Science 2003)

AfricaEurasia East Asia

America

Ocea

nia

Wi’08 Structure

Population sub-structure:research problem• Systematically explore the effect of

admixture. Can admixture be predicted for a locus, or for an individual

• The sampling approach may or may not be appropriate. Formulate as an optimization/learning problem:– (w/out admixture). Assign individuals to sub-

populations so as to maximize linkage equilibrium, and hardy weinberg equilibrium in each of the sub-populations

– (w/ admixture) Assign (individuals, loci) to sub-populations

Wi’08 Structure

Population Substructure Conclusions• Populations substructure (violating the

assumption of random mating) can create artificial linkage disequlibrium, and false associations

• One way out of it is to identify and sub-divide the populations into sub-populations.

• Another approach is to ‘correct’ for the effects of structure.

Wi’08 Structure

Detecting structural variations

Wi’08 Structure

Genomic variation• Inherited phenotypes arise from genetic variation

– Point mutations, indels, genetic recombination, but also….– Structural Variation

• Insertions, deletions, translocations, inversions, fissions, fusions….

deletions

inversions

translocations

Wi’08 Structure

Fluoroscent in situ hybridization

• (Cancer genomes show extensive structural variation)

• Historically, larger structural variations (easily observed under a microscope were commonly studied, mostly in the context of diseases…

Wi’08 Structure

HapMap and other projects

• The development of molecular techniques (sequencing, genotyping) shifted interest onto smaller scale mutations. They were assumed to be the dominant source of genetic variation

• It turns out that structural variation in normal populations is more common than assumed.

Wi’08 Structure

1. Inversions and HapMap

A

AA

A

A

ACC

GG

G

G

T

TT

G

G

GAG

A

AA

G

A

A GGA

CTCA

• SNP data is seemingly oblivious to Inversions (& other structural polymorphisms)

• Can we detect a signal for inversion polymorphisms in SNPs?

A

A

A

AA

G

G

G

G

GG

G

GG

reference genome sequence

population

Wi’08 Structure

Array-CGH

Garnis et al. Molecular Cancer 2004

Wi’08 Structure

CNVs dominate, possibly because of technology

Wi’08 Structure

Structural variations

• Structural variants are common, but non CNVs are hard to detect.

• Some of the translocations are important in disease

Wi’08 Structure

2. Variations in tumor genomes

• Fusion observed in leukemia, lymphoma, and sarcomas

• “Philadelphia Translocation” Drugs target this fusion protein

• Can we detect fused genes in tumors?

Wi’08 Structure

3. Assaying for specific variations

• Most tumors start off with a single cell, which then proliferate.

• Drugs like Gleevec are used well after cancer has taken hold.

• Can we detect the cancer early by detecting the genomic abnormality?

– If a very few cells in the person are cancerous, can we still detect it?

• Can we track a patient through his treatment?

Wi’08 Structure

Today

• Detection of inversion polymorphisms (& other copy neutral variation) using genotype data.

• Detection of gene fusion events using sequencing

• PCR based detection of variation in the presence of heterosomy: only a fraction of cells carry the variant genome

Wi’08 Structure

Chr. 17 Inversion

Wi’08 Structure

Inversion polymorphisms

Wi’08 Structure

Common inverted haplotype

Wi’08 Structure

LLR statistic computation

• Multi-allelic markers are chosen in blocks around the putative breakpoints

• LD is measured using multi-allelic D-statistic

M1 M2 M3 M4

LD24 LD34LD13LD12

Bansal et al.Genome Research, 2007

Wi’08 Structure

LLRl

M1 M2 M3 M4

LD13LD13

LLRl = log Pr(LD12,LD13 | M1 → M3 → M2 → M4 )Pr(LD12,LD13 | M1 → M2 → M3 → M4 )

⎛ ⎝ ⎜

⎞ ⎠ ⎟

d12d13

Wi’08 Structure

LLRR

M1 M2 M3 M4

LD24 LD34

LLRR = log Pr(LD24 ,LD34 | M1 → M3 → M2 → M4 )Pr(LD24 ,LD34 | M1 → M2 → M3 → M4 )

⎛ ⎝ ⎜

⎞ ⎠ ⎟

d12d24

Wi’08 Structure

Inversion polymorphisms in human

• LLRl and LLRr define our inversion statistic • Large positive values for both likelihood ratios

indicate inversion • A permutation test to estimate the significance

of the statistic• Overlapping breakpoints are merged

Wi’08 Structure

Power

• Difficult to simulate inversions using the coalescent• We simulated inversions on HapMap data.• (a) varying frequency, 500kb inversion• (b) varying length

Wi’08 Structure

Inversion polymorphisms

• We used the Phase I phased haplotype data from the International HapMap project – 3 samples: CEU, YRI, Asian: CHB+JPT about 900K

SNPs • The location between two adjacent SNPs was

considered a potential inversion breakpoint • The statistic was computed for every pair of

inversion breakpoints within a certain distance (200kb - 6Mb)

• 176 inversion polymorphisms were detected

Wi’08 Structure

Inversions with secondary evidence

• 18 regions had highly homologous repeat sequence, 11 with inverted repeats (p-value 0.006 against random distribution of inversion breakpoints)

Wi’08 Structure

Inversions with secondary evidence

Wi’08 Structure

1.4Mb inversion at 16p12(CEU+YRI)

• Supported by fosmid pair evidence (Tuzun et al., 2005)• 80kb long inverted repeat• Identical breakpoints in CEU and YRI data• Known chimp repeat

Wi’08 Structure

1.2Mb Chr 10 (CHB+JPT)

• Inversion supported by fosmid evidence (Tuzun et al. 2005)

Wi’08 Structure

Chr 6 Inversion(YRI) & TCBA1

• 66 inversion breakpoints covered by genes• Less than expected by chance (p-value 0.02)• Many overlapping genes are known to be disrupted in disease• T-cell lymphoma breakpoint associated target (TCBA1)

– Structurally disrupted in T-cell lymphoma cell-lines

Wi’08 Structure

ICAp69

• Islet cell antigen (ICAp69) gene

Wi’08 Structure

Inversion Polymorphism summary

• Many inversions (and copy neutral) polymorphisms remain to be detected.

• The genotype LD signal is useful but has low power.– We are investigating other signals to

improve detection• Some of these variations lead to genes

fusing.

Wi’08 Structure

Conclusions for the class

• Genetic variations are important to our understanding of evolution and genotype-phenotype relationship

Wi’08 Structure

Topics covered1. We started (and finished) by considering sources of

variation2. Models of evolution of population under natural

assumptions1. HW/Linkage equlibrium2. Efficient simulation of populations via coalescent theory

3. Evolution under recombinations/recombination hot-spot detection via counting of recombination events

4. Detection of regions under selection5. Association testing6. Population sub-structure7. Detecting structural variation

Wi’08 Structure

Algorithmic diversions1. Phylogeny reconstruction (perfect phylogeny,

distance-based methods)2. Optimization using (Integer-) Linear programming,

simulated annealing and other paradigms3. Greedy algorithms, dynamic programming4. Stochastic sampling methods (MCMC, Gibbs

sampling)5. Statistical tests for significance6. Efficient simulation techniques

1. Coalescent for populations2. Simulating genotype/phenotype associations

top related