Bojan Basrak, Department of Mathematics, University of Zagreb, Croatia. EVA 2005, Gothenburg. EXTREME VALUES, COPULAS AND GENETIC MAPPING

Post on 21-Dec-2015



Genetic mapping

• A genetic map gives the relative positions of genes on the chromosomes, with distances between them typically measured in centimorgans (cM).

• Linkage analysis aims to find the approximate location of genes associated with certain traits in plants and animals.

• It is a statistical method that compares genetic similarity between two individuals (at a marker) to similarity of their physical or psychological traits (phenotype).

• Among the most studied traits are inheritable diseases.

QTL

• Quantitative trait: A measurable trait that shows continuous variation, e.g. skin pigmentation, height, cholesterol, etc.

• Quantitative traits are normally influenced by several genes and the environment.

• QTL or quantitative trait locus: a locus (or a gene) affecting a quantitative trait.

• There is even The Journal of Quantitative Trait Loci.


• Genetic similarity between two individuals at a given locus is typically measured by a number called identity by descent (IBD) status.

• Two genes of two different people are IBD if one is a physical copy of the other, or if they are both copies of the same ancestral gene.

• For any two people, IBD status is a number in the set {0, 1, 2}. In real life, this number typically needs to be estimated.


• Linkage analysis is very effective with Mendelian inheritance.

• Mapping genes involved in inheritable diseases can be done by comparing the IBD status of affected relatives (e.g. breast cancer).

• Mapping QTLs in animals or plants is performed by arranging a cross between two inbred strains, which are substantially different in a quantitative trait (e.g. tomato fruit mass or pH).

IBD status of two half sibs

[Figure: the mother's chromosomes and the chromosomes of two half sibs (Sib 1, Sib 2) after two meioses and some other developments, with two loci t and s marked; the horizontal axis is distance in Morgans. X(t) = number of alleles identical by descent; in the example, X(t) = 0 and X(s) = 1.]


• Recombinations, or more specifically the locations of crossovers in meiosis, are frequently modelled by a stochastic process (the standard choice is the Poisson process, suggested by Haldane in 1919).

• The process (X(t)) is an ON-OFF process in the case of half sibs, or the sum of two independent such processes in the case of siblings.

• In particular, under the Poisson process model, (X(t)) is a stationary Markov process. Moreover, X(t) is Bernoulli distributed for each t in the case of half sibs.
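The ON-OFF description can be checked with a small simulation. The sketch below assumes that crossovers in each of the two maternal meioses form independent rate-1 Poisson processes per Morgan (the Haldane model) and that every crossover in either meiosis flips the half-sib IBD status. The function names, the chromosome length, and the two loci are illustrative choices, not taken from the talk.

```python
import math
import random


def crossover_points(length_morgans, rng):
    """Crossover locations of one meiosis: a rate-1 Poisson process per Morgan."""
    points, position = [], rng.expovariate(1.0)
    while position < length_morgans:
        points.append(position)
        position += rng.expovariate(1.0)
    return points


def simulate_half_sib_ibd(length_morgans, rng):
    """Return (x0, switches): IBD status at position 0 and the points where it flips."""
    # Each sib starts on one of the two maternal strands, chosen at random.
    strand_1, strand_2 = rng.randrange(2), rng.randrange(2)
    x0 = 1 if strand_1 == strand_2 else 0
    # Assumption: every crossover in either meiosis toggles the ON-OFF process.
    switches = sorted(crossover_points(length_morgans, rng)
                      + crossover_points(length_morgans, rng))
    return x0, switches


def ibd_at(t, x0, switches):
    """Evaluate X(t) from the initial state and the sorted switch points."""
    flips = sum(1 for point in switches if point <= t)
    return x0 if flips % 2 == 0 else 1 - x0


rng = random.Random(2005)
n_rep, t, s = 20000, 0.10, 0.35  # two loci, 0.25 Morgans apart
ones = agree = 0
for _ in range(n_rep):
    x0, switches = simulate_half_sib_ibd(1.0, rng)
    xt, xs = ibd_at(t, x0, switches), ibd_at(s, x0, switches)
    ones += xt
    agree += (xt == xs)

freq_one = ones / n_rep        # should be near 1/2 (Bernoulli margin)
freq_agree = agree / n_rep     # should be near (1 + exp(-4 |t - s|)) / 2
expected_agree = 0.5 * (1.0 + math.exp(-4.0 * abs(s - t)))
print(freq_one, freq_agree, expected_agree)
```

The empirical frequencies should sit close to 1/2 for the marginal and close to the exponential agreement probability, which is what stationarity and the Markov property predict under these assumptions.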


• In the Haldane model, we have [formula missing in transcript], where [formula missing in transcript] is the recombination probability.

• For simplicity, we assume that IBD status is known at each marker (i.e. markers are completely genetically informative).
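For reference, the textbook form of Haldane's map function and the resulting half-sib IBD agreement probability are given below. This is standard material and not necessarily the display used on the slide; the symbols θ and Δ = |t − s| (in Morgans) are notation assumed here.

```latex
\theta(\Delta) = \tfrac{1}{2}\bigl(1 - e^{-2\Delta}\bigr),
\qquad
P\bigl(X(s) = X(t)\bigr) = \theta^{2} + (1 - \theta)^{2}
                         = \tfrac{1}{2}\bigl(1 + e^{-4\Delta}\bigr).
```

The second identity holds because the half-sib IBD status is unchanged exactly when both meioses recombine between the two loci or neither does.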


• The human genome consists of over 3 × 10^9 base pairs (in two copies) on 23 chromosomes. The average length of a chromosome is 140 cM.

• Total length of the female (autosomal) genome is 4296 cM.

• Total length of the male genome is 2851 cM.

• That is, there is 1 expected crossover per 105 Mb in males and per 88 Mb in females. Thus, on the human genome, 1 cM approximately equals 1 Mb.
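The conversion between genetic and physical distance is simple arithmetic, sketched below. It assumes a physical genome length of exactly 3 × 10^9 base pairs and uses the two map lengths quoted above; the variable names are illustrative.

```python
GENOME_MB = 3.0e9 / 1.0e6          # assumed physical length, in megabases
MAP_LENGTH_CM = {"female": 4296.0, "male": 2851.0}

mb_per_morgan = {}
for sex, length_cm in MAP_LENGTH_CM.items():
    morgans = length_cm / 100.0    # 1 Morgan = 100 cM = one expected crossover
    mb_per_morgan[sex] = GENOME_MB / morgans
    print(f"{sex}: one expected crossover per {mb_per_morgan[sex]:.0f} Mb")

# Sex-averaged map length gives the rough "1 cM is about 1 Mb" rule of thumb.
sex_averaged_cm = sum(MAP_LENGTH_CM.values()) / 2.0
mb_per_cm = GENOME_MB / sex_averaged_cm
print(f"sex-averaged: 1 cM is about {mb_per_cm:.2f} Mb")
```

With these inputs the male figure reproduces the 105 Mb quoted above, while the female figure comes out nearer 70 Mb than 88 Mb, so the slide may have used a different map length or genome size for that entry. Either way, 1 cM stays within a small factor of 1 Mb.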

Data

• From n sib-pairs we observe

- a sequence of iid phenotype pairs [formula missing in transcript], with continuous marginal distribution, and

- a sequence of iid IBD processes [formula missing in transcript].

[Figure with two panels: "IBD 1 at t" and "IBD 0 at t".]

Haseman-Elston

• In 1972, Haseman and Elston suggested testing whether there is a linear regression with negative slope between the squared phenotype difference of a sib pair and their IBD status at a marker.

• Soon, this became the standard tool for mapping of QTLs in human genetics.
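The regression idea can be illustrated on simulated data. The sketch below regresses the squared sib-pair phenotype difference on the IBD count at a fully informative marker. The data-generating model (normal phenotypes whose correlation rises linearly with IBD status, with the hypothetical parameters `base_corr` and `qtl_effect`) is an assumption made only for this illustration.

```python
import random
import statistics


def simulate_pair(rng, base_corr=0.2, qtl_effect=0.3):
    """One sib pair: IBD count x in {0, 1, 2} and a correlated phenotype pair."""
    x = rng.choice([0, 1, 1, 2])                 # P(x) = 1/4, 1/2, 1/4
    rho = base_corr + qtl_effect * x / 2.0       # hypothetical link, an assumption
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y1 = z1
    y2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * z2
    return x, y1, y2


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1972)
pairs = [simulate_pair(rng) for _ in range(5000)]
ibd = [x for x, _, _ in pairs]
squared_diff = [(y1 - y2) ** 2 for _, y1, y2 in pairs]
slope = least_squares_slope(ibd, squared_diff)
print(f"estimated slope: {slope:.3f}")
```

Under this toy model the expected squared difference is 2(1 − ρ(x)), so a linked locus shows up as a negative slope, and an unlinked marker would give a slope near zero.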

Variance Components Model

• The variance components model (Fulker and Cherny) essentially assumes that the joint distribution of the phenotypes is

– bivariate normal, conditionally on the IBD status x, with the same marginal distributions,

– and the correlation [formula missing in transcript].

Linkage Analysis

• The main question:

– Does higher IBD status mean stronger dependence between the two trait values?

In the variance components model this translates into the test of H0: [formula missing in transcript] against HA: [formula missing in transcript].

Test statistic

• The statistical test is based on the log-likelihood ratio statistic [formula missing in transcript],

• or (equivalently) on the efficient score statistic [formula missing in transcript],

• where [formula missing in transcript] is the score function, and [formula missing in transcript] is the appropriate entry of the Fisher information matrix, which needs to be estimated in practice.
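Since the slide formulas are not preserved, the following is only a sketch of how a score statistic of this kind can be computed, under assumptions that are not from the talk: phenotypes are standardized (mean 0, variance 1), the correlation depends on IBD status through the hypothetical link ρ(x) = ρ0 + γ (x − E X), and ρ0 is treated as known. The statistic is the score for γ at γ = 0, divided by the square root of its Fisher information.

```python
import math
import random


def correlation_score(y1, y2, rho):
    """d/d(rho) of the log-density of a standard bivariate normal pair."""
    one_minus = 1.0 - rho * rho
    return (rho / one_minus
            + (y1 * y2 * (1.0 + rho * rho) - rho * (y1 * y1 + y2 * y2))
            / (one_minus * one_minus))


def score_statistic(ibd, pairs, rho0, mean_ibd=1.0, var_ibd=0.5):
    """Standardized score for gamma = 0 in rho(x) = rho0 + gamma * (x - mean_ibd)."""
    score = sum((x - mean_ibd) * correlation_score(y1, y2, rho0)
                for x, (y1, y2) in zip(ibd, pairs))
    # Fisher information for rho with known means and variances, times n Var(X).
    information = len(ibd) * var_ibd * (1.0 + rho0 ** 2) / (1.0 - rho0 ** 2) ** 2
    return score / math.sqrt(information)


def simulate(n, rho0, gamma, rng):
    """Sib pairs with IBD counts in {0, 1, 2} and correlation rho0 + gamma * (x - 1)."""
    ibd, pairs = [], []
    for _ in range(n):
        x = rng.choice([0, 1, 1, 2])
        rho = rho0 + gamma * (x - 1)
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        ibd.append(x)
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return ibd, pairs


rng = random.Random(15)
z_null = score_statistic(*simulate(2000, 0.3, 0.0, rng), rho0=0.3)
z_linked = score_statistic(*simulate(2000, 0.3, 0.25, rng), rho0=0.3)
print(f"Z under the null: {z_null:.2f}, Z with a linked QTL: {z_linked:.2f}")
```

Under the null the statistic behaves approximately like a standard normal variable, while a linked locus pushes it far into the upper tail. In a real analysis ρ0, the means, and the variances would be estimated, which is the estimation step the slide refers to.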

[Figure: the statistic Z(t) plotted against position t, with the location t_max of its maximum marked.]

Significance in genome-wide scans

• If we have more than one marker we need to deal with the issue of multiple testing. The solution of this problem depends on the intermarker spacings and the sample size.

• One could use permutation tests or other simulation-based methods to obtain p-values.

• If the sample size is large, one can apply a nice asymptotic theory that determines significance thresholds from the analysis of extremes of certain Gaussian processes (see Lander and Botstein; Siegmund et al.).


• For an illustration, we assume that the markers are “dense”, that is, IBD status is measured continuously along the genome. It turns out that, under our assumptions and the null hypothesis, one can show that [formula missing in transcript], where [symbol missing in transcript] is an Ornstein-Uhlenbeck process with mean zero and covariance function [formula missing in transcript] over each chromosome.


• Now, approximate thresholds for a given significance level can be obtained by studying extremes of the Ornstein-Uhlenbeck process (cf. Leadbetter et al.) over a finite interval. Hence, we get [formula missing in transcript].

• For 23 human chromosomes with an average length of 140 cM and significance level 0.05, we get the threshold b = 4.08 (3.62 on the LOD scale).
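The quoted threshold can be approximately reproduced with a standard tail approximation for the maximum of an Ornstein-Uhlenbeck process. The sketch below assumes the approximation P(max Z > b) ≈ C (1 − Φ(b)) + β G b φ(b), with C chromosomes, total length G in Morgans, and decay rate β = 4 per Morgan (covariance exp(−4|t − s|)). Both the formula and the value of β are assumptions here, since the slide's own expression is not preserved.

```python
import math


def normal_tail(b):
    """Upper tail 1 - Phi(b) of the standard normal distribution."""
    return 0.5 * math.erfc(b / math.sqrt(2.0))


def normal_density(b):
    """Standard normal density phi(b)."""
    return math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)


def genomewide_p_value(b, n_chromosomes, total_morgans, beta):
    """Approximate P(max Z(t) > b) over all chromosomes (assumed formula)."""
    return (n_chromosomes * normal_tail(b)
            + beta * total_morgans * b * normal_density(b))


def threshold(alpha, n_chromosomes, total_morgans, beta):
    """Solve genomewide_p_value(b) = alpha by bisection on [2, 8]."""
    low, high = 2.0, 8.0   # the approximation is decreasing in b on this range
    for _ in range(100):
        mid = 0.5 * (low + high)
        if genomewide_p_value(mid, n_chromosomes, total_morgans, beta) > alpha:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


n_chromosomes = 23
total_morgans = n_chromosomes * 1.40      # 23 chromosomes of 140 cM each
b = threshold(0.05, n_chromosomes, total_morgans, beta=4.0)
lod = b * b / (2.0 * math.log(10.0))      # Z-score to LOD conversion
print(f"threshold b = {b:.2f}, LOD = {lod:.2f}")
```

Under these assumptions the result lands within about 0.01 of the values quoted above; small differences are expected because the exact approximation used in the talk is unknown.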

Disadvantages

• The normality assumption is frequently questionable.

• Correlation can be a very bad measure of dependence if this assumption does not hold.

Risch and Zhang (1995) write: "The majority of such pairs provide little power to detect linkage; only pairs that are concordant for high values, low values, or extremely discordant pairs (for example, one in the top 10 percent and other in the bottom 10 percent of the distribution) provide substantial power"

Copula

• The copula of a random pair (Y1, Y2) is the distribution function C of the random vector (F1(Y1), F2(Y2)), where we assume that the marginal distributions F1 and F2 of Y1 and Y2 are invertible. Hence the marginal distributions of the copula are both uniform on [0,1].

• It is well known that the distribution of a random pair splits into two marginal distributions and the copula. Also, the copula is invariant under continuous increasing transformations.
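The invariance property has a simple empirical counterpart: replacing each coordinate by its rank divided by n + 1 gives a sample from (approximately) the copula, and strictly increasing transformations leave the ranks unchanged. The sketch below illustrates this on simulated data; the helper name `pseudo_observations` and the choice of transformations are illustrative.

```python
import math
import random


def pseudo_observations(values):
    """Map a sample to ranks / (n + 1), an empirical stand-in for F(Y)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return [r / (n + 1) for r in ranks]


rng = random.Random(23)
y1 = [rng.gauss(0.0, 1.0) for _ in range(200)]
y2 = [0.6 * a + 0.8 * rng.gauss(0.0, 1.0) for a in y1]

copula_sample = list(zip(pseudo_observations(y1), pseudo_observations(y2)))

# Strictly increasing transformations change the margins but not the ranks.
transformed = list(zip(pseudo_observations([math.exp(a) for a in y1]),
                       pseudo_observations([b ** 3 for b in y2])))
same_copula_sample = (copula_sample == transformed)
print(same_copula_sample)
```

Any dependence measure computed from the pseudo-observations therefore depends on the data only through the copula, not through the marginal distributions.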

Linkage analysis rephrased

• The main question:

– Does higher IBD status mean stronger dependence between the two trait values?

could be rephrased as

– Does higher IBD status mean that the two trait values have a “more diagonalized” copula?

Note: marginal distributions do not change with IBD status.

Normal Copula

• The normal copula is the copula of a normally distributed random vector. Thus, if (Y1, Y2) is bivariate normal, then the random vector has the bivariate normal copula.

Since it depends only on the correlation, we denote it by [symbol missing in transcript].
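For reference, the bivariate normal copula has the following standard closed form. The notation C_ρ, Φ, and Φ_ρ is assumed here and may differ from the symbols used on the slide: Φ is the standard normal distribution function and Φ_ρ the distribution function of a standard bivariate normal vector with correlation ρ.

```latex
C_\rho(u, v) = \Phi_\rho\bigl(\Phi^{-1}(u),\, \Phi^{-1}(v)\bigr),
\qquad (u, v) \in (0, 1)^2 .
```

The means and variances of the normal vector drop out because the copula is invariant under increasing transformations of each coordinate.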

Bivariate Normal Copula

New Model

• Assume that the pair (Y1, Y2) has

– the same copula as in the variance components model, i.e. [formula missing in transcript], conditionally on the IBD status x,

– and the same (but arbitrary) continuous marginal distribution, i.e. F1 = F2.


• The model is not so new after all: equivalently, there is an h such that (h(Y1), h(Y2)) satisfies the assumption of the v.c. model.

• Suppose that h(Y1) has the standard normal distribution function; then [formula missing in transcript]. That is, [formula missing in transcript].


We can proceed in two ways:

a) we could guess (estimate) h, or

b) we could guess (estimate) F1.

The first method is already frequently applied in practice, while the second one is easier to justify using the empirical distribution function of the phenotypes.

To estimate F1 we may use data from a larger sample if available.

Transformation

• In practice we might have only the 2n phenotypes of the n sib-pairs to estimate the marginal distribution. So we could use [formula missing in transcript].

• Transformed phenotypes are [formula missing in transcript].
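One common way to carry out such a transformation is the normal scores transform, sketched below. It assumes that all 2n phenotypes are ranked together and that rank r is mapped to Φ^{-1}(r / (2n + 1)); the divisor 2n + 1 and the function names are assumptions, since the slide's estimator is not preserved.

```python
import math
import random
from statistics import NormalDist, fmean


def normal_scores(pairs):
    """Rank all 2n phenotypes together and map rank r to Phi^{-1}(r / (2n + 1))."""
    pooled = [y for pair in pairs for y in pair]
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    inverse_cdf = NormalDist().inv_cdf
    scores = [0.0] * len(pooled)
    for rank, index in enumerate(order, start=1):
        scores[index] = inverse_cdf(rank / (len(pooled) + 1))
    return [(scores[2 * i], scores[2 * i + 1]) for i in range(len(pairs))]


def pearson(pairs):
    """Ordinary sample correlation of a list of pairs."""
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    mean_a, mean_b = fmean(first), fmean(second)
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    saa = sum((a - mean_a) ** 2 for a in first)
    sbb = sum((b - mean_b) ** 2 for b in second)
    return sab / math.sqrt(saa * sbb)


# Heavy-tailed phenotypes built from a latent normal pair with correlation 0.5.
rng = random.Random(1997)
pairs = []
for _ in range(2000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    latent = (z1, 0.5 * z1 + math.sqrt(0.75) * z2)
    pairs.append(tuple(math.exp(2.0 * z) for z in latent))

r_raw = pearson(pairs)
r_scores = pearson(normal_scores(pairs))
print(f"raw correlation: {r_raw:.2f}, normal-scores correlation: {r_scores:.2f}")
```

On the raw heavy-tailed data the sample correlation is dominated by a few extreme values, while the correlation of the transformed phenotypes should recover a value near the latent 0.5.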


• If [condition missing in transcript], one can show the following

Theorem: [statement missing in transcript] as [limit missing in transcript].

• Observe that we essentially use the van der Waerden normal scores rank correlation coefficient to measure dependence between the traits.

• Klaassen and Wellner (1997) showed that this is an asymptotically efficient estimator of the correlation parameter in the bivariate normal copula model.
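For reference, the usual textbook form of the van der Waerden normal scores rank correlation coefficient is given below, where R_i and S_i denote the ranks of the i-th observation within the first and second coordinate samples. The notation is assumed here, and the version used in the talk (which pools the phenotypes to estimate a common margin) may differ in detail.

```latex
\hat{\rho}_n =
\frac{\sum_{i=1}^{n} \Phi^{-1}\!\left(\frac{R_i}{n+1}\right)
                     \Phi^{-1}\!\left(\frac{S_i}{n+1}\right)}
     {\sum_{i=1}^{n} \left[\Phi^{-1}\!\left(\frac{i}{n+1}\right)\right]^{2}} .
```

The coefficient depends on the data only through ranks, so it is unaffected by increasing transformations of either trait.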


• Hence, it is also an efficient estimator of the maximum correlation coefficient.

• For a pair of random variables Y1 and Y2, the maximum correlation coefficient is defined as sup corr(a(Y1), b(Y2)), where the supremum is taken over all real transformations a and b such that a(Y1) and b(Y2) have finite nonzero variance.

Simulation study

Application - Lp(a)

• Twin data on lipoprotein levels, collected in 4 populations in three countries (Australia, the Netherlands, Sweden).

• Analysis was performed using the variance components method and published by Beekman et al. (2003).

Ad hoc transformation

Lp(a) - chromosome 1

Lp(a) - chromosome 6

Discussion

• The normal copula based method has correct critical levels under the null hypothesis for any marginal distribution. Its power seems to be close to optimal.

• The method easily extends to general pedigrees, discrete data, multiple QTLs, etc.

• It is straightforward to implement in any existing software.

• Other families of copulas (Clayton, Gumbel, etc.) could be more suitable in certain applications.

Acknowledgments

• C. Klaassen (UvA, Eurandom)

• D. Boomsma (VUA)

• M. Beekman (LUMC)

• N. Martin (Australia)