Ronnie A. Sebro Haplotype reconstruction BMI 374 10/21/2004

Ronnie A. Sebro

Haplotype reconstruction

BMI 374

10/21/2004

Mendelian Laws of Inheritance

• Law of Segregation
  – Alleles separate when gametes are formed

• Law of Independent Assortment
  – Allele pairs separate independently during formation of gametes

• Mendelian Inheritance
  – Each offspring receives one allele from the male parent and the other from the female parent

Complex Diseases

• Polygenic or multifactorial diseases

• Run in families, but do not show Mendelian (monogenic) inheritance

• Complex interaction between disease-susceptibility genes and environmental factors

• Examples: asthma, schizophrenia

Finding disease genes

• Two common methods employed

• Pedigree analysis
  – Linkage analysis
  – Affected individuals inherit/share the same portion of the genome

• Case-control analysis
  – Association analysis
  – Affected individuals have different allele frequencies (higher or lower) than controls
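A minimal Python sketch of the association idea: compare allele counts in cases and controls with a 2x2 Pearson chi-square statistic. The counts and the function name `allele_chi_square` are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square (1 df) for a 2x2 table of allele counts.

    case_counts and control_counts are (allele_1, allele_2) counts.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one allele")
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: 100 case chromosomes and 100 control chromosomes.
cases = (60, 40)
controls = (40, 60)
case_freq = cases[0] / sum(cases)            # allele 1 frequency in cases
control_freq = controls[0] / sum(controls)   # allele 1 frequency in controls
statistic = allele_chi_square(cases, controls)
```

With these made-up counts the statistic is 8.0, above 3.84, the 5% critical value for one degree of freedom.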

Definitions

• Marker – a small segment of DNA with specific features

• Types of markers
  – SNPs: AATAA vs. AACAA
  – Microsatellites (STRs): -CAGCAGCAG- vs. -CAGCAGCAGCAGCAG-

• Locus - physical position of a marker on a chromosome

• Homozygous – when both alleles at a locus are the same

• Heterozygous – when the alleles at a locus are different

Definitions

• Haplotype
  – The set of alleles, one from each locus, that lie on the same chromosome

• Recombinant
  – An individual who inherited a haplotype that is not identical to either haplotype of the transmitting parent

• Phase
  – Information about which alleles are inherited from each parent

Example

[Figure: genotypes shown alongside the corresponding haplotypes]

Enumerating Haplotypes

• Consider an individual heterozygous at 3 loci, e.g. genotype (1 2) (1 2) (1 2)

• Several haplotype pairs are possible

• Haplotype space can potentially be huge

• For n SNPs there are 2^n possible haplotypes

• Haplotype pairs compatible with the example genotype: 2 1 2 / 1 2 1, 1 2 2 / 2 1 1, 1 1 2 / 2 2 1, 2 2 2 / 1 1 1
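The enumeration for the three-locus example can be sketched in Python. The function name `haplotype_pairs` and the tuple representation of genotypes are assumptions made for this illustration.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Return every unordered haplotype pair compatible with a genotype.

    genotype is a list of (allele, allele) tuples, one per locus.
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    for choice in product([0, 1], repeat=len(het)):
        h1 = [a for a, _ in genotype]
        h2 = [b for _, b in genotype]
        # Swap the two alleles at each heterozygous locus selected by `choice`.
        for locus, swap in zip(het, choice):
            if swap:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        # A frozenset makes (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(frozenset([tuple(h1), tuple(h2)]))
    return pairs


pairs = haplotype_pairs([(1, 2), (1, 2), (1, 2)])
distinct_haplotypes = {h for pair in pairs for h in pair}
```

For three heterozygous loci this gives 4 unordered pairs built from 2^3 = 8 distinct haplotypes.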

Finding disease genes

• Both tests (association-based tests and pedigree linkage analysis tests) tend to converge

• They converge on the same requirement: finding a haplotype/allele that is in tight association (LD) with the disease or is inherited by all affected individuals

• A putative disease locus is thereby identified

Why Haplotype?

• Single allele vs. Haplotype

• Advantages of using haplotypes
  – Improved power!

• Disadvantages of using haplotypes
  – Haplotypes aren't readily known

Problem

• Data generated from the sequencer in the following format (SNPs):

  1 1 0 0 1 1 1 1 1 1 2
  1 2 0 0 0 2 2 2 2 1 2
  1 3 1 2 1 1 2 1 2 1 2
  1 4 1 2 0 1 2 1 2 2 2

• Genotypes are known

• Haplotypes are unknown

• Pedigree
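The data rows on this slide can be read into records with a few lines of Python. This sketch rests on an assumption the slide does not state: that the columns are family ID, individual ID, father ID, mother ID and sex, followed by one pair of alleles per SNP. The `Person` type and `parse_row` helper are hypothetical names.

```python
from typing import NamedTuple


class Person(NamedTuple):
    family: str
    ident: str
    father: str   # "0" is assumed to mean "not in the pedigree"
    mother: str
    sex: str
    genotypes: list  # one (allele, allele) tuple per SNP


def parse_row(line):
    """Parse one whitespace-separated row under the assumed column layout."""
    fields = line.split()
    alleles = fields[5:]
    if len(alleles) % 2:
        raise ValueError("expected an even number of allele columns")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return Person(fields[0], fields[1], fields[2], fields[3], fields[4], genotypes)


ROWS = """\
1 1 0 0 1 1 1 1 1 1 2
1 2 0 0 0 2 2 2 2 1 2
1 3 1 2 1 1 2 1 2 1 2
1 4 1 2 0 1 2 1 2 2 2
"""

people = [parse_row(row) for row in ROWS.splitlines()]
founders = [p.ident for p in people if p.father == "0" and p.mother == "0"]
```

Under that reading, individuals 1 and 2 are the founders and individuals 3 and 4 are their offspring, each typed at three SNPs.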

Haplotyping

• Haplotyping can be done at molecular level – whole genome derived haplotypes (ref. Douglas et al., 2001)

• Algorithms preferred because
  – Lower cost of genotyping
  – Fast and accurate algorithms

Current Haplotyping Algorithms

• Algorithms used for unphased data

• Clark Algorithm (Andy Clark @ Penn State)

• E-M Algorithm (Stephens et al.)

• Bayesian Haplotype Inference (Jun Liu et al.)

Clark Algorithm

• Enumerate the haplotypes that exist with certainty in the sample (from individuals heterozygous at 0 or 1 loci)

• Resolve ambiguous genotypes using the haplotypes in the known list

• Solutions depend on the order in which the individuals with unresolved haplotype phase are entered

• The algorithm does not assume HW equilibrium
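A compact Python sketch of the steps on this slide, not a reference implementation. The function names, the tuple encoding of genotypes and the toy data are assumptions made for illustration.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, else None."""
    other = []
    for (a, b), h in zip(genotype, hap):
        if h == a:
            other.append(b)
        elif h == b:
            other.append(a)
        else:
            return None  # `hap` is not compatible with this genotype
    return tuple(other)


def clark(genotypes):
    known = []      # haplotypes known with certainty, in discovery order
    resolved = {}   # individual index -> (haplotype 1, haplotype 2)
    # Step 1: individuals heterozygous at 0 or 1 loci have unambiguous phase.
    for i, g in enumerate(genotypes):
        if sum(a != b for a, b in g) <= 1:
            h1 = tuple(a for a, _ in g)
            h2 = tuple(b for _, b in g)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: resolve ambiguous individuals against the known list until stuck.
    changed = True
    while changed:
        changed = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[i] = (h, other)
                    if other not in known:
                        known.append(other)
                    changed = True
                    break
    return resolved, known


genotypes = [
    ((1, 1), (1, 1), (1, 1)),   # homozygous: gives haplotype 1 1 1
    ((1, 2), (1, 2), (1, 1)),   # ambiguous at two loci
    ((1, 2), (2, 2), (1, 2)),   # ambiguous at two loci
]
resolved, known = clark(genotypes)
```

Because each newly inferred haplotype joins the known list, the third individual is resolved only through the haplotype inferred for the second, which is why the result can depend on input order.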

E-M Algorithm

• Estimates population haplotype probabilities by maximum likelihood: finds the values of the haplotype probabilities that maximize the probability of the observed data

• The maximum likelihood estimates of the haplotype probabilities are obtained by maximizing the likelihood

• This is a missing-data problem

• Assumes HW equilibrium

• Software EH (Xie and Ott, 1993) and EH+ (Zhao, Curtis and Sham)

$L = \prod_{i} \sum_{(h_1, h_2) \in H_i} P(h_1)\, P(h_2)$
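A small Python sketch of an EM iteration for haplotype frequencies under HW equilibrium. It is not the EH or EH+ code: the function names, the uniform starting frequencies and the fixed iteration count are assumptions made for illustration. The E-step weights each haplotype pair compatible with a genotype by the product of its current haplotype frequencies, and the M-step re-estimates the frequencies from the expected counts.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) consistent with an unphased genotype."""
    options = [[(a, b)] if a == b else [(a, b), (b, a)] for a, b in genotype]
    return [
        (tuple(x[0] for x in combo), tuple(x[1] for x in combo))
        for combo in product(*options)
    ]


def em_haplotype_freqs(genotypes, n_iter=50):
    pair_sets = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_sets for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}   # uniform start (an assumption)
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: expected haplotype counts given the current frequencies.
        for pairs in pair_sets:
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: each individual carries two haplotypes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


# Toy data: three 1 1 / 1 1 homozygotes and one double heterozygote.
genotypes = [((1, 1), (1, 1))] * 3 + [((1, 2), (1, 2))]
freq = em_haplotype_freqs(genotypes)
```

In this toy data set the homozygotes pull the double heterozygote toward the 1 1 / 2 2 phase, so the estimate for haplotype 1 1 approaches 7/8.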

Bayesian Algorithm

• A Dirichlet prior distribution is used for the haplotype frequencies

• Uses a Gibbs sampler: enables handling of many SNP loci

• Implemented in program HAPLOTYPER

Errata in data

• Genotyping Errors – (quite common esp. with SNPs)

• Missing data
  – MCAR
  – MAR
  – Non-ignorable missingness

• Marker order errors

Overview

• Discuss a paper dealing with estimation of haplotypes in pedigrees (i.e. with some information about phase)

• Minimum-Recombinant Haplotyping in Pedigrees (Qian & Beckmann)

• Useful for the HAPMAP project!

• Useful also for association analyses with the Transmission Disequilibrium Test (TDT)

Paper 1

• Minimum-Recombinant Haplotyping in Pedigrees
  – Notation
  – Methods (Algorithm)
  – Results
  – Shortcomings of algorithm

Recombination Principles

• Minimum-Recombination Principle
  – In nature, recombination is a rare event
  – The most probable haplotypes are those that minimize the total number of recombinations needed in the pedigree

• Double Recombinants
  – Naturally these are even rarer events, especially over such short intervals (10 cM)
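To make the principle concrete, here is a Python sketch that counts the smallest number of crossovers needed to explain one transmitted haplotype, given the transmitting parent's two phased haplotypes. The function name and the example haplotypes are hypothetical.

```python
def count_recombinations(hap_a, hap_b, transmitted):
    """Smallest number of crossovers consistent with one transmission.

    hap_a and hap_b are the parent's two phased haplotypes; transmitted is
    the haplotype the child received from that parent.
    """
    sources = []
    for a, b, t in zip(hap_a, hap_b, transmitted):
        if t not in (a, b):
            raise ValueError("transmitted allele is not carried by the parent")
        if a == b:
            continue  # homozygous locus: cannot tell which haplotype it came from
        sources.append(0 if t == a else 1)
    # Each change of source between consecutive informative loci is one crossover.
    return sum(prev != cur for prev, cur in zip(sources, sources[1:]))


parent_a = (1, 1, 2, 1, 2)
parent_b = (2, 1, 1, 2, 1)
one_crossover = count_recombinations(parent_a, parent_b, (1, 1, 2, 2, 1))
no_crossover = count_recombinations(parent_a, parent_b, (2, 1, 1, 2, 1))
```

Summing this count over every parent-to-child transmission gives the quantity that the minimum-recombination principle tries to make as small as possible.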

Notation

• Consider a pedigree of J family members and a set of L linked marker loci

• Individual – any family member

• Parent – a family member with at least one child

• Founder – a parent whose own parents are not in the pedigree

• Offspring – a family member with at least one parent

Notation

• An individual is defined as "genotyped" at locus l if either:
  – The genotype at locus l is known (from DNA), or
  – The genotype can be determined from first-degree relatives

• Ungenotyped parent (other parent genotyped)
  – Informative if both haplotypes are transmitted
  – Partially informative if only one haplotype is transmitted

• Genotyped offspring
  – Informative if at least one parent is genotyped

Notation

• Parental source (PS) – whether an allele is maternally or paternally derived

• Grandparental source (GS) – the parental source of each parental allele

Notation

• For a nuclear family:

• $a_l, b_l$ denote the alleles of parent 1

• $c_l, d_l$ denote the alleles of parent 2

• $e_{jl}, f_{jl}$ denote the alleles of offspring j

• $A_l, B_l$ denote the paternal and maternal alleles of parent 1

• $C_l, D_l$ denote the paternal and maternal alleles of parent 2

• $E_{jl}, F_{jl}$ denote the paternal and maternal alleles of offspring j

• $s_{1,j,l}, s_{2,j,l}$ denote the GS of the paternal and maternal alleles of individual j

• $a_{\min}, a_{\max}$ denote the minimum and maximum allele values, respectively

Notation

• $(ab)$ denotes a PS-unknown genotype with alleles a and b

• $AB$ denotes a PS-known haplotype with paternal allele A and maternal allele B

• (ab) = (cd) denotes that genotypes (ab) and (cd) are equal

• (ab) ≠ (cd) denotes that genotypes (ab) and (cd) are not equal

• $c \in (ab)$ denotes that allele c is a constituent allele of genotype (ab)

• $c \notin (ab)$ denotes that allele c is not a constituent allele of genotype (ab)

Flexible Locus

• Type 1
  – The trio are all heterozygotes, and at least one parent and the offspring are not haplotyped

Flexible Locus

• Type 2
  – Two alternative haplotype assignments at locus l in a founder result in an equal number of recombinant offspring

Flexible Locus

• Type 3
  – Two alternative haplotype assignments at locus l in an offspring result in an equal number of recombinants

Rules

• Divide the pedigree into nuclear trios

• Apply the rules to each trio until all individuals are haplotyped, or no further inference is possible

• Rule 1: Impute the missing genotype at each unambiguous locus in each parent, conditional on spouse and child genotypes

• Rule 2: Assign a haplotype at each unambiguous locus in each offspring, conditional on parental genotypes

• Rule 3: Assign haplotypes at each unambiguous locus in each founder, conditional on haplotypes in offspring and the criterion of minimum recombinants in each nuclear family

Rules

• Rule 4: Assign haplotypes at each unambiguous locus in each offspring, conditional on haplotypes in parents and the criterion of minimum recombinants in each trio

• Rule 5: Impute a missing genotype at each unambiguous locus in each parent, conditional on haplotypes in offspring and the criterion of minimum recombinants in each nuclear family

• Rule 6: Locate a locus with at least one individual in a nuclear family that is flexible at this locus, enumerate the haplotype configuration into multiple configurations, and retain all configurations with the minimum number of recombinants

Implementation

• Raw genotype data

Implementation

• Rule 1: Impute the missing genotype at each unambiguous locus in each parent, conditional on genotypes in spouse and offspring
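One simplified reading of Rule 1 at a single locus, sketched in Python. This is an illustration under stated assumptions, not the paper's exact procedure: a child's allele is treated as forced to come from the missing parent when it is the only allele that parent could have transmitted, and the genotype is imputed only when exactly two distinct alleles are forced. The function name is hypothetical.

```python
def impute_parent_genotype(spouse, children):
    """Impute an ungenotyped parent's genotype at one locus, or return None.

    spouse is the genotyped parent's (allele, allele); children is a list of
    offspring genotypes at the same locus.
    """
    forced = set()
    for a, b in children:
        candidates = set()
        if b in spouse:
            candidates.add(a)   # b could come from the spouse, a from the missing parent
        if a in spouse:
            candidates.add(b)   # a could come from the spouse, b from the missing parent
        if len(candidates) == 1:
            forced |= candidates
    # Two distinct forced alleles pin down the genotype; anything else is ambiguous.
    return tuple(sorted(forced)) if len(forced) == 2 else None


imputed = impute_parent_genotype((1, 1), [(1, 2), (1, 3)])
ambiguous = impute_parent_genotype((1, 2), [(1, 2)])
```

In the first call the spouse can only contribute allele 1, so alleles 2 and 3 must both come from the missing parent. In the second call either allele of the child could have come from either parent, so nothing is imputed.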

Implementation

• Rule 2: Assign a haplotype at each unambiguous locus in each offspring, conditional on genotypes in parents in each parent-offspring trio
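A single-locus Python sketch of the idea behind Rule 2: list the ways the offspring's two alleles can be split into a paternal and a maternal allele given the parents' genotypes. One possible split means the parental source is unambiguous. Two possible splits, as in a trio of identical heterozygotes, leave the locus unresolved from genotypes alone. The function name is hypothetical.

```python
def parental_source_assignments(father, mother, child):
    """Return the possible (paternal allele, maternal allele) splits at one locus."""
    a, b = child
    options = set()
    for paternal, maternal in ((a, b), (b, a)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    return sorted(options)


unambiguous = parental_source_assignments((1, 2), (1, 3), (1, 2))
unresolved = parental_source_assignments((1, 2), (1, 2), (1, 2))
inconsistent = parental_source_assignments((1, 1), (1, 1), (2, 2))
```

An empty result signals a Mendelian inconsistency, which in practice would point to a genotyping or pedigree error.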

Implementation

• Rule 3: Assign haplotypes at each unambiguous locus in each founder, conditional on haplotypes in offspring and the criterion of minimum recombinants in each nuclear family

Implementation

• Rule 4: Assign haplotypes at each unambiguous locus in each offspring, conditional on haplotypes in parents and the criterion of minimum recombinants in each trio

Implementation

• Rule 5: Impute a missing genotype at each unambiguous locus in each parent, conditional on the haplotypes in offspring and the criterion of minimum recombinants in each nuclear family

Implementation

• Second application of rules 2 and 3

Implementation

• Rule 6: Locate a locus with at least one individual in a nuclear family that is flexible at this locus, and enumerate the haplotype configuration into multiple configurations with alternative haplotype assignments at each flexible locus in these individuals

• Retain all configurations with the minimum number of recombinants

• Reapplication of rule 3
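The enumerate-and-retain step can be sketched in Python for the simplest case: one parent with unknown phase and children whose haplotypes from that parent are already known. This is a brute-force illustration, not the paper's algorithm, and the function names and data are hypothetical.

```python
from itertools import product


def recombinations(hap_a, hap_b, transmitted):
    """Crossovers implied by one transmission (alleles assumed consistent)."""
    sources = [0 if t == a else 1
               for a, b, t in zip(hap_a, hap_b, transmitted) if a != b]
    return sum(prev != cur for prev, cur in zip(sources, sources[1:]))


def min_recombinant_phases(genotype, transmitted_haps):
    """Return every phase of `genotype` with the fewest total recombinants."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    best, best_cost = [], None
    # Fix the first heterozygous locus so mirror-image phases are not repeated.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        hap_a = [a for a, _ in genotype]
        hap_b = [b for _, b in genotype]
        for locus, flip in zip(het[1:], flips):
            if flip:
                hap_a[locus], hap_b[locus] = hap_b[locus], hap_a[locus]
        cost = sum(recombinations(hap_a, hap_b, t) for t in transmitted_haps)
        config = (tuple(hap_a), tuple(hap_b))
        if best_cost is None or cost < best_cost:
            best, best_cost = [config], cost
        elif cost == best_cost:
            best.append(config)   # ties are all retained
    return best, best_cost


parent = [(1, 2), (1, 2), (1, 2)]
unique, unique_cost = min_recombinant_phases(
    parent, [(1, 1, 1), (1, 1, 1), (2, 2, 2)])
tied, tied_cost = min_recombinant_phases(parent, [(1, 1, 1), (1, 1, 2)])
```

In the second call two phases explain the children with one recombinant each, so both are kept, mirroring the situation the slides call a flexible locus.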

Results

• A pedigree with episodic ataxia
  – 29 total individuals
  – Genotyped at 9 polymorphic markers
  – 2 individuals not genotyped

• Simulation study

• Looped marriage structure in a pedigree with ataxia telangiectasia

Results

• High degree of concordance with the maximum-likelihood method

• Identical haplotype configuration obtained with GENEHUNTER (ML based) in >99% of pedigrees analyzed

Simulation Results

Data set (FSize,Loci,Rec,T)   Both   Rule-based alone   Fewer recombinants
(15,10,4,STR)                 90     2                  8
(15,25,4,STR)                 94     0                  6
(15,50,4,STR)                 91     7                  2
(29,10,4,STR)                 86     1                  13
(29,25,4,STR)                 91     1                  8
(29,50,4,STR)                 89     0                  11
(15,10,4,SNP)                 82     9                  9
(29,10,4,SNP)                 76     8                  16
(15,10,0,SNP)                 100    0                  0
(29,10,0,SNP)                 97     3                  0
(17,10,0,Loop)                98     2                  0

Genotype Errors

• Impact of genotype errors investigated

• Generated genotype data on 1000 pedigrees, each pedigree containing one incorrect allele in a random individual at a random marker

• Mean number of recombinants increased from 5 to 6.2 (1.2)

• 44% of these additional recombinants were double recombinants

• All four correct minimum-recombinant haplotype configurations (MRHCs) were reconstructed in 84% of pedigrees

Marker errors

• The consequence of incorrect marker order on imputing haplotypes was investigated

• Marker loci 2-7 (of the 9 loci involved in the EA study) were permuted (6! - 1 = 719 ways)

• Of the 719 orderings
  – None produced MRHCs with fewer than 5 recombinants
  – Only 5% had the same number of recombinants as the correct ordering
  – The chance of recovering all four MRHCs was 20% and 0% when 2 and 6 marker loci, respectively, were incorrectly ordered
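The count of 719 alternative orderings can be checked with a short Python sketch that permutes loci 2-7 among nine loci and drops the correct order. The function name and the 1-to-9 locus labels are assumptions made for illustration.

```python
from itertools import permutations


def alternative_orders(loci, start=1, stop=7):
    """Yield every ordering that permutes loci[start:stop], except the original."""
    head, middle, tail = loci[:start], loci[start:stop], loci[stop:]
    for perm in permutations(middle):
        if list(perm) != middle:
            yield head + list(perm) + tail


loci = list(range(1, 10))           # loci labelled 1..9
orders = list(alternative_orders(loci))
n_orders = len(orders)              # 6! - 1
```

Each generated ordering keeps loci 1, 8 and 9 in place, matching the design described on the slide.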

Conclusions

• Both genotype errors and incorrect marker order can produce additional recombinants in reconstructing haplotypes

• Sensitivity analyses suggest that incorrect marker orderings may have a larger impact than genotyping errors

Conclusions

• This haplotyping method is applicable to both STR and SNP data

• Total computational requirement due to enumeration in a pedigree with J family members and L loci is on the order of $O(J^2 L^3)$

• Computational requirements for SNP data are 3-10 times larger than for STRs (more flexible loci)

Shortcomings

• A genotyped individual with neither genotyped parents nor genotyped offspring cannot be analyzed by this algorithm

• The same limitation applies even if multiple siblings and other relatives are genotyped

• Likelihood-based methods are able to assign haplotypes to individuals who are uninformative under this rule-based method