Ronnie A. Sebro: Haplotype Reconstruction (BMI 374, 10/21/2004)
Post on 19-Dec-2015
Mendelian Laws of Inheritance
• Law of Segregation
– Alleles separate when gametes are formed
• Law of Independent Assortment
– Allele pairs separate independently during formation of gametes
• Mendelian Inheritance
– Each offspring receives one allele from the male parent and the other from the female parent
Complex Diseases
• Polygenic or multifactorial diseases
• Run in families, but do not show Mendelian (monogenic) inheritance
• Complex interaction between disease susceptibility genes, and environmental factors
• Examples: asthma, schizophrenia
Finding disease genes
• Two common methods employed
• Pedigree analysis
– Linkage analysis
– Affected individuals inherit/share the same portion of the genome
• Case-control analysis
– Association analysis
– Affected individuals have different allele frequencies (higher or lower) than controls
Definitions
• Marker – small segment of DNA with specific features
• Types of markers
– SNPs: AATAA vs. AACAA
– Microsatellites (STRs): -CAGCAGCAG- vs. -CAGCAGCAGCAGCAG-
• Locus – physical position of a marker on a chromosome
• Homozygous – when both alleles at a locus are the same
• Heterozygous – when the alleles at a locus are different
Definitions
• Haplotype – the set of alleles, one from each locus, that lie on the same chromosome
• Recombinant – an individual who inherited a haplotype that is not identical to one of the haplotypes his/her parent inherited
• Phase – information about which alleles are inherited from each parent
Enumerating Haplotypes
• Consider an individual heterozygous at 3 loci, e.g. genotypes (1 2), (1 2), (1 2)
• Several possible haplotypes
• Haplotype space can potentially be huge
• For n SNPs – up to 2^n possible haplotypes
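• The 2^3 = 8 possible haplotypes for this individual, grouped into 4 complementary pairs:
– 2 1 2 / 1 2 1
– 1 2 2 / 2 1 1
– 1 1 2 / 2 2 1
– 2 2 2 / 1 1 1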
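The enumeration above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `haplotype_pairs` and the representation of a genotype as a list of allele pairs are assumptions made for the example, not part of the original slides.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Return all unordered haplotype pairs compatible with a genotype.

    genotype: list of (allele, allele) tuples, one per locus (hypothetical layout).
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first chromosome.
    for choice in product(*[((a, b), (b, a)) for a, b in genotype]):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        pairs.add(frozenset((h1, h2)))  # unordered: (h1, h2) == (h2, h1)
    return pairs

genotype = [(1, 2), (1, 2), (1, 2)]        # heterozygous at 3 loci
pairs = haplotype_pairs(genotype)
distinct = {h for pair in pairs for h in pair}
```

For the triple heterozygote, `distinct` holds 2^3 = 8 haplotypes and `pairs` holds the 4 complementary pairs; each additional heterozygous locus doubles both counts.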
Finding disease genes
• Both kinds of test (association-based tests and pedigree linkage analysis) ultimately converge on the same requirement
• Each needs to find a haplotype/allele that is in tight association (linkage disequilibrium, LD) with the disease or is inherited by all affected individuals
• A putative disease locus is thereby identified
Why Haplotype?
• Single allele vs. Haplotype
• Advantages of using haplotypes
– Improved power!
• Disadvantages of using haplotypes
– Haplotypes aren't readily known
Problem
• Data generated from sequencer in the following format (SNPs)
  1 1 0 0 1 1 1 1 1 1 2
  1 2 0 0 0 2 2 2 2 1 2
  1 3 1 2 1 1 2 1 2 1 2
  1 4 1 2 0 1 2 1 2 2 2
• Genotypes are known
• Haplotypes are unknown
• Pedigree
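As a rough illustration of how such rows might be read, the sketch below parses the four example lines. The column layout is an assumption (the slide does not label the columns): family ID, individual ID, father ID, mother ID, sex, followed by two alleles per marker. The `Person` type and `parse_row` function are hypothetical names introduced for the example.

```python
from typing import NamedTuple

class Person(NamedTuple):
    family: str
    ident: str
    father: str      # "0" is assumed to mean "parent not in the pedigree"
    mother: str
    sex: str
    genotypes: list  # one unordered (allele, allele) tuple per marker

def parse_row(line):
    """Parse one whitespace-separated pedigree row (assumed column layout)."""
    fields = line.split()
    family, ident, father, mother, sex = fields[:5]
    alleles = [int(x) for x in fields[5:]]
    if len(alleles) % 2:
        raise ValueError("expected an even number of allele columns")
    genotypes = [tuple(sorted(alleles[i:i + 2])) for i in range(0, len(alleles), 2)]
    return Person(family, ident, father, mother, sex, genotypes)

ROWS = """\
1 1 0 0 1 1 1 1 1 1 2
1 2 0 0 0 2 2 2 2 1 2
1 3 1 2 1 1 2 1 2 1 2
1 4 1 2 0 1 2 1 2 2 2"""

people = [parse_row(row) for row in ROWS.splitlines()]
founders = [p for p in people if p.father == "0" and p.mother == "0"]
```

Under this reading, individuals 1 and 2 are founders and individuals 3 and 4 are their offspring; the genotypes are stored unordered because phase is exactly what is unknown.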
Haplotyping
• Haplotyping can be done at the molecular level – whole-genome-derived haplotypes (ref. Douglas et al., 2001)
• Algorithms preferred because of
– Lower cost of genotyping
– Fast and accurate algorithms
Current Haplotyping Algorithms
• Algorithms used for unphased data
• Clark Algorithm (Andy Clark @ Penn State)
• E-M Algorithm (Stephens et al.)
• Bayesian Haplotype Inference (Jun Liu et al.)
Clark Algorithm
• Enumerates haplotypes which exist with certainty in the sample (individuals heterozygous at 0 or 1 loci)
• Resolves ambiguous genotypes using haplotypes already in the known list
• Solutions depend on the order in which the individuals with unresolved haplotype phase are entered
• The algorithm does not assume Hardy-Weinberg (HW) equilibrium
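The steps above can be sketched in Python as follows. This is a simplified illustration rather than Clark's published implementation: it assumes biallelic SNPs, codes each genotype as the number of copies of allele 1 at each locus (0, 1 or 2), and the function name `clark` is hypothetical.

```python
def clark(genotypes):
    """Simplified Clark-style phasing.

    genotypes: list of tuples; each entry is the count of allele 1 (0, 1 or 2),
    so a value of 1 marks a heterozygous locus.
    Returns (resolved, known): resolved maps individual index -> (hap1, hap2).
    """
    known = []      # haplotypes seen so far, in discovery order
    resolved = {}
    # Step 1: individuals heterozygous at 0 or 1 loci have unambiguous haplotypes.
    for i, g in enumerate(genotypes):
        if sum(x == 1 for x in g) <= 1:
            h1 = tuple(1 if x == 2 else 0 for x in g)
            h2 = tuple(x - a for x, a in zip(g, h1))
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: explain each ambiguous genotype as known haplotype + complement.
    changed = True
    while changed:
        changed = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                comp = tuple(x - a for x, a in zip(g, h))
                if all(c in (0, 1) for c in comp):   # complement is a valid haplotype
                    resolved[i] = (h, comp)
                    if comp not in known:
                        known.append(comp)
                    changed = True
                    break
    return resolved, known

genotypes = [(0, 0, 0), (1, 1, 0), (2, 2, 1), (1, 2, 1)]
resolved, known = clark(genotypes)
```

Because the first matching known haplotype wins, reordering `genotypes` can change the answer, which is the order dependence noted above. Genotypes that match no known haplotype simply stay unresolved.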
E-M Algorithm
• Estimates population haplotype probabilities by maximum likelihood: finds the values of the haplotype probabilities that maximize the probability of the observed data
• This is a missing-data problem
• Assumption of HW equilibrium
• Software: EH (Xie and Ott, 1993) and EH+ (Zhao, Curtis and Sham)
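• Likelihood: $L = \prod_{i} \sum_{(h_1, h_2) \in H_i} P(h_1)\,P(h_2)$, where $H_i$ is the set of haplotype pairs compatible with the genotype of individual $i$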
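A minimal sketch of the E-M iteration for haplotype frequencies is given below. It is not the EH or EH+ code: it assumes biallelic SNPs coded as counts of allele 1, unrelated individuals, and HW equilibrium, and the function names are hypothetical. The E-step weights each compatible haplotype pair by the product of current frequencies; the M-step re-estimates frequencies from the expected haplotype counts.

```python
from itertools import product
from collections import defaultdict

def compatible_pairs(g):
    """Ordered haplotype pairs (h1, h2) with h1 + h2 == g (g = allele-1 counts)."""
    options = [((0, 0),) if x == 0 else ((1, 1),) if x == 2 else ((0, 1), (1, 0))
               for x in g]
    return [(tuple(c[0] for c in combo), tuple(c[1] for c in combo))
            for combo in product(*options)]

def em_haplotype_freqs(genotypes, n_iter=50):
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    freq = {h: 1.0 / len(haps) for h in haps}      # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # E-step: posterior weight of each pair under HW equilibrium.
            weights = [freq[h1] * freq[h2] for h1, h2 in ps]
            total = sum(weights)
            for (h1, h2), w in zip(ps, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# 3 individuals homozygous 0/0 at both loci, 3 homozygous 1/1, 2 double heterozygotes
genotypes = [(0, 0)] * 3 + [(2, 2)] * 3 + [(1, 1)] * 2
freqs = em_haplotype_freqs(genotypes)
```

In this toy data set the double heterozygotes are ambiguous, but the homozygotes make haplotypes 00 and 11 common, so the estimates move toward a frequency of 0.5 for each of those two and toward 0 for 01 and 10.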
Bayesian Algorithm
• A Dirichlet prior distribution is used for the haplotype frequencies
• Uses a Gibbs sampler: enables handling of many SNP loci
• Implemented in the program HAPLOTYPER
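To make the idea concrete, here is a toy Gibbs sampler with a symmetric Dirichlet prior. It is a sketch under stated assumptions, not the HAPLOTYPER implementation: biallelic SNPs coded as counts of allele 1, unrelated individuals, a hypothetical prior parameter `beta`, and brute-force enumeration of compatible pairs (so it only suits a handful of loci).

```python
import random
from itertools import product

def compatible_pairs(g):
    """Ordered haplotype pairs (h1, h2) with h1 + h2 == g (g = allele-1 counts)."""
    options = [((0, 0),) if x == 0 else ((1, 1),) if x == 2 else ((0, 1), (1, 0))
               for x in g]
    return [(tuple(c[0] for c in combo), tuple(c[1] for c in combo))
            for combo in product(*options)]

def gibbs_phase(genotypes, n_sweeps=2000, burn_in=200, beta=1.0, seed=0):
    """Return, per individual, the posterior share of each unordered haplotype pair."""
    rng = random.Random(seed)
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    state = [rng.choice(ps) for ps in pairs]          # random initial phasing
    tally = [dict.fromkeys(map(frozenset, ps), 0) for ps in pairs]
    for sweep in range(n_sweeps):
        # Draw frequencies from Dirichlet(beta + counts) via normalised gamma draws.
        counts = {h: beta for h in haps}
        for h1, h2 in state:
            counts[h1] += 1
            counts[h2] += 1
        draws = {h: rng.gammavariate(counts[h], 1.0) for h in haps}
        total = sum(draws.values())
        theta = {h: d / total for h, d in draws.items()}
        # Draw each individual's haplotype pair given the frequencies.
        for i, ps in enumerate(pairs):
            weights = [theta[h1] * theta[h2] for h1, h2 in ps]
            state[i] = rng.choices(ps, weights=weights, k=1)[0]
            if sweep >= burn_in:
                tally[i][frozenset(state[i])] += 1
    kept = n_sweeps - burn_in
    return [{pair: n / kept for pair, n in t.items()} for t in tally]

genotypes = [(0, 0)] * 3 + [(2, 2)] * 3 + [(1, 1)] * 2
posterior = gibbs_phase(genotypes)
p_00_11 = posterior[6][frozenset({(0, 0), (1, 1)})]
```

For the double heterozygotes in this toy data set, most of the posterior mass should fall on the 00/11 phasing, since those haplotypes are supported by the homozygous individuals.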
Errata in data
• Genotyping errors (quite common, esp. with SNPs)
• Missing data
– MCAR (missing completely at random)
– MAR (missing at random)
– Non-ignorable missingness
• Marker order errors
Overview
• Discuss a paper dealing with estimation of haplotypes in pedigrees (i.e. some information about phase is available)
• Minimum-Recombinant Haplotyping in Pedigrees (Qian & Beckmann)
• Useful for the HapMap project!
• Also useful for association analyses with the Transmission Disequilibrium Test (TDT)
Paper 1
• Minimum-Recombinant Haplotyping in Pedigrees
– Notation
– Methods (Algorithm)
– Results
– Shortcomings of algorithm
Recombination Principles
• Minimum-Recombination Principle
– In nature, recombination is a rare event
– The most probable haplotypes are those that minimize the total number of recombinations needed in the pedigree
• Double Recombinants
– Naturally these are even rarer events, especially over such short intervals (10 cM)
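The quantity being minimized can be illustrated with a short sketch that counts the fewest crossovers needed for a parent to transmit a given haplotype. The function name `min_recombinations` and the tuple representation are assumptions for the example; only loci where the parent is heterozygous reveal which parental chromosome an allele came from.

```python
def min_recombinations(parent_hap1, parent_hap2, child_hap):
    """Fewest crossovers for a parent with two phased haplotypes to transmit child_hap."""
    sources = []
    for a, b, c in zip(parent_hap1, parent_hap2, child_hap):
        if c not in (a, b):
            raise ValueError("transmitted allele not present in parent")
        if a != b:                        # informative locus: source is determined
            sources.append(1 if c == a else 2)
    # Each change of source between consecutive informative loci is one crossover.
    return sum(s != t for s, t in zip(sources, sources[1:]))

single = min_recombinations((1, 1, 1, 1), (2, 2, 2, 2), (1, 1, 2, 2))
double = min_recombinations((1, 1, 1), (2, 2, 2), (1, 2, 1))
masked = min_recombinations((1, 1, 1), (2, 1, 2), (1, 1, 1))
```

Here `single` is 1, `double` is 2 (a double recombinant around the middle locus), and `masked` is 0 because the parent is homozygous at the middle locus, so no switch needs to be assumed there.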
Notation
• Consider a pedigree of J family members and a set of L linked marker loci
• Individual – any family member
• Parent – a family member with at least 1 child
• Founder – a parent whose own parents are not in the pedigree
• Offspring – a family member with at least one parent in the pedigree
Notation
• Define an individual as "genotyped" at locus l iff one of the following holds:
– The genotype at locus l is known (from DNA)
– The genotype can be determined from first-degree relatives
• Ungenotyped parent (other parent genotyped)
– Informative if both haplotypes are transmitted
– Partially informative if only one haplotype is transmitted
• Genotyped offspring
– Informative if at least one parent is genotyped
Notation
• Parental source (PS) – whether an allele is maternally or paternally derived
• Grandparental source (GS) – the parental source of each parental allele
Notation
• For a nuclear family, at locus l:
– $(a_l b_l)$ denote the alleles of parent 1
– $(c_l d_l)$ denote the alleles of parent 2
– $(e_{jl} f_{jl})$ denote the alleles of offspring j
– $A_l B_l$ denote the paternal and maternal alleles of parent 1
– $C_l D_l$ denote the paternal and maternal alleles of parent 2
– $E_{jl} F_{jl}$ denote the paternal and maternal alleles of offspring j
– $s_{j,1,l}, s_{j,2,l}$ denote the GS of the paternal and maternal alleles of individual j
– $a_{\min}, a_{\max}$ denote the minimum and maximum allele values, respectively
Notation
• (ab) denotes a PS-unknown genotype with alleles a and b
• AB denotes a PS-known haplotype with paternal allele A and maternal allele B
• (ab) = (cd) denotes that genotypes (ab) and (cd) are equal
• (ab) ≠ (cd) denotes that genotypes (ab) and (cd) are not equal
• c ∈ (ab) denotes that allele c is a constituent allele of genotype (ab)
• c ∉ (ab) denotes that allele c is not a constituent allele of genotype (ab)
Flexible Locus
• Type 1
– All three members of a trio are heterozygotes, and at least 1 parent and the offspring are not haplotyped
• Type 2
– Two alternative haplotype assignments at locus l in a founder result in an equal number of recombinant offspring
• Type 3
– Two alternative haplotype assignments at locus l in an offspring result in an equal number of recombinants
Rules
• Divide the pedigree into nuclear trios
• Apply the rules to each trio until all individuals are haplotyped, or no further inference is possible
• Rule 1: Impute a missing genotype at each unambiguous locus in each parent, conditional on spouse and child genotypes
• Rule 2: Assign a haplotype at each unambiguous locus in each offspring, conditional on parental genotypes
• Rule 3: Assign haplotypes at each unambiguous locus in each founder, conditional on haplotypes in offspring and the criterion of minimum recombinants in each nuclear family
• Rule 4: Assign haplotypes at each unambiguous locus in each offspring, conditional on haplotypes in parents and the criterion of minimum recombinants in each trio
• Rule 5: Impute a missing genotype at each unambiguous locus in each parent, conditional on haplotypes in offspring and the criterion of minimum recombinants in each nuclear family
• Rule 6: Locate a locus at which at least one individual in a nuclear family is flexible, expand the haplotype configuration into multiple configurations, and retain all configurations with the minimum number of recombinants
Implementation
• Rule 1: Impute a missing genotype at each unambiguous locus in each parent, conditional on genotypes in spouse and offspring
Implementation
• Rule 2: Assign a haplotype at each unambiguous locus in each offspring, conditional on genotypes in parents in each parent-offspring trio
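A single-locus version of Rule 2 might look like the sketch below. It is an illustration, not the authors' code: it assumes all three genotypes are available, represents a genotype as an unordered pair of alleles, and uses the hypothetical function name `assign_parental_source`.

```python
def assign_parental_source(father, mother, child):
    """Return (paternal_allele, maternal_allele) for the child at one locus.

    Returns None when both assignments are possible, i.e. the locus is
    ambiguous from genotypes alone.
    """
    e, f = child
    options = set()
    # Try both ways of splitting the child's alleles between the parents.
    for pat, mat in ((e, f), (f, e)):
        if pat in father and mat in mother:
            options.add((pat, mat))
    if not options:
        raise ValueError("Mendelian inconsistency at this locus")
    return options.pop() if len(options) == 1 else None

resolved = assign_parental_source((1, 1), (1, 2), (1, 2))
ambiguous = assign_parental_source((1, 2), (1, 2), (1, 2))
homozygous = assign_parental_source((1, 2), (1, 1), (1, 1))
```

`resolved` is `(1, 2)`: allele 1 must be paternal. `ambiguous` is `None`, which is the all-heterozygous trio described under flexible locus Type 1; such loci have to wait for the minimum-recombinant rules.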
Implementation
• Rule 3: Assign haplotypes at each unambiguous locus in each founder, conditional on haplotypes in offspring and the criterion of minimum recombinants in each nuclear family
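The minimum-recombinant criterion in Rule 3 can be illustrated by brute force: try every phasing of the founder and keep those needing the fewest crossovers across all offspring. This is a sketch only; the paper's rule-based procedure does not enumerate like this, the function names are hypothetical, and the transmitted haplotypes are assumed to be known and consistent with the founder's genotype.

```python
from itertools import product

def switches(hap1, hap2, child_hap):
    """Crossovers implied by one transmission, using informative loci only."""
    src = [1 if c == a else 2
           for a, b, c in zip(hap1, hap2, child_hap) if a != b]
    return sum(s != t for s, t in zip(src, src[1:]))

def best_founder_phases(genotype, transmitted):
    """genotype: list of (a, b) per locus; transmitted: haplotypes passed to offspring."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    best, best_cost = [], None
    # Fix the orientation of the first heterozygous locus to skip mirror images.
    for flips in product((False, True), repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        hap1 = tuple(b if flip_at.get(i, False) else a
                     for i, (a, b) in enumerate(genotype))
        hap2 = tuple(a if flip_at.get(i, False) else b
                     for i, (a, b) in enumerate(genotype))
        cost = sum(switches(hap1, hap2, t) for t in transmitted)
        if best_cost is None or cost < best_cost:
            best, best_cost = [(hap1, hap2)], cost
        elif cost == best_cost:
            best.append((hap1, hap2))
    return best, best_cost

genotype = [(1, 2), (1, 2), (1, 2)]
unique, cost_unique = best_founder_phases(genotype, [(1, 1, 1), (2, 2, 2), (1, 1, 1)])
tied, cost_tied = best_founder_phases(genotype, [(1, 1, 1), (1, 1, 2)])
```

In the first call a single phasing (1 1 1 / 2 2 2) explains all offspring with zero recombinants. In the second call two phasings tie at one recombinant each, which corresponds to a Type 2 flexible locus in the founder.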
Implementation
• Rule 4: Assign haplotypes at each unambiguous locus in each offspring, conditional on haplotypes in parents and the criterion of minimum recombinants in each trio
Implementation
• Rule 5: Impute a missing genotype at each unambiguous locus in each parent, conditional on the haplotypes in offspring and the criterion of minimum recombinants in each nuclear family
Implementation
• Rule 6: Locate a locus at which at least one individual in a nuclear family is flexible, and expand the haplotype configuration into multiple configurations with alternative haplotype assignments at each flexible locus in these individuals
• Retain all configurations with the minimum number of recombinants
• Reapply Rule 3
Results
• A pedigree with episodic ataxia (EA)
– 29 total individuals
– Genotyped at 9 polymorphic markers
– 2 individuals not genotyped
• Simulation study
• Looped marriage structure in a pedigree with ataxia telangiectasia
Results
• High degree of concordance with the maximum-likelihood (ML) method
• Identical haplotype configuration obtained with GENEHUNTER (ML-based) in >99% of pedigrees analyzed
Simulation Results
Data set (FSize, Loci, Rec, T)   Both   Rule-based alone   Fewer recombinants
(15, 10, 4, STR)                 90     2                  8
(15, 25, 4, STR)                 94     0                  6
(15, 50, 4, STR)                 91     7                  2
(29, 10, 4, STR)                 86     1                  13
(29, 25, 4, STR)                 91     1                  8
(29, 50, 4, STR)                 89     0                  11
(15, 10, 4, SNP)                 82     9                  9
(29, 10, 4, SNP)                 76     8                  16
(15, 10, 0, SNP)                 100    0                  0
(29, 10, 0, SNP)                 97     3                  0
(17, 10, 0, Loop)                98     2                  0
Genotype Errors
• Impact of genotype errors investigated
• Generated genotype data on 1000 pedigrees, each pedigree containing one incorrect allele in a random individual at a random marker
• Mean number of recombinants increased from 5 to 6.2 (1.2)
• 44% of these additional recombinants were double recombinants
• All four correct MRHCs were reconstructed in 84% of pedigrees
Marker errors
• The consequence of incorrect marker order on imputing haplotypes was investigated
• Marker loci 2-7 (of the 9 loci in the EA study) were permuted (6! − 1 = 719 ways)
• Of the 719 orderings
– None produced MRHCs with fewer than 5 recombinants
– Only 5% had the same number of recombinants as the correct ordering
– The chance of recovering all four MRHCs was 20% and 0% when 2 and 6 marker loci were incorrectly ordered, respectively
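The count of 719 incorrect orderings is easy to check by enumeration. The sketch below also tallies orderings by the number of loci that sit in the wrong position; reading "2 and 6 marker loci incorrectly ordered" as that number of displaced loci is an assumption made here, not something the slides state.

```python
from itertools import permutations

correct = (2, 3, 4, 5, 6, 7)   # marker loci 2-7 in their true order

# Every ordering of the six loci except the correct one.
wrong_orders = [p for p in permutations(correct) if p != correct]

# Tally the wrong orderings by how many loci are out of position.
by_displaced = {}
for order in wrong_orders:
    k = sum(a != b for a, b in zip(order, correct))
    by_displaced[k] = by_displaced.get(k, 0) + 1

n_wrong = len(wrong_orders)
```

`n_wrong` is 6! − 1 = 719. Under the displaced-loci reading, 15 orderings have exactly 2 loci out of place (a single swap) and 265 have all 6 out of place.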
Conclusions
• Both genotype errors and incorrect marker order can produce additional recombinants in reconstructing haplotypes
• Sensitivity analyses suggest that incorrect marker orderings may have a larger impact than genotyping errors
Conclusions
• This haplotyping method is applicable to both STR and SNP data
• Total computational requirement due to enumeration in a pedigree with J family members and L loci is on the order of O(J^2 L^3)
• Computational requirements for SNP data are 3-10 times larger than for STRs (more flexible loci)
Shortcomings
• A genotyped individual with neither genotyped parents nor genotyped offspring cannot be analyzed by this algorithm
• The same problem arises even if multiple siblings and other relatives are genotyped
• Likelihood-based methods are able to assign haplotypes to individuals who are uninformative under this rule-based method