![Page 1: Lecture 7.01 The informatics of SNPs and haplotypes Gabor T. Marth Department of Biology, Boston College marth@bc.edu CGDN Bioinformatics Workshop June](https://reader035.vdocuments.site/reader035/viewer/2022062500/5697bfa41a28abf838c973f5/html5/thumbnails/1.jpg)
Lecture 7.0 1
The informatics of SNPs and haplotypes
Gabor T. Marth
Department of Biology, Boston College
marth@bc.edu
CGDN Bioinformatics Workshop, June 25, 2007
Why do we care about variations?
underlie phenotypic differences
cause inherited diseases
allow tracking ancestral human history
How do we find sequence variations?
• look at multiple sequences from the same genome region
• use base quality values to decide if mismatches are true polymorphisms or sequencing errors
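A minimal sketch of how a base quality value translates into an error probability. It assumes the quality values follow the standard Phred convention, Q = -10 log10(P_error); the slide does not name the scale, and the function name is illustrative.

```python
def phred_to_error_prob(quality: float) -> float:
    """Probability that a base call with the given Phred quality is wrong."""
    return 10 ** (-quality / 10)


# A mismatch at a low-quality base is far more likely to be a sequencing
# error than a mismatch at a high-quality base.
for q in (10, 20, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):.4g}")
```

Under this convention a Q20 call is wrong about once in 100 bases, so a mismatch supported only by low-quality calls is weak evidence for a true polymorphism.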
Steps of SNP discovery
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection
Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources
2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors
[Figure labels: sequencing error; true polymorphism]
SNP mining steps – PolyBayes
sequence clustering simplifies to database search with genome reference
paralog filtering by counting mismatches weighted by quality values
multiple alignment by anchoring fragments to genome reference
SNP detection by differentiating true polymorphism from sequencing error using quality values
SNP discovery with PolyBayes
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection
[Figure label: genome reference sequence]
Polymorphism discovery SW
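$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$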
Marth et al. Nature Genetics 1999
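A brute-force sketch of the Bayesian P(SNP) calculation for one alignment column. Several pieces are assumptions made for illustration and are not taken from the slide: Phred-scaled qualities, errors spread evenly over the three other bases, uniform per-base priors (which then cancel between numerator and denominator), and a joint prior that spreads a hypothetical polymorphism rate `prior_poly` evenly over all non-monomorphic configurations. It is not the PolyBayes implementation.

```python
from itertools import product

BASES = "ACGT"


def read_likelihood(true_base, called_base, quality):
    """P(S_i | R_i): the call is right with prob 1 - e, else e/3 per other base."""
    e = 10 ** (-quality / 10)  # assumes Phred-scaled quality values
    return 1 - e if true_base == called_base else e / 3


def snp_probability(calls, quals, prior_poly=0.001):
    """Posterior probability that the column is polymorphic (needs >= 2 reads)."""
    n = len(calls)
    n_poly = 4 ** n - 4  # configurations with at least two different bases
    numerator = 0.0
    denominator = 0.0
    for config in product(BASES, repeat=n):
        is_poly = len(set(config)) > 1
        # Assumed joint prior: monomorphic columns share 1 - prior_poly,
        # polymorphic columns share prior_poly.
        prior = prior_poly / n_poly if is_poly else (1 - prior_poly) / 4
        weight = prior
        for s, c, q in zip(config, calls, quals):
            weight *= read_likelihood(s, c, q)
        denominator += weight
        if is_poly:
            numerator += weight
    return numerator / denominator


print(snp_probability("AAGG", [40, 40, 40, 40]))  # two well-supported alleles
print(snp_probability("AAAG", [40, 40, 40, 5]))   # lone low-quality mismatch
```

Enumerating all 4^N configurations is only practical for a handful of reads; it is used here because it mirrors the sums in the formula directly.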
Genotyping by sequence
• SNP discovery usually deals with single-stranded (clonal) sequences
• It is often necessary to determine the allele state of individuals at known polymorphic locations
• Genotyping usually involves double-stranded DNA, so the possibility of heterozygosity exists
• there is no unique underlying nucleotide, no meaningful base quality value, hence statistical methods of SNP discovery do not apply
Het detection = Diploid base calling
Homozygous T
Homozygous C
Heterozygous C/T
Automated detection of heterozygous positions in diploid individual samples
Large SNP mining projects
Sachidanandam et al. Nature 2001
~ 8 million
EST
WGS
BAC
genome reference
Variation structure is heterogeneous
chromosomal averages
polymorphism density along chromosomes
What explains nucleotide diversity?
[Plots: SNP rate [per 10,000 bp] plotted against G+C content [%] (30 to 54), CpG content [%] (0.3 to 5.7), and recombination rate [per Mb] (0 to 4)]
G+C nucleotide content
CpG di-nucleotide content
recombination rate
functional constraints
3’ UTR: 5.00 x 10^-4
5’ UTR: 4.95 x 10^-4
Exon, overall: 4.20 x 10^-4
Exon, coding: 3.77 x 10^-4
synonymous: 366 / 653
non-synonymous: 287 / 653
Variance is so high that these quantities are poor predictors of nucleotide diversity in local regions; hence random processes, i.e. (random) genetic drift, are likely to govern the basic shape of the genome variation landscape
Where do variations come from?
• sequence variations are the result of mutation events
• mutations are propagated down through generations
• and determine present-day variation patterns
[Figure: genealogy descending from the MRCA; the sequences TAAAAAT and TAACAAT differ by a single mutation]
Neutrality vs. selection
• selective mutations influence the genealogy itself; in the case of neutral mutations the processes of mutation and genealogy are decoupled
functional constraints
3’ UTR: 5.00 x 10^-4
5’ UTR: 4.95 x 10^-4
Exon, overall: 4.20 x 10^-4
Exon, coding: 3.77 x 10^-4
synonymous: 366 / 653
non-synonymous: 287 / 653
• the genome shows signals of selection but on the genome scale, neutral effects dominate
Mutation rate
• higher mutation rate (µ) gives rise to more SNPs
[Figure: two genealogies from the MRCA; the pair accgttatgtaga / accgctatgtaga differs at one site, the pair actgttatgtaga / accgctatataga at three sites]
• there is evidence for regional differences in observed mutation rates in the genome
[Plot: SNP rate [per 10,000 bp] against CpG content [%] (0.3 to 5.7); axis labels: CpG content, SNP density]
Long-term demography
small (effective) population size N
large (effective) population size N
• different world populations have varying long-term effective population sizes (e.g. African N is larger than European)
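The dependence of SNP counts on mutation rate and effective population size can be made concrete with a standard result that is not shown on the slides: under a neutral, constant-size model, the expected number of segregating sites in n sampled chromosomes is θ · a_n · L, with θ = 4Nµ per site and a_n = 1 + 1/2 + ... + 1/(n-1). The parameter values below are illustrative assumptions, not estimates from the lecture.

```python
def expected_segregating_sites(n_eff, mu, n_samples, length_bp):
    """Expected SNP count under the neutral, constant-size model (Watterson)."""
    theta = 4 * n_eff * mu                          # per-site diversity parameter
    a_n = sum(1 / i for i in range(1, n_samples))   # harmonic-type factor
    return theta * a_n * length_bp


# Illustrative values only: N = 10,000, mu = 2e-8 per bp per generation,
# two chromosomes compared over 10,000 bp.
print(expected_segregating_sites(10_000, 2e-8, 2, 10_000))
# Doubling either N or mu doubles the expected number of SNPs.
print(expected_segregating_sites(20_000, 2e-8, 2, 10_000))
```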
Population subdivision
unique unique
shared
• geographically subdivided populations will have differences between their respective variation structures
Recombination
[Figure: chromosomes carrying acggttatgtaga or accgttatgtaga, which differ at a single site]
Recombination
[Figure: chromosomes carrying acggttatgtaga or accgttatgtaga, which differ at a single site]
• recombination has a crucial effect on the association between different alleles
Modeling genetic drift: Genealogy
present generation
randomly mating population, genealogy evolves in a non-deterministic fashion
Modeling genetic drift: Mutation
mutations randomly “drift”: they die out, rise to higher frequency, or become fixed
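A small sketch of neutral drift, assuming the textbook Wright-Fisher model (each chromosome in the next generation copies a randomly chosen parent chromosome). The model choice, function name, and population size are assumptions for illustration; the slide does not specify a model.

```python
import random


def drift_trajectory(pop_size, start_count=1, max_generations=10_000, rng=None):
    """Copies of a neutral allele per generation among pop_size chromosomes."""
    rng = rng or random.Random()
    count = start_count
    trajectory = [count]
    for _ in range(max_generations):
        if count == 0 or count == pop_size:
            break  # allele has been lost or fixed
        freq = count / pop_size
        # Binomial sampling of the next generation.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        trajectory.append(count)
    return trajectory


rng = random.Random(42)
fates = [drift_trajectory(20, rng=rng)[-1] for _ in range(1000)]
print("lost:", fates.count(0), "fixed:", fates.count(20))
```

Most new mutations are lost within a few generations; in this model a new neutral mutation is expected to fix with probability 1/pop_size.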
Modulators: Natural selection
negative (purifying) selection
positive selection
the genealogy is no longer independent of (and hence cannot be decoupled from) the mutation process
Modeling ancestral processes
“forward simulations” vs. the “Coalescent” process
By focusing on a small sample, the complexity of the relevant part of the ancestral process is greatly reduced. There are, however, limitations.
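To illustrate why the sample-based view is cheap, here is a sketch of the standard coalescent for a constant-size population: while k lineages remain, the waiting time to the next coalescence is exponential with rate k(k-1)/2, in units of 2N generations. The constant-size assumption and the time scaling are textbook conventions, not details given on the slide.

```python
import random


def coalescent_waiting_times(n_samples, rng=None):
    """Waiting times T_k (units of 2N generations) while k lineages remain."""
    rng = rng or random.Random()
    times = {}
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs that can coalesce
        times[k] = rng.expovariate(rate)
    return times


def tmrca(times):
    """Time to the most recent common ancestor of the sample."""
    return sum(times.values())


def total_branch_length(times):
    """Total tree length; neutral mutations fall on it in proportion to length."""
    return sum(k * t for k, t in times.items())


rng = random.Random(1)
sample = coalescent_waiting_times(5, rng)
print("TMRCA:", tmrca(sample), "tree length:", total_branch_length(sample))
```

Only n - 1 random draws are needed for a sample of n chromosomes, regardless of population size, whereas a forward simulation must track every individual in every generation.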
Models of demographic history
[Figure: four models of population size history drawn from past to present (stationary, expansion, collapse, bottleneck), each shown with its marker density (MD, by simulation) and allele frequency spectrum (AFS, direct form)]
Data: polymorphism distributions
1. marker density (MD): distribution of number of SNPs in pairs of sequences
Clone 1    Clone 2    # SNPs
AL00675    AL00982    8
AS81034    AK43001    0
CB00341    AL43234    2
[Plot: MD histogram, number of SNPs 0 to 10 on the x-axis, values 0 to 0.3 on the y-axis]
2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples
SNP    Minor allele    Allele count
A/G    A               1
C/T    T               9
A/G    G               3
[Plot: AFS histogram, allele count 1 to 10 on the x-axis from “rare” to “common”, values 0 to 0.1 on the y-axis]
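Both distributions are normalized histograms, which can be sketched as follows. The input lists reuse the three example rows from each table on this slide; the function name is illustrative.

```python
from collections import Counter


def normalized_histogram(values):
    """Fraction of observations falling on each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {v: counts[v] / total for v in sorted(counts)}


pair_snp_counts = [8, 0, 2]      # "# SNPs" column of the clone-pair table
minor_allele_counts = [1, 9, 3]  # "Allele count" column of the SNP table

marker_density = normalized_histogram(pair_snp_counts)
allele_frequency_spectrum = normalized_histogram(minor_allele_counts)
print(marker_density)
print(allele_frequency_spectrum)
```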
Model: processes that generate SNPs
[Equation: closed-form expression for P(k), the probability of observing k SNPs; the terms are not recoverable from the extracted text]
computable formulations
simulation procedures
3/5 1/5 2/5
Models of demographic history
[Figure: four models of population size history drawn from past to present (stationary, expansion, collapse, bottleneck), each shown with its marker density (MD, by simulation) and allele frequency spectrum (AFS, direct form)]
Data fitting: marker density
Marth et al. PNAS 2003
[Plot: marker density, series labeled 4 kb, 8 kb, 12 kb and 16 kb]
• best model is a bottleneck-shaped population size history
present: N1 = 6,000, T1 = 1,200 gen.; N2 = 5,000, T2 = 400 gen.; N3 = 11,000
• our conclusions from the marker density data are confounded by the unknown ethnicity of the public genome sequence, so we looked at allele frequency data from ethnically defined samples
Data fitting: allele frequency
[Plots: three allele frequency spectra, x-axis 1 to 10, y-axis 0 to 0.15]
model consensus: bottleneck
present: N1 = 20,000, T1 = 3,000 gen.; N2 = 2,000, T2 = 400 gen.; N3 = 10,000
• Data from other populations?
Population specific demographic history
[Plots: three allele frequency spectra against minor allele count (1 to 10), y-axis 0 to 0.15]
European data: bottleneck
African data: modest but uninterrupted expansion
Marth et al. Genetics 2004
Model-based prediction
computational model encapsulating what we know about the process
genealogy + mutations
allele structure
arbitrary number of additional replicates
Prediction – allele frequency and age
[Plots for African data and European data. Top: contribution of the past to alleles in various frequency classes (proportion of AFS, 0.00 to 0.07, against mutational size i = 1 to 20). Bottom: average age of polymorphism (mutational age in generations, 0 to 80,000, against mutational size i = 1 to 20)]
How to use markers to find disease?
Allelic association
• allelic association is the non-random assortment between alleles, i.e. it measures how well knowledge of the allele state at one site permits prediction at another
[Figure labels: marker site; functional site]
• by necessity, the strength of allelic association is measured between markers
• significant allelic association between a marker and a functional site permits localization (mapping) even without having the functional site in our collection
• there are pair-wise and multi-locus measures of association
Linkage disequilibrium
• LD measures the deviation from random assortment of the alleles at a pair of polymorphic sites
D = f(AB) – f(A) x f(B), where A and B denote one allele at each of the two sites and AB the haplotype carrying both
• other measures of LD are derived from D, e.g. by normalizing according to allele frequencies (r²)
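A short sketch of both measures for a pair of biallelic sites, using the standard definition r² = D² / (f(A)(1 - f(A)) f(B)(1 - f(B))). The two-character haplotype encoding and the function name are assumptions for illustration.

```python
def ld_stats(haplotypes, allele_a, allele_b):
    """Return (D, r^2) from phased two-site haplotypes such as "AC"."""
    n = len(haplotypes)
    f_a = sum(h[0] == allele_a for h in haplotypes) / n
    f_b = sum(h[1] == allele_b for h in haplotypes) / n
    f_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haplotypes) / n
    d = f_ab - f_a * f_b
    denom = f_a * (1 - f_a) * f_b * (1 - f_b)
    # r^2 is undefined when either site is monomorphic in the sample.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


print(ld_stats(["AC", "AC", "GT", "GT"], "A", "C"))  # complete association
print(ld_stats(["AC", "AT", "GC", "GT"], "A", "C"))  # random assortment
```

In the first sample the allele at one site fully predicts the allele at the other (D = 0.25, r² = 1); in the second all four haplotypes are equally common and both measures are zero.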
Haplotype diversity
• the most useful multi-marker measures of associations are related to haplotype diversity
random assortment of alleles at different sites: 2^n possible haplotypes for n markers
strong association: most chromosomes carry one of a few common haplotypes – reduced haplotype diversity
Haplotype blocks
Daly et al. Nature Genetics 2001
• experimental evidence for reduced haplotype diversity (mainly in European samples)
The promise for medical genetics
CACTACCGACACGACTATTTGGCGTAT
• within blocks a small number of SNPs are sufficient to distinguish the few common haplotypes, so significant marker reduction is possible
• if the block structure is a general feature of human variation structure, whole-genome association studies will be possible at a reduced genotyping cost
• this motivated the HapMap project
Gibbs et al. Nature 2003
The HapMap initiative
• goal: to map out human allele and association structure at the kilobase scale
• deliverables: a set of physical and informational reagents
HapMap physical reagents
• reference samples: 4 world populations, ~100 independent chromosomes from each
• SNPs: computational candidates where both alleles were seen in multiple chromosomes
• genotypes: high-accuracy assays from various platforms; fast public data release
Informational: haplotypes
• the problem: the substrate for genotyping is diploid, genomic DNA; phasing of alleles at multiple loci is in general not possible with certainty
• experimental methods of haplotype determination (single-chromosome isolation followed by whole-genome PCR amplification, radiation hybrids, somatic cell hybrids) are expensive and laborious
[Figure alleles: A T C T G C C A]
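The phasing ambiguity can be sketched by enumerating every haplotype pair consistent with an unphased genotype: with h heterozygous sites there are 2^(h-1) distinct pairs. The genotype encoding (one unordered allele pair per site) and the example genotype are assumptions for illustration, not data from the slide.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = []
    # The first heterozygous site keeps a fixed orientation so that
    # (h1, h2) and (h2, h1) are not counted as different pairs.
    for flips in product([False, True], repeat=max(len(het_sites) - 1, 0)):
        flip_at = dict(zip(het_sites[1:], flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        pairs.append(("".join(hap1), "".join(hap2)))
    return pairs


# Hypothetical genotype: heterozygous at sites 1 and 3, homozygous at site 2.
for pair in compatible_haplotype_pairs([("A", "T"), ("C", "C"), ("G", "C")]):
    print(pair)
```

An individual heterozygous at four sites already has eight possible haplotype pairs, which is why phase generally cannot be read off diploid genotype data.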
Haplotype inference
• Parsimony approach: minimize the number of different haplotypes that explain all diploid genotypes in the sample (Clark, Mol Biol Evol 1990)
• Maximum likelihood approach: estimate haplotype frequencies that are most likely to produce observed diploid genotypes (Excoffier & Slatkin, Mol Biol Evol 1995)
• Bayesian methods: estimate haplotypes based on the observed diploid genotypes and the a priori expectation of haplotype patterns informed by population genetics (Stephens et al., AJHG 2001)
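As an illustration of the maximum likelihood idea, the sketch below runs a plain expectation-maximization loop over haplotype frequencies, assuming random mating (Hardy-Weinberg proportions). It is a simplified teaching example, not the published Excoffier & Slatkin implementation; the genotype encoding, function names, and iteration count are assumptions.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    result = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1 = "".join(b if flip_at.get(i) else a for i, (a, b) in enumerate(genotype))
        h2 = "".join(a if flip_at.get(i) else b for i, (a, b) in enumerate(genotype))
        result.append((h1, h2))
    return result


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    options = [phasings(g) for g in genotypes]
    haps = {h for opts in options for pair in opts for h in pair}
    freq = {h: 1 / len(haps) for h in haps}  # uniform starting point
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        expected = defaultdict(float)
        for opts in options:
            # E-step: weight each phasing by the product of its haplotype
            # frequencies; the Hardy-Weinberg factor of 2 is the same for
            # every phasing of one genotype and cancels on normalization.
            weights = [freq[h1] * freq[h2] for h1, h2 in opts]
            total = sum(weights)
            for (h1, h2), w in zip(opts, weights):
                expected[h1] += w / total
                expected[h2] += w / total
        # M-step: expected haplotype counts become the new frequencies.
        freq = {h: expected[h] / n_chrom for h in haps}
    return freq


genotypes = [
    [("A", "A"), ("C", "C")],  # unambiguous: AC / AC
    [("G", "G"), ("T", "T")],  # unambiguous: GT / GT
    [("A", "G"), ("C", "T")],  # ambiguous: AC/GT or AT/GC
]
print(em_haplotype_frequencies(genotypes))
```

The two homozygous individuals make AC and GT common, so the ambiguous double heterozygote is resolved toward AC/GT and the frequencies of AT and GC shrink toward zero.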
Haplotype inference
http://pga.gs.washington.edu/
Haplotype annotations – LD based
• Pair-wise LD-plots
Wall & Pritchard Nature Rev Gen 2003
• LD-based multi-marker block definitions requiring strong pair-wise LD between all pairs in block
Annotations – haplotype blocks
• Dynamic programming approach (Zhang et al., AJHG 2001)
1. meet block definition based on common haplotype requirements
2. within each block, determine the number of SNPs that distinguishes common haplotypes (htSNPs)
3. minimize the total number of htSNPs over complete region including all blocks
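Step 2 can be sketched for a single block as an exhaustive search for the smallest set of SNP positions that still tells the common haplotypes apart. This covers only the within-block selection; the dynamic programming over block boundaries is not reproduced here, and the haplotypes below are made up for illustration.

```python
from itertools import combinations


def minimal_tag_set(haplotypes):
    """Smallest tuple of site indices that distinguishes all distinct haplotypes."""
    n_sites = len(haplotypes[0])
    n_distinct = len(set(haplotypes))
    for size in range(n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[i] for i in sites) for h in haplotypes}
            if len(signatures) == n_distinct:
                return sites  # first hit is minimal because sizes increase


common_haplotypes = ["ACGT", "ACGA", "TCGA"]  # hypothetical block
print(minimal_tag_set(common_haplotypes))
```

Here two of the four SNPs (the first and the last) are enough to identify each haplotype. The search is exponential in the number of sites, which is acceptable only for short blocks.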
Haplotype tagging SNPs (htSNPs)
Find groups of SNPs such that each possible pair is in strong LD (above threshold).
Carlson AJHG 2005
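A rough sketch of the grouping step: a single greedy pass that adds a SNP to a group only if its r² with every current member reaches the threshold. This is a simplification for illustration, not the published procedure; the 0/1 allele coding and the 0.8 threshold are assumptions.

```python
def r_squared(col_x, col_y):
    """r^2 between two biallelic SNPs coded 0/1 per chromosome."""
    n = len(col_x)
    p_x = sum(col_x) / n
    p_y = sum(col_y) / n
    p_xy = sum(x * y for x, y in zip(col_x, col_y)) / n
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    return (p_xy - p_x * p_y) ** 2 / denom if denom > 0 else 0.0


def ld_groups(snp_columns, threshold=0.8):
    """Greedily group SNP indices so that every pair in a group has r^2 >= threshold."""
    groups = []
    for idx, col in enumerate(snp_columns):
        for group in groups:
            if all(r_squared(col, snp_columns[j]) >= threshold for j in group):
                group.append(idx)
                break
        else:
            groups.append([idx])  # no compatible group: start a new one
    return groups


snps = [
    [0, 0, 1, 1, 0, 1],  # SNP 0
    [0, 0, 1, 1, 0, 1],  # SNP 1: identical pattern to SNP 0
    [1, 0, 1, 0, 1, 0],  # SNP 2: weakly associated with the others
]
print(ld_groups(snps))
```

One SNP from each group can then serve as its tag. The greedy pass depends on input order and is not guaranteed to find the smallest number of groups.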
http://bioinformatics.bc.edu/marthlab