1
The Ultimate Genotyping Experiment:
Determination of Human DNA Sequences
Dept. of MCD Biology
Institute for Behavioral Genetics
Center for Adolescent Drug Dependence
University of Colorado
2
Overview
Why DNA sequencing will become the tool of
choice to study genotype/phenotype
relationships for heritable traits
What is the current technology that makes it work
How to make whole genome sequencing affordable
for large-scale genetic studies
3
State of the art: Association studies via
Genome Wide Association
Strategy: survey 10^5 - 10^6 well-known single
nucleotide polymorphisms (SNPs) in large
populations - score for co-variation with trait
– “Skims” genetic variation and can allow
correlation with trait of interest
– Only 5-10% of the ~10 million common SNPs
– Based on “common allele, common disease”
model
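As a rough illustration of "score for co-variation with trait", the sketch below correlates allele counts (0, 1, 2) at each SNP with a quantitative trait and uses N * r^2 as an approximate 1-df score. The function name, SNP names and toy data are hypothetical, and real GWAS pipelines add covariates, quality filters and multiple-testing correction that are not shown here.

```python
import math

def snp_trait_score(genotypes, trait):
    """Return (r, chi2) for one SNP: r is the Pearson correlation between
    allele counts (0, 1, 2) and the trait; chi2 = N * r^2 is an
    approximate 1-df association score."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_t = sum(trait) / n
    cov = sum((g - mean_g) * (t - mean_t) for g, t in zip(genotypes, trait))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_t = sum((t - mean_t) ** 2 for t in trait)
    if var_g == 0 or var_t == 0:
        return 0.0, 0.0  # monomorphic SNP or constant trait: no signal
    r = cov / math.sqrt(var_g * var_t)
    return r, n * r * r

# Hypothetical toy data: 8 people, 2 SNPs, one quantitative trait.
trait = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2, 2.0, 1.1]
snps = {
    "snp_linked": [0, 0, 1, 1, 2, 2, 1, 0],    # allele count tracks the trait
    "snp_unlinked": [1, 0, 2, 0, 1, 1, 0, 2],  # allele count does not
}
scores = {name: snp_trait_score(g, trait) for name, g in snps.items()}
for name, (r, chi2) in scores.items():
    print(f"{name}: r={r:+.2f}, score={chi2:.2f}")
```

In a genome-wide scan the same score would be computed for every typed SNP and the largest values followed up.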
4
When it is good, it is very, very good
• Hundreds of successful GWAS for
several phenotypes (e.g. diabetes,
hypertension, asthma, height, obesity)
• Depends a lot on “power”, which
increases with the number of people
studied
• Also depends critically on phenotype
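To make the link between power and sample size concrete, here is a minimal sketch using a normal approximation for a two-sided 1-df test. It assumes the non-centrality is N times the fraction of trait variance explained by the SNP, and it uses 5e-8 as the significance threshold (a common genome-wide convention, not a figure from these slides). The 0.1% variance explained is a hypothetical effect size.

```python
from statistics import NormalDist

def approx_power(n_people, var_explained, alpha=5e-8):
    """Approximate power of a two-sided 1-df association test.
    Assumes the test statistic is normal with mean sqrt(N * var_explained),
    a small-effect approximation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = (n_people * var_explained) ** 0.5
    # Probability of exceeding the threshold in either tail.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)

# Hypothetical SNP explaining 0.1% of trait variance.
for n in (1_000, 10_000, 100_000):
    print(n, round(approx_power(n, 0.001), 4))
```

Under these assumptions power is negligible at 1,000 people and close to 1 at 100,000, which is why very large combined samples are needed for small effects.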
5
Example: Blood lipids
• In an analysis by a group led by Gonzalo
Abecasis, U. of Michigan
– combined 41 samples, >100,000 genotypes
– Phenotype: Fasting lipids
(LDL, HDL, triglycerides)
– No medicated people studied
– 2.5 x 10^6 SNPs (typed + imputed); MAF >= 1%
• Identified 95 loci that associate with lipid levels
6
How good is this? Very, very
good.
• OMIM had reported 18 genes affecting lipids
– 15 of them within 100kb of GWAS hit
– 8 within 10kb
• Computer simulations with randomized alleles
averaged <1 gene within 100kb, and not 1 of 10^6
simulations had more than 8.
7
Does it mean anything?
• One GWAS allele (40% frequency) was
found to be in GALNT2 (a glycosylation gene)
• Allele shifts HDL-C by only +/- 1 mg/dl
Teslovich et al., in press
8
Does it mean anything?
• One GWAS allele (40% MAF) was found
to be in GALNT2 (a glycosylation gene)
• Allele shifts HDL-C by only +/- 1 mg/dl
• In mouse--
– Overexpression decreases HDL-C ~20%
– Knockdown increases HDL-C ~30%
• So clearly this gene, which had no known
role in lipid metabolism, CAN be important
9
But when it is bad, it is horrid
Successful studies can account for only a
fraction of the genetic influence on
phenotypic variance for most behavioral
traits despite high heritability – why?
• Genes with high influence may be lacking
• Phenotypes inappropriately defined
• Insufficient N
• Inability to study rare variants
meaningfully
10
Sequencing can do genotyping
A LOT better
• Whole genome sequencing types ALL
polymorphisms - rare and common
• GWAS done with sequencing has no
“missing” data like chip-based methods
• Linkage (and LD) are not required for
association. More “straightforward”
analysis
• May eventually be cheaper per marker
11
Sharpening tools
Efforts to begin large scale DNA sequencing
to increase power to detect genes are now
being piloted
1000 Genomes Project is pointing the way
– Moderate frequency alleles – low pass
sequencing (low cost/person)
– Rare variants – deep sequencing (high cost)
12
Digression on Sequencing
• Rates of acquisition of DNA sequence
have gone through the roof
• Accuracy has improved and costs have
plummeted
• Most of the progress is due to determination
of the reference human sequence combined
with technological advances in short read
technologies
13
Sequencing is affordable!
[Figure: publications by year using Illumina sequencing methods, 2007-2010]
16
Three prominent technologies
• Illumina - based on solid phase PCR
approach (similar to SOLiD system)
• 454 - similarly based on solid phase DNA
synthesis using a highly processive process
• PacBio - based on true single-molecule
detection approach - MOST processive
17
Illumina
20
Roche - 454
21
PacBio
• Single molecule of DNA at a time
• 75,000 reads simultaneously
• 1000-10000 bases/read
22
What is good for what?
                 454                 SOLiD                 Illumina               PacBio
Method           DNA Pol synthesis   Ligase                PCR with fluor. dNTPs  DNA Pol special NTPs
Medium           Beads               Beads                 Glass surface          Optical well
Error types      Indels at homopol.  End errors            End errors             Random indels
Bases/read       400-1000            50                    100+100                >1000
Most common use  Metagenomics        Resequencing/de novo  Resequencing/de novo   De novo microbial genomes
23
What can these babies do?
• Illumina, SOLiD, 454 can deal with most of these methods:
– Tag profiling
– Small RNA Discovery
– mRNA-Seq
– Methylation
– Targeted Resequencing
– CNV
– DNase I Hypersensitivity
– Metagenomics
– ChIP-Seq
– ChIA-PET
– Bacterial Sequencing
– Human Genome Resequencing
– Nucleosome Mapping
– Molecular Cytogenetics
– De novo Sequencing
24
What can these babies do?
• Illumina HiSeq 2000 - popular for genotyping
                         HiSeq 2000
Read length              2 x 100
Yield / run              250 Gb
Runs / genome            1/2
Depth                    54.9x
SNPs                     4,232,886
550k GT coverage         99.8%
Genotype concordance     99.3%
25
How are bases called given
finite errors?
• One can sequence many times (e.g. 5x coverage)
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’ Reference Genome
             GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
                AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
           ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
                     ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
               TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Sequence Reads
Predicted Genotype ?
26
How are bases called given
finite errors?
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’ Reference Genome
             GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
                AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
           ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
                     ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
               TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Sequence Reads
P(reads | A/A, read mapped) = 0.00000098
P(reads | A/C, read mapped) = 0.03125
P(reads | C/C, read mapped) = 0.000097
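The three likelihoods can be reproduced with a simple per-read error model. At the variable site, two of the five reads show A and three show C. The sketch below assumes a 1% per-base error rate, a value inferred here because it reproduces the slide's numbers rather than one stated in the talk. It also treats a heterozygote as showing each allele with probability 0.5 and omits the binomial coefficient, which is the same for every genotype.

```python
def genotype_likelihood(n_a, n_c, a_alleles, err=0.01):
    """P(reads | genotype) for n_a reads showing A and n_c reads showing C.
    a_alleles is the number of A alleles carried (0 = C/C, 1 = A/C, 2 = A/A).
    err is an assumed per-base error rate."""
    # Probability that a single read shows A, given the genotype.
    p_read_a = {0: err, 1: 0.5, 2: 1.0 - err}[a_alleles]
    return (p_read_a ** n_a) * ((1.0 - p_read_a) ** n_c)

# Read counts at the variable site in the pileup: 2 x A, 3 x C.
likelihoods = {
    name: genotype_likelihood(2, 3, a_alleles)
    for name, a_alleles in (("A/A", 2), ("A/C", 1), ("C/C", 0))
}
for name, value in likelihoods.items():
    print(f"P(reads | {name}) = {value:.8f}")
```

The heterozygote is the most likely genotype on the reads alone, but likelihoods are only half of the call: they still have to be combined with a prior.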
27
How are bases called
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’ Reference Genome
             GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
                AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
           ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
                     ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
               TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Sequence Reads
Individual Based Prior: Every site has 1/1000 probability of varying.
P(reads|A/A) = 0.00000098   Prior(A/A) = 0.00034   Posterior(A/A) = <.001
P(reads|A/C) = 0.03125      Prior(A/C) = 0.00066   Posterior(A/C) = 0.175
P(reads|C/C) = 0.000097     Prior(C/C) = 0.99900   Posterior(C/C) = 0.825
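The posteriors follow from Bayes' rule: multiply each likelihood by its prior and normalize. A minimal sketch using the slide's own likelihoods and priors (the function name is illustrative):

```python
def posterior(likelihoods, priors):
    """Posterior genotype probabilities: likelihood x prior, normalized to sum to 1."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}

# Values taken from the slide.
likelihoods = {"A/A": 0.00000098, "A/C": 0.03125, "C/C": 0.000097}
individual_prior = {"A/A": 0.00034, "A/C": 0.00066, "C/C": 0.99900}

post = posterior(likelihoods, individual_prior)
for genotype, p in post.items():
    print(f"Posterior({genotype}) = {p:.3f}")
```

With this prior the reference homozygote C/C still wins, even though the reads alone favor A/C, because a site is assumed very unlikely to vary.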
28
How are bases called
• Individual Based Prior
• Assumes all sites have an equal probability of showing polymorphism
• Specifically, assumption is that about 1/1000 bases differ from reference
• If reads were error-free and sampling Poisson …
• … 14x coverage would allow for 99.8% genotype accuracy
• … 30x coverage of the genome needed to allow for errors and clustering
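One way to read the 14x figure: with error-free reads, a heterozygous site is called correctly once each of its two alleles has been seen at least once. If the reads from each allele are Poisson with mean coverage/2, that probability is (1 - exp(-coverage/2))^2. This interpretation is an assumption made here, though it does give about 99.8% at 14x.

```python
import math

def het_detection_prob(coverage):
    """Chance that both alleles of a heterozygous site are each seen at least
    once, assuming error-free reads and Poisson sampling with mean
    coverage / 2 reads per allele."""
    p_allele_seen = 1.0 - math.exp(-coverage / 2.0)
    return p_allele_seen ** 2

for cov in (4, 14, 30):
    print(f"{cov}x: {het_detection_prob(cov):.4f}")
```

Under the same assumptions, 4x coverage sees both alleles only about three quarters of the time, which is why low-pass data need extra information from priors.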
29
What if....
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’ Reference Genome
             GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
                AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
           ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
                     ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
               TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Sequence Reads
Population Based Prior: Use frequency information from examining others at the same site. In the example above, we estimated P(A) = 0.20
P(reads|A/A) = 0.00000098   Prior(A/A) = 0.04   Posterior(A/A) = <.001
P(reads|A/C) = 0.03125      Prior(A/C) = 0.32   Posterior(A/C) = 0.999
P(reads|C/C) = 0.000097     Prior(C/C) = 0.64   Posterior(C/C) = <.001
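The population priors are the Hardy-Weinberg genotype frequencies for P(A) = 0.20 (0.04, 0.32, 0.64). The sketch below derives them and combines them with the slide's likelihoods. Function names are illustrative, and Hardy-Weinberg proportions are assumed because they match the priors shown.

```python
def hardy_weinberg_prior(freq_a):
    """Genotype priors from the population frequency of allele A."""
    freq_c = 1.0 - freq_a
    return {"A/A": freq_a ** 2, "A/C": 2 * freq_a * freq_c, "C/C": freq_c ** 2}

def posterior(likelihoods, priors):
    """Posterior genotype probabilities: likelihood x prior, normalized."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}

likelihoods = {"A/A": 0.00000098, "A/C": 0.03125, "C/C": 0.000097}  # from the slide
prior = hardy_weinberg_prior(0.20)
post = posterior(likelihoods, prior)
best_call = max(post, key=post.get)
print(prior)
print(post, "->", best_call)
```

With the rounded likelihoods shown, this gives roughly 0.994 for A/C and 0.006 for C/C, slightly less extreme than the slide's figures but leading to the same call: the same five reads now support a confident heterozygote.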
31
How are bases called
Population Based Prior
• Uses frequency information obtained from examining other individuals
• Calling very rare polymorphisms still requires 20-30x coverage of the genome
• Calling common polymorphisms requires much less data
Haplotype Based Prior or Imputation Based Analysis
• Compares individuals with similar flanking haplotypes
• Calling very rare polymorphisms still requires 20-30x coverage of the genome
• Can make accurate genotype calls with 2-4x coverage of the genome
• Accuracy improves as more individuals are sequenced
32
What good is using statistics?
Allele frequency         0.5-1%    1-2%     2-5%
400 Deep Genomes (30x - current cost ~$4,000,000 - 2nd quarter ~$2,000,000)
  Discovery Rate         100%      100%     100%
  Het. Accuracy          100%      100%     100%
  Effective N            400       400      400
3000 Shallow Genomes (4x - current cost ~$4,000,000 -> $2,000,000)
  Discovery Rate         100%      100%     100%
  Het. Accuracy          90.4%     97.3%    98.8%
  Effective N            2406      2758     2873
This would cover essentially ALL 10 x 10^6 common SNPs
in 2800 individuals. Affy cost now: $1,000,000 for 1/10
the number of common SNPs.
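The trade-off in the table can be summarized as cost per effective individual. A small sketch using the slide's current-cost figures and effective N values (the dictionary layout and names are illustrative):

```python
# Figures copied from the table: same budget, two study designs.
designs = {
    "deep_30x": {
        "cost": 4_000_000,
        "effective_n": {"0.5-1%": 400, "1-2%": 400, "2-5%": 400},
    },
    "shallow_4x": {
        "cost": 4_000_000,
        "effective_n": {"0.5-1%": 2406, "1-2%": 2758, "2-5%": 2873},
    },
}

def cost_per_effective_sample(design):
    """Dollars spent per effective individual, by allele-frequency bin."""
    return {maf: design["cost"] / n for maf, n in design["effective_n"].items()}

deep = cost_per_effective_sample(designs["deep_30x"])
shallow = cost_per_effective_sample(designs["shallow_4x"])
for maf in deep:
    print(f"MAF {maf}: deep ${deep[maf]:,.0f} vs shallow ${shallow[maf]:,.0f}")
```

Even in the rarest bin, the shallow design delivers about six times the effective sample size of the deep design for the same spend.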
33
Genotyping by sequencing
• Even with current technology, sequencing
can be an alternative to chip genotyping
• Costs ~$5K per person for deep coverage
• Costs ~$800 per person for 4X coverage
• Using HapMap plus knowledge about alleles,
can study all SNPs with MAF > 1-2%
34
Does sequencing help?
• Too soon to be sure - but probably so
• Best work is in the 1000 Genomes Project
– 2 deeply sequenced trios
– 179 whole genomes sequenced at low coverage
– 8,820 exons deeply sequenced in 697 individuals
15M SNPs, 1M indels, 20,000 structural variants
35
Some highlights
[Figure: reduced diversity extending ~120kb around genes]
36
Allele frequency spectrum
37
Does sequencing improve association:
Expression QTL example TIMM22
40
Imputation
• Given detailed allele distribution data, it is
possible to “guess” genotypes based on
HapMap/neighboring markers
• Improves with better understanding of
allele distribution
• Allows conversion of Affy/Illumina chip
data into more detailed SNP information
electronically
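The idea of "guessing" untyped genotypes from neighboring markers can be shown with a toy nearest-haplotype sketch: find the reference haplotypes that best match the typed positions, then copy their alleles into the gaps. This is a deliberately simplified stand-in, not the method used by production imputation software, which relies on probabilistic haplotype models. The panel and haplotype below are hypothetical.

```python
def impute(target, reference_haplotypes, missing="?"):
    """Fill missing alleles in `target` using the reference haplotypes that
    best match it at the typed positions (majority vote among best matches)."""
    typed = [i for i, allele in enumerate(target) if allele != missing]

    def mismatches(hap):
        return sum(hap[i] != target[i] for i in typed)

    best = min(mismatches(h) for h in reference_haplotypes)
    closest = [h for h in reference_haplotypes if mismatches(h) == best]

    result = []
    for i, allele in enumerate(target):
        if allele != missing:
            result.append(allele)
        else:
            votes = [h[i] for h in closest]
            # Most frequent allele among the closest haplotypes
            # (ties broken alphabetically).
            result.append(max(sorted(set(votes)), key=votes.count))
    return "".join(result)

# Hypothetical reference panel of sequenced haplotypes (5 sites each).
reference_panel = ["ACGTA", "ACGTA", "TCGCA", "TTACG"]
# Chip data: sites 2 and 4 were not typed.
chip_haplotype = "A?G?A"
imputed = impute(chip_haplotype, reference_panel)
print(chip_haplotype, "->", imputed)
```

A larger, better matched reference panel gives more haplotypes to copy from, which is the intuition behind accuracy improving as more individuals are sequenced.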
41
Imputation
Reference Panel       Release Date    Imputation Accuracy (r2)
                                      MAF 1-3%   MAF 3-5%   MAF >5%
1000G Pilot (final)   June 2010       ~0.69      ~0.77      ~0.91
280 EUR (draft)       November 2010   ~0.73      ~0.78      ~0.92
• As more samples are sequenced, ability to impute individual SNPs improves
• As more samples are sequenced, it becomes possible to impute additional markers
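Imputation accuracy of this kind is usually reported as the squared correlation between imputed allele dosages and the true genotypes at a marker. The sketch below assumes that definition, and the genotypes and dosages are made-up illustrative values.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true allele counts (0, 1, 2)
    and imputed dosages (expected allele counts, 0.0 to 2.0)."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return 0.0  # no variation: correlation undefined, report 0
    return cov * cov / (var_t * var_d)

# Hypothetical marker typed in 8 people, plus imputed dosages for the same people.
true_genotypes = [0, 0, 1, 2, 1, 0, 0, 1]
imputed_dosages = [0.1, 0.0, 0.9, 1.8, 1.2, 0.2, 0.4, 0.7]
r2 = imputation_r2(true_genotypes, imputed_dosages)
print(f"r2 = {r2:.2f}")
```

An r2 near 1 means the imputed marker carries almost the same information for association testing as a directly typed one.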
42
Status of 1000 Genomes
• 25,487,060 variant sites called on 629 samples
– 7,922,125 sites in dbSNP 129
– 17,564,935 sites not in dbSNP 129
– 98.8% of HapMap III sites rediscovered
– Transition/transversion ratio of 2.21 vs 2.04 in Pilot
• As of November 2010:
– 1103 sequenced samples
– 22.6 Tb of raw sequence data
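The transition/transversion ratio is a quick quality check on a SNP call set: transitions are purine-to-purine or pyrimidine-to-pyrimidine changes, and everything else is a transversion. A minimal sketch with a made-up list of SNPs, which also checks that the two dbSNP counts above add up to the total:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T changes (assumes ref != alt)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ts_tv_ratio(snps):
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

# Hypothetical call set: 4 transitions, 2 transversions.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
ratio = ts_tv_ratio(snps)
print(f"Ts/Tv = {ratio:.2f}")

# Site counts from the slide: known + novel should equal the total.
in_dbsnp, not_in_dbsnp, total_sites = 7_922_125, 17_564_935, 25_487_060
print(in_dbsnp + not_in_dbsnp == total_sites)
```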