
1

The Ultimate Genotyping Experiment:

Determination of Human DNA Sequences

Dept. of MCD Biology

Institute for Behavioral Genetics

Center for Adolescent Drug Dependence

University of Colorado

2

Overview

Why DNA sequencing will become the tool of

choice to study genotype/phenotype

relationships for heritable traits

What is the current technology that makes it work

How to make whole genome sequencing affordable

for large-scale genetic studies

3

State of the art: Association studies via

Genome Wide Association

Strategy: survey 10^5 - 10^6 well-known single nucleotide polymorphisms (SNPs) in large populations - score for co-variation with trait

– “Skims” genetic variation and can allow

correlation with trait of interest

– Only 5-10% of the ~10 million common SNPs

– Based on “common allele, common disease”

model

4

When it is good, it is very, very good

• Hundreds of successful GWAS studies for several phenotypes (e.g. diabetes, hypertension, asthma, height, obesity)

• Depends a lot on “power”, which increases with the number of people studied

• Also depends critically on phenotype
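A rough sketch of how power grows with the number of people studied. Everything here is an assumption for illustration: the normal approximation, the genome-wide threshold of 5 x 10^-8, and a hypothetical variant explaining 0.1% of trait variance. None of these values come from the slides.

```python
from statistics import NormalDist

def gwas_power(n_people, variance_explained, alpha=5e-8):
    """Approximate power of a two-sided, 1-df association test.

    Normal approximation (assumed): the test statistic has mean
    sqrt(n_people * variance_explained) and unit variance.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (n_people * variance_explained) ** 0.5
    # Probability that the statistic exceeds the threshold in either tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical variant explaining 0.1% of the trait variance.
for n in (1_000, 10_000, 100_000):
    print(n, round(gwas_power(n, variance_explained=0.001), 4))
```

Under these assumptions a small-effect variant is nearly invisible at 10,000 people and nearly certain to be detected at 100,000.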

5

Example: Blood lipids

• In an analysis by a group led by Gonzalo Abecasis (U. of Michigan)

– combined 41 samples, >100,000 genotypes

– Phenotype: Fasting lipids (LDL, HDL, triglycerides)

– No medicated people studied

– 2.5 x 10^6 SNPs (typed + imputed); MAF >= 1%

• Identified 95 loci that associate with lipid levels

6

How good is this? Very, very good.

• OMIM had reported 18 genes affecting lipids

– 15 of them within 100 kb of a GWAS hit

– 8 within 10 kb

• Computer simulations with randomized alleles averaged <1 gene within 100 kb, and not 1 in 10^6 simulations had more than 8.
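A minimal sketch of this kind of randomization check, not the authors' actual procedure. The single linear genome of 3 x 10^9 bp, the uniformly random hit positions, and the evenly spaced gene positions are all hypothetical stand-ins.

```python
import bisect
import random

GENOME_LENGTH = 3_000_000_000  # assumed: one linear genome, in base pairs
WINDOW = 100_000               # 100 kb on either side of a gene
N_HITS = 95                    # number of GWAS loci on the slide

# Hypothetical positions for the 18 known lipid genes (evenly spaced).
gene_positions = [i * 150_000_000 + 5_000_000 for i in range(18)]

def count_genes_near_hits(genes, hits, window):
    """Count genes that have at least one hit within +/- window bp."""
    hits = sorted(hits)
    count = 0
    for gene in genes:
        # First hit at or beyond the left edge of the gene's window.
        i = bisect.bisect_left(hits, gene - window)
        if i < len(hits) and hits[i] <= gene + window:
            count += 1
    return count

def null_counts(genes, n_hits, n_sims, seed=0):
    """Gene counts obtained when hit positions are placed at random."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sims):
        hits = [rng.randrange(GENOME_LENGTH) for _ in range(n_hits)]
        counts.append(count_genes_near_hits(genes, hits, WINDOW))
    return counts

counts = null_counts(gene_positions, N_HITS, n_sims=1000)
print("mean genes near a random hit:", sum(counts) / len(counts))
print("max over simulations:", max(counts))
```

The observed count (15 of 18) is then compared against this null distribution.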

7

Does it mean anything?

• One GWAS allele (40% frequency) was found to be in GALNT2 (a glycosylation gene)

• Allele causes only +/- 1 mg/dl HDL-C

Teslovich et al., in press

8

Does it mean anything?

• One GWAS allele (40% MAF) was found to be in GALNT2 (a glycosylation gene)

• Allele causes only +/- 1 mg/dl HDL-C

• In mouse:

– Overexpression decreases HDL-C ~20%

– Knockdown increases HDL-C ~30%

• So clearly this gene, which had no known role in lipid metabolism, CAN be important

9

But when it is bad, it is horrid

Successful studies can account for only a fraction of the genetic influence on phenotypic variance for most behavioral traits despite high heritability – why?

• Genes with high influence may be lacking

• Phenotypes inappropriately defined

• Insufficient N

• Inability to study rare variants meaningfully

10

Sequencing can do genotyping A LOT better

• Whole genome sequencing types ALL polymorphisms - rare and common

• GWAS done with sequencing has no “missing” data like chip-based methods

• Linkage (and LD) are not required for association. More “straightforward” analysis

• May eventually be cheaper per marker

11

Sharpening tools

Efforts to begin large scale DNA sequencing

to increase power to detect genes are now

being piloted

1000 Genomes Project is pointing the way

– Moderate frequency alleles – low pass

sequencing (low cost/person)

– Rare variants – deep sequencing (high cost)

12

Digression on Sequencing

• Rates of acquisition of DNA sequence have gone through the roof

• Accuracy has improved and costs have plummeted

• Most of the progress is due to determination of the reference human sequence combined with technological advances in short-read technologies

13

Sequencing is affordable!

15

Publications by Year using Illumina Sequencing Methods

[Figure: publications per year, x-axis 2007, 2008, 2009, 2010]

16

Three prominent technologies

• Illumina - based on solid phase PCR approach (similar to SOLiD system)

• 454 - similarly based on solid phase DNA synthesis using a highly processive process

• PacBio - based on true single-molecule detection approach - MOST processive

17

Illumina

18

Illumina

19

Illumina

20

Roche - 454

21

PacBio: a single molecule of DNA at a time

• 75,000 reads simultaneously

• 1000-10000 bases/read

22

What is good for what?

                 454                 SOLiD                 Illumina               PacBio
Method           DNA Pol synthesis   Ligase                PCR with fluor. dNTPs  DNA Pol, special NTPs
Medium           Beads               Beads                 Glass surface          Optical well
Error types      Indels at homopol.  End errors            End errors             Random indels
Bases/read       400-1000            50                    100+100                >1000
Most common use  Metagenomics        Resequencing/de novo  Resequencing/de novo   De novo microbial genomes

23

What can these babies do?

• Illumina, SOLiD, 454 can deal with most of these methods:

– Tag profiling

– Small RNA Discovery

– mRNA-Seq

– Methylation

– Targeted Resequencing

– CNV

– DNase I Hypersensitivity

– Metagenomics

– ChIP-Seq

– ChIA-PET

– Bacterial Sequencing

– Human Genome Resequencing

– Nucleosome Mapping

– Molecular Cytogenetics

– De novo Sequencing

24

What can these babies do?

• Illumina HiSeq 2000 - popular for genotyping

                      HiSeq 2000
Read length           2 x 100
Yield / run           250 Gb
Runs / genome         1/2
Depth                 54.9x
SNPs                  4,232,886
550k GT coverage      99.8%
Genotype concordance  99.3%

25

How are bases called given finite errors?

• One can sequence many times (e.g. 5x coverage)

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’   Reference Genome
             GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
                AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
           ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
                     ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
               TAGCTGATAGCTAGATAGCTGATGAGCCCGAT

Sequence Reads

Predicted Genotype ?

26

How are bases called given finite errors?

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’

Reference Genome

Sequence Reads

P(reads|A/A, read mapped) = 0.00000098

P(reads|A/C, read mapped) = 0.03125

P(reads|C/C, read mapped) = 0.000097
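A sketch of one way these three likelihoods can be computed. At the variable site, three of the five reads shown earlier carry C and two carry A. The per-base error rate of 0.01 and the two-allele error model (an error always shows the other allele) are assumptions, chosen because they reproduce the numbers above.

```python
def genotype_likelihood(bases, genotype, error_rate=0.01):
    """P(observed bases | genotype) at one site, reads assumed independent."""
    likelihood = 1.0
    for base in bases:
        # Each read comes from one of the two chromosomes with probability 1/2.
        p_base = 0.0
        for allele in genotype:
            p_base += 0.5 * ((1 - error_rate) if base == allele else error_rate)
        likelihood *= p_base
    return likelihood

pileup = "CCCAA"  # three reads show C, two show A at the variable site

for genotype in ("AA", "AC", "CC"):
    print(genotype, genotype_likelihood(pileup, genotype))
```

For the heterozygote every read has probability 1/2, giving 0.5^5 = 0.03125. The homozygous calls each need two or three sequencing errors.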

27

How are bases called

5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’ Reference Genome

Sequence Reads

Individual Based Prior: Every site has 1/1000 probability of varying.

Genotype  P(reads|genotype)  Prior    Posterior
A/A       0.00000098         0.00034  <.001
A/C       0.03125            0.00066  0.175
C/C       0.000097           0.99900  0.825
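The posteriors follow from Bayes' rule: multiply each likelihood by its prior, then divide by the sum over genotypes. A minimal sketch using the slide's own numbers (the function name is illustrative):

```python
def posteriors(likelihoods, priors):
    """Posterior = likelihood * prior, renormalized over all genotypes."""
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

likelihoods = {"A/A": 0.00000098, "A/C": 0.03125, "C/C": 0.000097}
individual_prior = {"A/A": 0.00034, "A/C": 0.00066, "C/C": 0.99900}

for genotype, p in posteriors(likelihoods, individual_prior).items():
    print(genotype, round(p, 3))
```

With this prior the homozygous reference call C/C still wins, even though the reads alone favor A/C.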

28


• Individual Based Prior

– Assumes all sites have an equal probability of showing polymorphism

– Specifically, assumption is that about 1/1000 bases differ from reference

– If reads were error free and sampling Poisson …

– … 14x coverage would allow for 99.8% genotype accuracy

– … 30x coverage of the genome needed to allow for errors and clustering
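One possible reading of the 14x figure, offered as an assumption rather than the author's derivation: with error-free reads and Poisson depth, a heterozygote is called correctly whenever both alleles are seen at least once. Each allele's read count is then Poisson with mean equal to half the coverage.

```python
import math

def het_detection_probability(mean_coverage):
    """P(both alleles of a heterozygote are seen at least once).

    Assumes error-free reads and Poisson depth, so each allele's read
    count is Poisson with mean mean_coverage / 2.
    """
    p_allele_seen = 1 - math.exp(-mean_coverage / 2)
    return p_allele_seen ** 2

for coverage in (4, 14, 30):
    print(coverage, round(het_detection_probability(coverage), 4))
```

At 14x this gives (1 - e^-7)^2, about 0.998, in line with the 99.8% quoted above.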

29

What if....

Sequence Reads

Population Based Prior: Use frequency information from examining others at the same site. In the example above, we estimated P(A) = 0.20



P(reads|C/C)= 0.000097 Prior(C/C) = 0.64 Posterior(C/C) = <.001
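A sketch of how a population-based prior can be built from the estimated allele frequency. Hardy-Weinberg proportions are an assumption here, though with P(A) = 0.20 they do give the Prior(C/C) = 0.64 shown above. The likelihoods are the ones from the earlier slides.

```python
def hardy_weinberg_prior(freq_a):
    """Genotype priors from an allele frequency, assuming Hardy-Weinberg."""
    freq_c = 1 - freq_a
    return {
        "A/A": freq_a ** 2,
        "A/C": 2 * freq_a * freq_c,
        "C/C": freq_c ** 2,
    }

likelihoods = {"A/A": 0.00000098, "A/C": 0.03125, "C/C": 0.000097}
prior = hardy_weinberg_prior(0.20)

# Bayes' rule: weight each likelihood by its prior, then renormalize.
weighted = {g: likelihoods[g] * prior[g] for g in likelihoods}
total = sum(weighted.values())
posterior = {g: w / total for g, w in weighted.items()}

best_call = max(posterior, key=posterior.get)
print(prior)
print(posterior)
print("most probable genotype:", best_call)
```

Once the site is known to be polymorphic in the population, the same five reads support the heterozygous call.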


31


Population Based Prior

• Uses frequency information obtained from examining other individuals

• Calling very rare polymorphisms still requires 20-30x coverage of the genome

• Calling common polymorphisms requires much less data

Haplotype Based Prior or Imputation Based Analysis

• Compares individuals with similar flanking haplotypes

• Calling very rare polymorphisms still requires 20-30x coverage of the genome

• Can make accurate genotype calls with 2-4x coverage of the genome

• Accuracy improves as more individuals are sequenced

32

What good is using statistics?

                0.5-1%  1-2%   2-5%

400 Deep Genomes (30x - current cost ~$4,000,000 - 2nd quarter ~$2,000,000)

Discovery Rate  100%    100%   100%
Het. Accuracy   100%    100%   100%
Effective N     400     400    400

3000 Shallow Genomes (4x - current cost ~$4,000,000 -> $2,000,000)

Discovery Rate  100%    100%   100%
Het. Accuracy   90.4%   97.3%  98.8%
Effective N     2406    2758   2873

This would cover essentially ALL 10 x 10^6 common SNPs in 2800 individuals. Affy cost now - $1,000,000 for 1/10 the number of common SNPs.
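The comparison can be restated as cost per effective individual. A small sketch using the figures above, assuming the three columns are allele-frequency bins and taking the current cost of ~$4,000,000 for either design:

```python
COST_USD = 4_000_000  # current cost quoted for either design

effective_n = {
    "400 deep genomes (30x)": {"0.5-1%": 400, "1-2%": 400, "2-5%": 400},
    "3000 shallow genomes (4x)": {"0.5-1%": 2406, "1-2%": 2758, "2-5%": 2873},
}

def cost_per_effective_sample(total_cost, n_by_bin):
    """Dollars spent per effective individual, for each frequency bin."""
    return {freq_bin: total_cost / n for freq_bin, n in n_by_bin.items()}

for design, n_by_bin in effective_n.items():
    costs = cost_per_effective_sample(COST_USD, n_by_bin)
    print(design, {freq_bin: round(cost) for freq_bin, cost in costs.items()})
```

For the same budget the shallow design buys roughly six to seven times as many effective individuals in every bin.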

33

Genotyping by sequencing

• Even with current technology, sequencing can be an alternative to chip genotyping

• Costs ~$5K per person for deep coverage

• Costs ~$800 per person for 4X coverage

• Using HapMap + knowledge about alleles, can study all SNPs with MAF > 1-2%

34

Does sequencing help?

• Too soon to be sure - but probably so

• Best work in 1000 Genomes Project

– 2 deeply sequenced trios

– 179 whole genomes sequenced at low coverage

– 8,820 exons deeply sequenced in 697 individuals

15M SNPs, 1M indels, 20,000 structural variants

35

Some highlights

Reduced Diversity Extending ~120 kb Around Genes

36

Allele frequency spectrum

37

Does sequencing improve association? Expression QTL example: TIMM22


40

Imputation

• Given detailed allele distribution data, it is possible to “guess” genotypes based on HapMap/neighboring markers

• Improves with better understanding of allele distribution

• Allows conversion of Affy/Illumina chip data into more detailed SNP information electronically

41

Imputation

                                     Imputation Accuracy (r^2)
Reference Panel      Release Date    MAF 1-3%  MAF 3-5%  MAF >5%
1000G Pilot (final)  June 2010       ~0.69     ~0.77     ~0.91
280 EUR (draft)      November 2010   ~0.73     ~0.78     ~0.92

• As more samples are sequenced, ability to impute individual SNPs improves

• As more samples are sequenced, it becomes possible to impute additional markers
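A sketch of how an accuracy figure of this kind can be computed. It assumes r^2 means the squared Pearson correlation between true genotypes (0, 1 or 2 copies of an allele) and imputed dosages, which is a common definition but is not stated on the slide. The genotypes and dosages below are made-up toy values.

```python
def squared_correlation(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between genotypes and imputed dosages."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    return cov * cov / (var_t * var_d)

# Toy data: allele counts for eight people and their imputed dosages.
true_genotypes = [0, 0, 1, 2, 1, 0, 0, 1]
imputed_dosages = [0.1, 0.0, 0.8, 1.7, 1.2, 0.3, 0.1, 0.6]

print(round(squared_correlation(true_genotypes, imputed_dosages), 3))
```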

42

Status of 1000 Genomes

• 25,487,060 variant sites called on 629 samples

– 7,922,125 sites in dbSNP 129

– 17,564,935 sites not in dbSNP 129

– 98.8% of HapMap III sites rediscovered

– Transition/transversion ratio of 2.21 vs 2.04 in Pilot

• As of November 2010:

– 1103 sequenced samples

– 22.6 Tb of raw sequence data
