Genome Wide Methodologies and Future Perspectives

Brian Krueger, PhD

Duke University

Center for Human Genome Variation

History of Genetic Linkage

• Mendel’s Laws
  – Law of Segregation
    • Each parent randomly passes one of two alleles to offspring
  – Law of Independent Assortment
    • Separate genes for separate traits are passed independently to offspring
    • In a cross of two double heterozygotes, trait combinations should appear in offspring in the ratio 9:3:3:1
  – Laws hold true for genes on different chromosomes or genes located far away from one another
• Linkage
  – Bateson and Punnett quickly found traits that didn’t assort independently
  – Thomas Hunt Morgan and his student Alfred Sturtevant found that recombination frequency is a good predictor of distance between genes
    • Genes that are inherited together must be closer to one another (linked)
    • Generated the first linkage maps
  – Serves as an important basis for understanding genetic association studies
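As a minimal sketch of the two ideas on this slide, the Python below enumerates a cross between two double heterozygotes to recover the 9:3:3:1 ratio, and computes a recombination frequency from offspring counts. The function names and the convention that uppercase letters mark dominant alleles are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def gametes(genotype):
    """All gametes from a two-gene genotype such as ("Aa", "Bb")."""
    gene1, gene2 = genotype
    return [allele1 + allele2 for allele1, allele2 in product(gene1, gene2)]


def dihybrid_cross(parent1, parent2):
    """Count offspring phenotypes, assuming uppercase alleles are dominant
    and the two genes assort independently."""
    counts = Counter()
    for g1, g2 in product(gametes(parent1), gametes(parent2)):
        # Each gamete is two characters: allele of gene 1, allele of gene 2.
        trait1 = "dominant" if g1[0].isupper() or g2[0].isupper() else "recessive"
        trait2 = "dominant" if g1[1].isupper() or g2[1].isupper() else "recessive"
        counts[(trait1, trait2)] += 1
    return counts


def recombination_frequency(recombinant_offspring, total_offspring):
    """Fraction of offspring carrying a recombinant combination of alleles."""
    return recombinant_offspring / total_offspring


if __name__ == "__main__":
    for phenotype, count in sorted(dihybrid_cross(("Aa", "Bb"), ("Aa", "Bb")).items()):
        print(phenotype, count)
```

For closely spaced genes, a recombination frequency of 1% is conventionally treated as one map unit (centimorgan), which is how recombination counts become a linkage map.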

Linkage Studies

• Model Organisms
  – Fruit flies, plants, etc.
  – Extremely important for understanding human genetics
  – Fruit flies can produce new generations of 400+ offspring approximately every week!
    • Can very quickly understand the genetics of trait heritability
• Familial Linkage Studies
  – Require multiple generations
  – Take decades to develop
  – Complicated by family participation
• Association studies
  – Subtly different from linkage studies
  – Try to apply knowledge of familial linkage to entire populations

Genome Wide Association Studies

• GWA studies
  – Aim to find genetic variants that are associated with traits
  – Typically used to elucidate complex disease traits
  – Focus on SNPs, indels, CNVs
  – Most often case/control studies
• SNP (Single Nucleotide Polymorphism)
  – Change in a single nucleotide position
• Indel (Insertion/Deletion)
  – Describes the insertion or deletion of nucleotides
• CNV (Copy Number Variation)
  – Large deletions or duplications of genetic material
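To make the case/control idea concrete, here is a hedged sketch of a single-SNP allelic association test: a Pearson chi-square on a 2x2 table of allele counts, with an odds ratio. The function name and the example counts are hypothetical, and real GWA pipelines add quality control, covariates, and multiple-testing correction that are not shown.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on allele counts.

    Each argument is (count of allele 1, count of allele 2).
    Returns (chi-square statistic, p-value, odds ratio).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


if __name__ == "__main__":
    # Hypothetical allele counts, for illustration only.
    chi2, p_value, odds_ratio = allelic_chi_square((60, 40), (40, 60))
    print(f"chi2={chi2:.2f} p={p_value:.4f} OR={odds_ratio:.2f}")
```

Because a genome-wide study repeats this test at up to millions of SNPs, the significance threshold applied to each p-value has to be far stricter than 0.05.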

GWA Study History

• Human Genome Project (1990-2003)
  – International project, more than a decade long, to determine the complete human genome sequence
  – Provided the reference genome for future research on genome variation
• Human HapMap (2002-2009)
  – Sequencing whole genomes is expensive
  – Needed a shortcut to understand how variation contributes to disease
  – Mapped millions of common known SNPs in 269 individuals
  – Theory that common SNPs are inherited and could be predictive of associated disease
  – Determine how SNPs from case/control studies associate with human disease
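The reason a common SNP can stand in for nearby unmeasured variation is usually expressed as linkage disequilibrium. The sketch below computes the standard r² statistic between two biallelic SNPs from phased haplotypes; the two-letter haplotype encoding ("A"/"a" at the first SNP, "B"/"b" at the second) is an assumption chosen for illustration.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    `haplotypes` is a list of two-character strings such as "AB" or "ab",
    one per phased chromosome. Both SNPs must be polymorphic in the sample.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n   # frequency of allele A
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # frequency of allele B
    p_ab = sum(h == "AB" for h in haplotypes) / n    # frequency of haplotype AB
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


if __name__ == "__main__":
    print(ld_r_squared(["AB"] * 5 + ["ab"] * 5))     # alleles always travel together
    print(ld_r_squared(["AB", "Ab", "aB", "ab"]))    # alleles are independent
```

An r² near 1 means genotyping one SNP effectively tells you the other, which is what lets a fixed panel of common SNPs cover much of the genome.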

Defining Association

• Variants are not always causal!
  – SNPs sometimes only serve as markers
  – Can play absolutely no role in the disease and even be located on different chromosomes from the gene actually responsible for the phenotype
• Population stratification
  – Variants differ by population
  – Variants that are important markers of disease in one population or ethnicity may not be effective markers in another
  – For GWA studies to be effective predictors in multiple populations, large datasets for each ethnicity must be obtained

GWAS SNP Genotyping

[Figure: genotype cluster plots for SNP rs1372493 in two panels, Norm R versus Norm Theta and Intensity (B) versus Intensity (A). Counts shown on the plot: 2317, 834, 74.]

• Bead array genotyping
  – Uses a chip containing beads with covalently attached baits
  – Baits hybridized to fragmented DNA
  – Baits SPECIFIC for the DNA just upstream of a SNP
  – Base extension with fluorescently labeled bases allows interrogation of the SNP (each base has a different color!)
  – A single bead chip can assay millions of SNPs
  – Colorimetric output plotted
    • Blue indicates homozygous for one version of the SNP - CC
    • Purple is heterozygous - CA
    • Red is homozygous for the other version of the SNP - AA
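The plot axes on this slide (Norm Theta, Norm R) suggest the usual polar transform of the two allele intensities. The sketch below assumes theta = (2/π)·arctan(B/A) and R = A + B, and assigns genotypes with fixed theta cutoffs. The cutoffs (0.2 and 0.8) and function names are hypothetical, and production genotype callers normalize intensities and fit clusters per SNP rather than thresholding.

```python
import math


def polar_transform(intensity_a, intensity_b):
    """Convert two allele intensities to (theta, r).

    theta runs from 0 (only allele A signal) to 1 (only allele B signal);
    r is the total signal intensity.
    """
    theta = (2 / math.pi) * math.atan2(intensity_b, intensity_a)
    r = intensity_a + intensity_b
    return theta, r


def call_genotype(theta, low=0.2, high=0.8):
    """Assign a genotype from theta using illustrative fixed cutoffs."""
    if theta < low:
        return "AA"
    if theta > high:
        return "BB"
    return "AB"


if __name__ == "__main__":
    # Hypothetical intensity pairs, for illustration only.
    for a, b in [(10000, 300), (5000, 5200), (200, 8000)]:
        theta, r = polar_transform(a, b)
        print(a, b, round(theta, 2), r, call_genotype(theta))
```

Each sample becomes one point, so the three genotype groups show up as three clusters along the theta axis.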

GWAS SNP Genotyping and Validation

• Real-time PCR
  – Use specific PCR probes to verify SNPs
  – Good for validating a handful of SNPs at a time
• Mass Array
  – Use mass spectrometry to find SNPs
  – Detected by looking at fragment mass differences
  – Good for detecting or validating a large number of SNPs rapidly
• Sanger sequencing
  – Gold standard validation method
  – Can determine the SNP at its exact position
  – Very robust

GWA Study History

• To this point in time, the power of most GWA studies was lacking
  – GWA not really genome wide
  – Looked at common variants across the genome
  – Missed rare variants and not always descriptive of disease causation
• Whole Genome Sequencing (WGS)
  – Actually assays the entire genome
  – Discovers all variants
  – Prohibitively costly before 2008
  – Current cost of WGS ~$4000
• 1000 Genomes Project (2008-)
  – Facilitated by plummeting sequencing costs and technological advancements
  – Goal to fully sequence the genomes of 1000 healthy individuals to provide a true picture of genome wide variation

Second Generation Sequencing

• Developed to increase throughput of Sanger sequencing
• Can sequence many molecules in parallel
  – Does not require homogeneous input
  – Sequenced as clusters
• Sequencing by synthesis
  – Bases are added, signals scanned, and then washed
  – Cycle repeated (30-2000x)

2nd Gen: Sequencing by Synthesis Overview

[Figure: workflow from Genomic DNA, to Fragmented DNA, to Ligate Adaptors, to Generate Clusters (on flowcell or beads), to Add Bases and Detect Signals; the last two steps are repeated hundreds of times on millions of clusters.]

Flavors of Sequencing

• Whole Genome Sequencing
  – Obtain whole blood or tissue sample
  – Create sequencing libraries of all DNA fragments
• Whole Exome Sequencing
  – Utilizes a selection protocol
  – Attach complementary RNA strands to beads
  – Fish out ONLY coding DNA sequences
  – Create sequencing libraries from enriched DNA
  – Reduces cost significantly
• Custom Capture
  – Same protocol as exome sequencing
  – Only target desired DNA sequences
• Amplicon Sequencing
  – Use PCR to amplify target DNA
  – Sequence amplified DNA (amplicon)

NGS Study Designs for Gene Discovery

Multiplex families

Case-control studies

Trio sequencing of sporadic diseases

De novo Mutation Calling/Filtering

Individual variant calling

Multi-sample variant calling

Variant calling

Visual Inspection

Cross-checking public databases

Sanger sequencing confirmation

Exome Variant Server 6500 exome sequenced individuals
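A hedged sketch of how the filtering steps listed here might combine for trio sequencing: keep heterozygous calls in the child that both parents confidently lack, and drop anything already present in a public database. The data layout (dictionaries keyed by variant, holding a genotype string and read depth), the depth cutoff of 10, and all names are assumptions for illustration; surviving candidates would still go on to visual inspection and Sanger confirmation.

```python
def is_confident_ref(sample, variant, min_depth):
    """True if the sample is called homozygous reference with enough reads."""
    genotype, depth = sample.get(variant, (None, 0))
    return genotype == "0/0" and depth >= min_depth


def candidate_de_novo(child, mother, father, known_variants=frozenset(), min_depth=10):
    """Return variants that look de novo in the child.

    Each sample maps (chrom, pos, ref, alt) -> (genotype, read depth).
    `known_variants` stands in for a public database lookup.
    """
    candidates = []
    for variant, (genotype, depth) in child.items():
        if genotype != "0/1" or depth < min_depth:
            continue  # keep only well-covered heterozygous calls in the child
        if variant in known_variants:
            continue  # already reported in the population, unlikely to be de novo
        if is_confident_ref(mother, variant, min_depth) and is_confident_ref(
            father, variant, min_depth
        ):
            candidates.append(variant)
    return sorted(candidates)
```

Requiring adequate depth in both parents matters because a poorly covered parental site can hide an inherited allele and make it look de novo.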

Detecting Copy Number Variants

[Figure: read depth across windows, with a heterozygous deletion, a homozygous deletion, and a duplication labeled.]

ERDS (Estimation by Read Depth with SNVs): The average read depth (RD) of every 2-kb window was calculated, followed by GC correction. A paired hidden Markov model was applied to infer the copy number of every window using both RD and heterozygosity information.
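The read-depth idea can be illustrated with a deliberately simplified sketch: average the depth in fixed windows, then scale each window against the median, assuming the typical window is diploid. This is not ERDS itself. It omits GC correction, heterozygosity information, and the paired hidden Markov model, and the function names are hypothetical.

```python
from statistics import median


def window_depths(per_base_depth, window_size):
    """Average read depth in consecutive, non-overlapping windows."""
    return [
        sum(per_base_depth[start:start + window_size]) / window_size
        for start in range(0, len(per_base_depth) - window_size + 1, window_size)
    ]


def naive_copy_number(depths):
    """Estimate copy number per window, assuming the median window is diploid."""
    baseline = median(depths)
    return [round(2 * depth / baseline) for depth in depths]


if __name__ == "__main__":
    # Hypothetical window depths: one heterozygous deletion, one homozygous
    # deletion, and one duplication against a 30x background.
    print(naive_copy_number([30, 30, 15, 30, 0, 45, 30]))  # [2, 2, 1, 2, 0, 3, 2]
```

A heterozygous deletion halves the depth, a homozygous deletion removes it, and a duplication raises it by about half, which is why the three events are distinguishable from depth alone when coverage is even.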

Illumina

• Uses a flow cell
• Clusters generated on the slide via bridge amplification
• Sequencing by synthesis
  – Performed by flowing labeled bases over the flow cell
  – 4 pictures taken (one for each base)
  – Cluster color determined at each cycle allows interrogation of sequence
• Advantages
  – Low cost per base
  – Very high throughput
• Limitations
  – High cost per experiment
  – Short read length (30-150 bp)
  – Illumina has acquired a company that uses new technology to reach read lengths of 2-10 kb

Schadt et al 2010 HMG

Ion Torrent

• Emulsion PCR is used to generate clusters on a bead
• Sequencing by synthesis
  – Similar in principle to pyrosequencing, which relies on the release of pyrophosphate for detection
  – Instead of a visual cue, the system senses the release of H+ as each base is flowed over the beads
• Advantages
  – Short run time
  – Does not require modified bases
  – Longer read length (200 bp)
• Limitations
  – Low data output
  – High homopolymer error rate

Third Generation Sequencing

• Defined as single molecule sequencing
• Less complex sample prep
• Much longer read length
  – The short read length of second generation sequencing (SGS) is a huge disadvantage for de novo sequencing applications
• Two categories
  – Sequencing by synthesis
  – Direct sequencing
    • Passing the molecule through a nanopore
    • Using atomic force microscopy
• Bleeding edge technology
  – Many technical hurdles
  – Currently very high error rates

Pacific Biosciences

• Utilizes single molecule sequencing by synthesis
• Extremely complex system
  – Each well contains a single DNA molecule and an immobilized polymerase
  – No reagent washing
  – Employs confocal microscopy to only detect fluorescence at the polymerase
• Advantages
  – Very long read length (1-15 kb)
  – Low complexity sample prep
  – Very fast data generation (real time)
• Disadvantages
  – Prone to sequencing errors (~15% error rate)
  – Company on the verge of bankruptcy

Third/Second Generation Sequencing

• Currently only one viable high throughput long read sequencing platform
  – PacBio system has a 15% error rate
  – Need long reads for many applications, from de novo sequencing to haplotyping
• Second generation sequencers are high throughput and accurate
  – Short reads are hard to assemble and leave gaps in repetitive sequences
• Can use both together as a highly accurate and extremely powerful tool for de novo sequencing applications
  – Use the PacBio assembly as a scaffold
  – Correct errors by aligning HiSeq reads on top
  – Effective error rate of 0.1%
  – Expensive but extremely fast and accurate compared to other methods

Koren et al 2012 Nature Biotechnology

Future: Nanopore Sequencing

• Leading candidate is Oxford Nanopore
• Concept
  – Detect the electrical current flowing through the pore
  – Each base causes a detectable change in the current
  – Results in direct sequencing
  – Theoretically could be used to sequence RNA and protein too
• Advantages
  – Long read length
  – Plug and play
  – Easily scalable
• Disadvantages
  – No hard data yet
  – No specific release date

Credit: John MacNeill/TechnologyReview

Future: Direct sequencing

• Concept stage techniques
  – Significant technical hurdles to overcome
  – Mostly proof of concept experiments
• IBM DNA Transistor
  – Bases read as single stranded DNA passes through the transistor
  – Gold bands represent metal, gray bands are the dielectric
• Atomic force microscopy sequencing
  – Use AFM tip to detect each base of single stranded DNA

Credit: IBM

Credit: Lee et al US PAT 20040124084

Sequencing Applications

• Old techniques that used to take days or years to perform can now be completed in hours
• Next generation sequencing has opened a new door for addressing very complicated genetic questions
  – Has huge potential to revolutionize human healthcare
  – Survey complex tumor types
  – Research into macro and micro community genomics
  – Reveal evolutionary history

De novo Sequencing

• Human genome took over a decade to complete and cost about $3 billion
  – Done by laboriously cloning overlapping segments of the human genome into BAC libraries and Sanger sequencing each one
  – Genome assembled using computers to line up overlapping sequences
• Current estimate is around $4000
  – Can be completed in a week
  – Companies like Complete Genomics say they have already sequenced thousands of human genomes
• Future
  – Long read sequencers will make agricultural sequencing more viable
  – Whole genome sequencing for human diagnostics will become routine
  – Increasing the catalog of organismal genomes will improve our understanding of evolution and development
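"Lining up overlapping sequences" can be sketched as a greedy overlap merge: repeatedly find the pair of reads with the longest suffix-to-prefix overlap and join them. This toy version assumes error-free reads from one strand and no repeats, and its names and the minimum overlap of 3 bases are illustrative; real assemblers use overlap or de Bruijn graphs and must handle sequencing errors and repetitive sequence.

```python
def overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Merge reads by repeatedly joining the pair with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_length, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    length = overlap(left, right, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_length:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ACGTTGC", "TTGCATG", "CATGGA"]))  # ['ACGTTGCATGGA']
```

The sketch also shows why read length matters: a repeat longer than the reads creates ambiguous overlaps that a greedy merge cannot resolve.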

Genome Mutation Analysis

• Previously done by completing complicated and time consuming familial linkage studies and targeted Sanger sequencing
• Next generation sequencing can look at every gene at once
  – Can produce a genetic map of the complete genome
  – Used to detect genetic polymorphisms
  – See every possible mutation
• Future
  – Whole genome sequence analysis
  – Targeted genome sequencing analysis using predetermined sequence selection arrays (ex: exome enrichment)

Pharmacogenetics

• Very hot topic in the biotech and insurance industries
• Use genetic typing to predict how a person might respond to different drug treatments
• Currently relies on microarrays
• NGS could provide significantly more information at more loci
  – Microarrays only look at a handful of polymorphisms
  – Current NGS approaches port the microarray technique to enrich pools for sequencing
• Future
  – As the catalog of human genomes increases, it will be easier to predict responses to treatment before drugs are administered

Gauthier et al 2007 Cancer Cell

Epigenetics

• Defined as heritable genetic information that is not coded in the DNA bases
  – DNA methylation
  – Histone modifications
• Previous mechanisms for detecting these chromatin or DNA modifications relied on targeted probing
  – ChIP-PCR
  – Bisulfite sequencing
  – Footprinting assays
• Next generation sequencing changed everything
  – Whole genome methylation mapping (MAP-IT)
  – Whole genome histone modification and protein binding mapping (ChIP-Seq - acetylation, methylation, etc.)
• ENCODE project

ENCyclOpedia of Dna Elements (ENCODE)

• International project
  – Follow up to the Human Genome Project
• Only about 2% of the human genome codes for protein
  – Creating and maintaining DNA is biochemically expensive
  – What’s the other 98% of the genome doing?
• ENCODE goals
  – Determine the functional elements of the human genome
  – Protein coding
  – Non-coding RNA
  – mRNA expression
  – Regulatory protein binding sites
  – Histone modifications
• Preliminary estimates suggest that 80% of human DNA is functional!

Transcriptome/Expression Analysis

• Gene expression analysis is important for disease discovery and cancer diagnosis
• Expression analysis first relied on Northern blotting followed by DNA microarrays
  – Both cases require a probe
  – Need to “know” what you are looking for
  – Low resolution screening
• Next generation approaches screen the entire transcriptome (RNA-Seq)
  – Single base resolution of expression
  – Can see level of expression and also visualize mutations in expressed sequences
• Future
  – Important for diagnosing/treating cancer and heritable diseases
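To turn RNA-Seq read counts into a "level of expression", counts are commonly normalized for gene length and sequencing depth. The sketch below computes transcripts per million (TPM), one widely used normalization; the slide does not name a specific method, so this choice, the function name, and the example numbers are assumptions for illustration.

```python
def tpm(read_counts, gene_lengths_bp):
    """Transcripts per million for each gene.

    read_counts:     gene -> number of reads assigned to the gene
    gene_lengths_bp: gene -> transcript length in base pairs
    """
    # Reads per kilobase corrects for longer genes collecting more reads.
    rates = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000)
        for gene in read_counts
    }
    # Scaling so the values sum to one million corrects for sequencing depth.
    scale = sum(rates.values()) / 1_000_000
    return {gene: rate / scale for gene, rate in rates.items()}


if __name__ == "__main__":
    # Hypothetical counts and lengths: equal read counts, but geneB is twice as long.
    values = tpm({"geneA": 100, "geneB": 100}, {"geneA": 1000, "geneB": 2000})
    for gene, value in values.items():
        print(gene, round(value))
```

Unlike a microarray probe, this needs no prior choice of target: any transcript that produces reads gets a value.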

Phenotypic Correlation

• NGS generates huge datasets with 85-99.9% base accuracy
  – Must determine which signals are real, and which are noise/errors
  – Most promising hits are validated by other assays (Sanger, qRT-PCR, mass spec)
  – How do we determine which hits to validate?
• Current datasets are very small and of limited utility, even in pharmacogenetics
• Validated hits can be distractions
  – Tumor diversity presents multiple escape routes during targeted treatment
• Future
  – Require large validated datasets that are ethnically and geographically diverse

See the NYTimes series on whole genome sequencing: http://nyti.ms/No4fgd
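The 85-99.9% accuracy range quoted on this slide can be put on the Phred scale that sequencers use to report base quality, Q = -10·log10(error probability), and turned into an expected number of wrong bases per read. The function names are illustrative, and treating every base in a read as equally accurate is a simplifying assumption.

```python
import math


def phred_quality(accuracy):
    """Phred quality score for a given per-base accuracy (0 < accuracy < 1)."""
    return -10 * math.log10(1 - accuracy)


def expected_errors(read_length, accuracy):
    """Expected number of miscalled bases in a read of the given length."""
    return read_length * (1 - accuracy)


if __name__ == "__main__":
    for accuracy in (0.85, 0.99, 0.999):
        print(
            f"accuracy={accuracy} "
            f"Q={phred_quality(accuracy):.1f} "
            f"errors per 100 bp={expected_errors(100, accuracy):.1f}"
        )
```

At 99.9% accuracy (Q30) a 100 bp read carries about 0.1 expected errors, while at 85% it carries about 15, which is why candidate variants are confirmed by an independent assay.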

Metagenomics

• Used to survey macro and micro environments
  – Microbial communities (soil/gut)
  – Tumors
  – Plant communities
  – Coral reef ecosystems
• Previous techniques coupled mtDNA or ribosomal Sanger sequencing with BLAST analysis
  – Limited by number of sequenced species
  – Can determine who, but not what is going on
• NGS approaches now being used to determine exactly what organisms are present and how they interact
  – Can get expression data and link it back to community groups
  – Survey community diversity
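One common way to summarize community diversity from read counts per taxon is the Shannon index, H = -Σ p·ln(p). The slide does not specify a diversity measure, so this is an illustrative choice, and the function name and example counts are hypothetical.

```python
import math


def shannon_diversity(taxon_counts):
    """Shannon diversity index (natural log) from taxon -> read count."""
    total = sum(taxon_counts.values())
    return -sum(
        (count / total) * math.log(count / total)
        for count in taxon_counts.values()
        if count > 0
    )


if __name__ == "__main__":
    # Hypothetical read counts per taxon, for illustration only.
    even_community = {"taxon1": 250, "taxon2": 250, "taxon3": 250, "taxon4": 250}
    skewed_community = {"taxon1": 970, "taxon2": 10, "taxon3": 10, "taxon4": 10}
    print(round(shannon_diversity(even_community), 3))
    print(round(shannon_diversity(skewed_community), 3))
```

The index is highest when reads are spread evenly across taxa and falls toward zero as one taxon dominates.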

Data

• Absolutely the largest roadblock for next generation sequencing
• Terabytes of data are useless if we can’t efficiently analyze the data
• How long should data be kept?
  – Depends on application
    • Human diagnostic sequencing?
    • Research sequencing?
• Where should data be kept and processed?
  – Local or cloud (Amazon, etc.)?
  – Cost of infrastructure vs. cost of cloud service
  – Security issues
• Future
  – Cloud based solutions will become more attractive
