Genome Wide Methodologies and Future Perspectives
Slide notes loosely follow what was presented.
Brian Krueger, PhD
Duke University
Center for Human Genome Variation
• Mendel’s Laws
  – Law of Segregation
    • Each parent randomly passes one of its two alleles to each offspring
  – Law of Independent Assortment
    • Separate genes for separate traits are passed independently to offspring
    • In a dihybrid cross, phenotypes should appear in offspring in a 9:3:3:1 ratio
  – Laws hold true for genes on different chromosomes or genes located far away from one another
• Linkage
  – Bateson and Punnett quickly found traits that didn’t assort independently
  – Thomas Hunt Morgan and his student Alfred Sturtevant found that recombination frequency is a good predictor of distance between genes
    • Genes that are inherited together must be closer to one another (linked)
    • Generated the first linkage maps
  – Serves as an important basis for understanding genetic association studies
History of Genetic Linkage
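The 9:3:3:1 ratio and the link between recombination frequency and map distance can be checked with a few lines of Python. This is a minimal sketch, assuming two genes with simple dominance (uppercase allele dominant) and a cross between two double heterozygotes. The function names and the genotype string format are made up for illustration, and the "1% recombination is roughly 1 map unit" rule only holds approximately for closely linked genes.

```python
from collections import Counter
from itertools import product


def gametes(genotype):
    """List the gametes of a two-gene genotype such as 'AaBb' (one allele per gene)."""
    gene1, gene2 = genotype[:2], genotype[2:]
    return [a + b for a, b in product(gene1, gene2)]


def phenotype(gamete1, gamete2):
    """A trait shows the dominant form if either gamete carries the uppercase allele."""
    first = "A" if "A" in (gamete1[0], gamete2[0]) else "a"
    second = "B" if "B" in (gamete1[1], gamete2[1]) else "b"
    return first + second


def dihybrid_cross(parent1="AaBb", parent2="AaBb"):
    """Count offspring phenotypes over all equally likely gamete pairings."""
    counts = Counter()
    for g1, g2 in product(gametes(parent1), gametes(parent2)):
        counts[phenotype(g1, g2)] += 1
    return counts


def map_distance_cm(recombinant_offspring, total_offspring):
    """Recombination frequency in percent, read as approximate map units (cM)."""
    return 100.0 * recombinant_offspring / total_offspring


counts = dihybrid_cross()
print(counts)
print(map_distance_cm(17, 100))  # hypothetical offspring counts
```

Linked genes break the independent assortment assumption built into `gametes`, which is exactly the deviation Bateson, Punnett, Morgan and Sturtevant observed.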
• Model Organisms
  – Fruit flies, plants, etc.
  – Extremely important for understanding human genetics
  – Fruit flies can produce new generations of 400+ offspring approximately every week!
    • Can very quickly understand the genetics of trait heritability
• Familial Linkage Studies
  – Require multiple generations
  – Take decades to develop
  – Complicated by family participation
• Association studies
  – Subtly different from linkage studies
  – Try to apply knowledge of familial linkage to entire populations
Linkage Studies
• GWA studies
  – Aim to find genetic variants that are associated with traits
  – Typically used to elucidate complex disease traits
  – Focus on SNPs, indels, CNVs
  – Most often case/control studies
• SNP (Single Nucleotide Polymorphism)
  – Change in a single nucleotide position
• Indel (Insertion/Deletion)
  – Describes the insertion or deletion of nucleotides
• CNV (Copy Number Variation)
  – Large deletions or duplications of genetic material
Genome Wide Association Studies
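A case/control comparison at a single SNP can be sketched as a chi-square test on a 2x2 table of allele counts. The slides do not say which statistic is used, so this is one common choice and not necessarily the one behind the studies discussed. The function name and the allele counts are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, p_value, odds_ratio


# Hypothetical allele counts: 60/40 in cases versus 40/60 in controls.
chi2, p_value, odds_ratio = allelic_chi_square(60, 40, 40, 60)
print(chi2, p_value, odds_ratio)
```

In a real GWA study this test is repeated for every SNP on the array, so the significance threshold has to be corrected for the number of tests.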
• Human Genome Project (1990-2003)
  – International project, more than a decade long, to determine the complete human genome sequence
  – Provided the reference genome for future research on genome variation
• Human HapMap (2002-2009)
  – Sequencing whole genomes is expensive
  – Needed a shortcut to understand how variation contributes to disease
  – Mapped millions of common known SNPs in 269 individuals
  – Theory that common SNPs are inherited and could be predictive of associated disease
  – Determine how SNPs from case/control studies associate with human disease
GWA Study History
• Variants are not always causal!
  – SNPs sometimes only serve as markers
  – Can play absolutely no role in the disease and even be located on different chromosomes from the gene actually responsible for the phenotype
• Population stratification
  – Variants differ by population
  – Variants that are important markers of disease in one population or ethnicity may not be effective markers in another
  – For GWA studies to be effective predictors in multiple populations, large datasets for each ethnicity must be obtained
Defining Association
GWAS SNP Genotyping
[Figure: genotype cluster plots for SNP rs1372493: Norm R vs. Norm Theta (cluster counts 2317, 834, 74) and Intensity (B) vs. Intensity (A)]
• Bead array genotyping
  – Uses a chip containing beads with covalently attached baits
  – Baits are hybridized to fragmented DNA
  – Baits are SPECIFIC for the DNA just upstream of a SNP
  – Base extension with fluorescently labeled bases allows interrogation of the SNP (each base has a different color!)
  – A single bead chip can assay millions of SNPs
  – Colorimetric output plotted
    • Blue indicates homozygous for one version of the SNP - CC
    • Purple is heterozygous - CA
    • Red is homozygous for the other version of the SNP - AA
• Real-time PCR
  – Use specific PCR probes to verify SNPs
  – Good for validating a handful of SNPs at a time
• Mass Array
  – Use mass spec to find SNPs
  – Detected by looking at fragment mass differences
  – Good for detecting or validating a large number of SNPs rapidly
• Sanger sequencing
  – Gold standard validation method
  – Can determine the SNP at its exact position
  – Very robust
GWAS SNP Genotyping and Validation
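The cluster plots used for array genotyping can be mimicked with a small sketch: convert the two allele intensities into an angle (Norm Theta) and a total signal (Norm R), then assign a genotype by where the angle falls. The thresholds, the no-call cutoff and the function names are illustrative assumptions. Production genotyping software fits cluster positions per SNP instead of using fixed cutoffs.

```python
import math


def norm_theta_r(intensity_a, intensity_b):
    """Polar-style transform: theta is 0 for pure allele A signal, 1 for pure allele B."""
    theta = (2 / math.pi) * math.atan2(intensity_b, intensity_a)
    r = intensity_a + intensity_b
    return theta, r


def call_genotype(intensity_a, intensity_b, low=0.25, high=0.75, min_r=0.2):
    """Assign AA / AB / BB from theta, or 'NC' (no call) when total signal is too weak.

    low, high and min_r are hypothetical cutoffs chosen for this example.
    """
    theta, r = norm_theta_r(intensity_a, intensity_b)
    if r < min_r:
        return "NC"
    if theta < low:
        return "AA"
    if theta > high:
        return "BB"
    return "AB"


# Hypothetical normalized intensities for four samples.
samples = [(1.0, 0.05), (0.5, 0.5), (0.05, 1.0), (0.05, 0.05)]
calls = [call_genotype(a, b) for a, b in samples]
print(calls)
```

Samples that land between clusters or have weak signal are the ones worth confirming with real-time PCR, mass spec or Sanger sequencing.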
• To this point in time, the power of most GWA studies was lacking
  – GWA not really genome wide
  – Looked at common variants across the genome
  – Missed rare variants and not always descriptive of disease causation
• Whole Genome Sequencing (WGS)
  – Actually assays the entire genome
  – Discovers all variants
  – Prohibitively costly before 2008
  – Current cost of WGS ~$4000
• 1000 Genomes Project (2008-)
  – Facilitated by plummeting sequencing costs and technological advancements
  – Goal to fully sequence the genomes of 1000 healthy individuals to provide a true picture of genome wide variation
GWA Study History
• Developed to increase throughput of Sanger sequencing
• Can sequence many molecules in parallel
  – Does not require homogeneous input
  – Sequenced as clusters
• Sequencing by synthesis
  – Bases are added, signals scanned, and then washed
  – Cycle repeated (30-2000x)
Second Generation Sequencing
2nd Gen: Sequencing by Synthesis Overview
[Figure: Genomic DNA → Fragmented DNA → Ligate Adaptors → Generate Clusters (On Flowcell or Beads) → Add Bases → Detect Signals; repeat hundreds of times on millions of clusters]
Flavors of Sequencing
• Whole Genome Sequencing
  – Obtain whole blood or tissue sample
  – Create sequencing libraries of all DNA fragments
• Whole Exome Sequencing
  – Utilizes a selection protocol
  – Attach complementary RNA strands to beads
  – Fish out ONLY coding DNA sequences
  – Create sequencing libraries from enriched DNA
  – Reduces cost significantly
• Custom Capture
  – Same protocol as exome sequencing
  – Only target desired DNA sequences
• Amplicon Sequencing
  – Use PCR to amplify target DNA
  – Sequence amplified DNA (amplicon)
NGS Study Designs for Gene Discovery
Multiplex families
Case-control studies
Trio sequencing of sporadic diseases
De novo Mutation Calling/Filtering
Individual variant calling
Multi-sample variant calling
Variant calling
Visual Inspection
Cross-checking public databases
Sanger sequencing confirmation
Exome Variant Server 6500 exome sequenced individuals
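The core of the de novo filtering step in a trio design can be sketched as a set comparison: keep variants called in the child that are absent from both parents and from public databases. This is a simplified illustration, not the pipeline used for the work presented. The variant tuples, the `known_variants` set and the function name are hypothetical, and a real pipeline would also check read depth and genotype quality in the parents before trusting an "absent" call.

```python
def candidate_de_novo(child, mother, father, known_variants=frozenset()):
    """Return child variants not seen in either parent or in a public database.

    Each variant is a (chromosome, position, ref, alt) tuple.
    """
    return sorted(
        variant
        for variant in child
        if variant not in mother
        and variant not in father
        and variant not in known_variants
    )


# Hypothetical variant calls for one trio.
child = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T"), ("chr7", 42, "G", "A")}
mother = {("chr1", 1000, "A", "G")}
father = {("chr3", 777, "T", "C")}
known_variants = {("chr7", 42, "G", "A")}  # stands in for a public database lookup

candidates = candidate_de_novo(child, mother, father, known_variants)
print(candidates)
```

Candidates that survive this filter would then go to visual inspection and Sanger sequencing confirmation.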
Detecting Copy Number Variants
[Figure: read depth across windows showing a heterozygous deletion, a homozygous deletion, and a duplication]
ERDS (Estimation by Read Depth with SNVs): The average read depth (RD) of every 2-kb window was calculated, followed by GC correction. A paired hidden Markov model was then applied to infer the copy number of every window, using both RD information and heterozygosity information.
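The read-depth idea behind CNV detection can be shown with a much simpler stand-in than ERDS: average the depth in fixed windows, then scale each window against the typical depth and round to the nearest copy number. This sketch deliberately leaves out the GC correction, the heterozygosity information and the hidden Markov model, so it is not the ERDS algorithm. The function names and depth values are hypothetical.

```python
from statistics import median


def window_depths(per_base_depth, window=2000):
    """Average read depth in consecutive windows (2 kb by default)."""
    averages = []
    for start in range(0, len(per_base_depth), window):
        chunk = per_base_depth[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages


def naive_copy_numbers(window_rd, ploidy=2):
    """Round each window's depth, relative to the median window, to a copy number."""
    baseline = median(window_rd)
    return [round(ploidy * rd / baseline) for rd in window_rd]


# Hypothetical average depths for seven windows.
window_rd = [30, 31, 15, 0, 29, 45, 30]
copies = naive_copy_numbers(window_rd)
print(copies)
```

Half the usual depth reads as a heterozygous deletion (1 copy), zero depth as a homozygous deletion (0 copies), and 1.5x depth as a duplication (3 copies).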
• Uses a flow cell
• Clusters generated on slide via bridge amplification
• Sequencing by synthesis
  – Performed by flowing labeled bases over flow cell
  – 4 pictures taken (one for each base)
  – Cluster color determined at each cycle allows interrogation of sequence
• Advantages
  – Low cost per base
  – Very high throughput
• Limitations
  – High cost per experiment
  – Short read length (30-150bp)
  – Acquired a company that uses new tech to reach read lengths of 2-10Kb
Illumina
Schadt et al 2010 HMG
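The "one picture per base, per cycle" idea reduces to picking the brightest channel for each cluster at each cycle. The sketch below assumes the four channel intensities are already extracted and normalized for one cluster. The channel order, the intensity values and the function name are illustrative, and real base callers also correct for channel cross-talk and phasing.

```python
BASES = "ACGT"  # assumed channel order for this example


def call_bases(cycles):
    """Call one base per cycle by taking the channel with the highest intensity."""
    read = []
    for intensities in cycles:
        brightest = max(range(len(BASES)), key=lambda channel: intensities[channel])
        read.append(BASES[brightest])
    return "".join(read)


# Hypothetical intensities (A, C, G, T) for one cluster over five cycles.
cluster_cycles = [
    (0.90, 0.05, 0.03, 0.02),
    (0.04, 0.02, 0.10, 0.84),
    (0.03, 0.01, 0.08, 0.88),
    (0.05, 0.07, 0.85, 0.03),
    (0.10, 0.80, 0.06, 0.04),
]
read = call_bases(cluster_cycles)
print(read)
```

Because each cycle adds exactly one base, read length is set by the number of cycles, which is why reads stay short.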
• Emulsion PCR is used to generate clusters on a bead
• Sequencing by synthesis
  – Similar in principle to pyrosequencing, which relies on release of pyrophosphate for detection
  – Instead of a visual cue, system senses the release of H+ as each base is flowed over the beads
• Advantages
  – Short run time
  – Does not require modified bases
  – Longer read length (200bp)
• Limitations
  – Low data output
  – High homopolymer error rate
Ion Torrent
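Why homopolymers are a problem follows from how flow-based sequencing works: one unmodified base is flowed at a time, and every matching position in a run incorporates in the same flow, so the signal has to encode the run length. The sketch below simulates ideal, noise-free signals. The flow order `"TACG"` and the function names are assumptions for the example, and the input is treated as the strand being synthesized.

```python
def flowgram(synthesized, flow_order="TACG", n_flows=12):
    """Return (base, run_length) per flow for an ideal, noise-free run."""
    signals = []
    position = 0
    for flow in range(n_flows):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A whole homopolymer run incorporates during a single flow.
        while position < len(synthesized) and synthesized[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
    return signals


def decode(signals):
    """Rebuild the sequence from per-flow run lengths."""
    return "".join(base * run for base, run in signals)


signals = flowgram("TTACCCG", n_flows=4)
print(signals)
print(decode(signals))
```

With real, noisy signals, telling a run of 3 from a run of 4 is much harder than telling 0 from 1, which is where the homopolymer errors come from.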
• Defined as single molecule sequencing
• Less complex sample prep
• Much longer read length
  – Short read length of second generation sequencing (SGS) is a huge disadvantage for de novo sequencing applications
• Two categories
  – Sequencing by synthesis
  – Direct sequencing
    • Passing molecule through a nanopore
    • Using atomic force microscopy
• Bleeding edge technology
  – Many technical hurdles
  – Currently very high error rates
Third Generation Sequencing
• Utilizes single molecule sequencing by synthesis
• Extremely complex system
  – Each well contains a single DNA molecule and an immobilized polymerase
  – No reagent washing
  – Employs confocal microscopy to only detect fluorescence at the polymerase
• Advantages
  – Very long read length (1-15kb)
  – Low complexity sample prep
  – Very fast data generation (real time)
• Disadvantages
  – Prone to sequencing errors (~15% error rate)
  – Company on the verge of bankruptcy
Pacific Biosciences
• Currently only one viable high throughput long read sequencing platform
  – PacBio system has a 15% error rate
  – Need long reads for many applications from de novo sequencing to haplotyping
• Second generation sequencers are high throughput and accurate
  – Short reads are hard to assemble and leave gaps in repetitive sequences
• Can use both together as a highly accurate and extremely powerful tool for de novo sequencing applications
  – Use PacBio assembly as a scaffold
  – Correct errors by aligning HiSeq reads on top
  – Effective error rate of 0.1%
  – Expensive but extremely fast and accurate compared to other methods
Third/Second Generation Sequencing
Koren et al 2012 Nature Biotechnology
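The hybrid correction idea (long read as scaffold, accurate short reads laid on top) can be illustrated with a per-position majority vote. This is a toy sketch and not the method from the cited paper: it assumes the short reads are already aligned with known start offsets and only fixes substitutions, whereas real long-read errors include many insertions and deletions. The function name and sequences are hypothetical.

```python
from collections import Counter


def correct_long_read(long_read, aligned_short_reads):
    """Replace each long-read base with the majority base from overlapping short reads.

    aligned_short_reads is a list of (start_offset, sequence) pairs.
    Positions with no short-read coverage keep the original base.
    """
    votes = [Counter() for _ in long_read]
    for start, sequence in aligned_short_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(long_read):
                votes[position][base] += 1

    corrected = []
    for original, counter in zip(long_read, votes):
        if counter:
            base, count = counter.most_common(1)[0]
            # Only overwrite when a strict majority of covering reads agree.
            corrected.append(base if count > sum(counter.values()) / 2 else original)
        else:
            corrected.append(original)
    return "".join(corrected)


# Hypothetical long read with one substitution error at index 4 (T instead of A).
noisy_long_read = "ACGTTCGA"
short_reads = [(0, "ACGTA"), (2, "GTACG"), (3, "TACGA")]
corrected = correct_long_read(noisy_long_read, short_reads)
print(corrected)
```

The more short reads cover a position, the less likely a shared error survives the vote, which is how the combined error rate drops far below that of the long reads alone.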
• Leading candidate is Oxford Nanopore
• Concept
  – Detect the ionic current flowing through the pore
  – Each base causes a detectable change in the current
  – Results in direct sequencing
  – Theoretically could be used to sequence RNA and protein too
• Advantages
  – Long read length
  – Plug and play
  – Easily scalable
• Disadvantages
  – No hard data yet
  – No specific release date
Future: Nanopore Sequencing
Credit: John MacNeill/TechnologyReview
• Concept stage techniques
  – Significant technical hurdles to overcome
  – Mostly proof of concept experiments
• IBM DNA Transistor
  – Bases read as single stranded DNA passes through the transistor
  – Gold bands represent metal, gray bands are the dielectric
• Atomic force microscopy sequencing
  – Use AFM tip to detect each base of single stranded DNA
Future: Direct sequencing
Credit: IBM
Credit: Lee et al US PAT 20040124084
• Old techniques which used to take days or years to perform can now be completed in hours
• Next generation sequencing has opened a new door for addressing very complicated genetic questions
  – Has huge potential to revolutionize human healthcare
  – Survey complex tumor types
  – Research into macro and micro community genomics
  – Reveal evolutionary history
Sequencing Applications
• Human genome took more than 10 years to complete and cost $3 billion
  – Done by laboriously cloning overlapping segments of the human genome into BAC libraries and Sanger sequencing each one
  – Genome assembled using computers to line up overlapping sequences
• Current estimate is around $4000
  – Can be completed in a week
  – Companies like Complete Genomics say they have already sequenced thousands of human genomes
• Future
  – Long read sequencers will make agricultural sequencing more viable
  – Whole genome sequencing for human diagnostics will become routine
  – Increasing the catalog of organismal genomes will improve our understanding of evolution and development
De novo Sequencing
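"Lining up overlapping sequences" can be illustrated with a toy greedy assembler: repeatedly merge the two fragments with the longest suffix/prefix overlap. This is a teaching sketch under strong assumptions (error-free reads, no repeats, a single strand) and not how the Human Genome Project assemblers or modern tools work. The function names and reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of at least min_len remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge; return separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATTGATT", "GATTGCC", "TGCCAAT"]
contigs = greedy_assemble(reads)
print(contigs)
```

Short reads make this harder because repeats longer than the read produce ambiguous overlaps, which is one reason long read sequencers matter for de novo work.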
• Previously done by completing complicated and time consuming familial linkage studies and targeted Sanger sequencing
• Next generation sequencing can look at every gene at once
  – Can produce a genetic map of the complete genome
  – Used to detect genetic polymorphisms
  – See every possible mutation
• Future
  – Whole genome sequence analysis
  – Targeted genome sequencing analysis using predetermined sequence selection arrays (e.g., exome enrichment)
Genome Mutation Analysis
• Very hot topic in the biotech and insurance industries
• Use genetic typing to predict how a person might respond to different drug treatments
• Currently relies on microarrays
• NGS could provide significantly more information at more loci
  – Microarrays only look at a handful of polymorphisms
  – Current NGS approaches port the microarray technique to enrich pools for sequencing
• Future
  – As the catalog of human genomes increases, it will be easier to calculate responses to treatment before drugs are administered
Pharmacogenetics
Gauthier et al 2007 Cancer Cell
• Defined as heritable genetic information that is not coded in the DNA bases
  – DNA methylation
  – Histone modifications
• Previous mechanisms for detecting these chromatin or DNA modifications relied on targeted probing
  – ChIP-PCR
  – Bisulfite sequencing
  – Footprinting assays
• Next generation sequencing changed everything
  – Whole genome methylation mapping (MAP-IT)
  – Whole genome histone modification and protein binding mapping (ChIP-Seq - acetylation, methylation, etc.)
• ENCODE project
Epigenetics
• International project
  – Follow up to the Human Genome Project
• Only about 2% of the human genome codes for protein
  – Creating and maintaining DNA is biochemically expensive
  – What’s the other 98% of the genome doing?
• ENCODE goals
  – Determine the functional elements of the human genome
  – Protein coding
  – Non-coding RNA
  – mRNA expression
  – Regulatory protein binding sites
  – Histone modifications
• Preliminary estimates show that 80% of human DNA is functional!
ENCyclOpedia of Dna Elements (ENCODE)
• Gene expression analysis is important for disease discovery and cancer diagnosis
• Expression analysis first relied on Northern blotting followed by DNA microarrays
  – Both cases require a probe
  – Need to “know” what you are looking for
  – Low resolution screening
• Next generation approaches screen the entire transcriptome (RNA-Seq)
  – Single base resolution of expression
  – Can see level of expression and also visualize mutations in expressed sequences
• Future
  – Important for diagnosing/treating cancer and heritable diseases
Transcriptome/Expression Analysis
• NGS generates huge datasets with 85-99.9% base accuracy
  – Must determine which signals are real, and which are noise/errors
  – Most promising hits are validated by other assays (Sanger, qRT-PCR, Mass Spec)
  – How do we determine which hits to validate?
• Currently have very small datasets, even in pharmacogenetics, that have limited utility
• Validated hits can be distractions
  – Tumor diversity presents multiple escape routes during targeted treatment
• Future
  – Require large validated datasets that are ethnically and geographically diverse
Phenotypic Correlation
See NYTimes series on whole genome sequencing: http://nyti.ms/No4fgd
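To put the 85-99.9% base accuracy range in perspective, accuracy can be converted to a Phred quality score (Q = -10 log10 of the error probability) and to an expected number of wrong bases. The function names and the 100 bp read length are assumptions for the example. Treating every base as having the same accuracy is a simplification.

```python
import math


def phred_quality(accuracy):
    """Phred score for a per-base accuracy between 0 and 1 (exclusive of 1)."""
    return -10 * math.log10(1 - accuracy)


def expected_errors(n_bases, accuracy):
    """Expected number of miscalled bases if every base has the same accuracy."""
    return n_bases * (1 - accuracy)


for accuracy in (0.85, 0.999):
    print(accuracy, phred_quality(accuracy), expected_errors(100, accuracy))
```

At 99.9% accuracy a 100 bp read is expected to carry about 0.1 errors; at 85% it carries about 15, which is why candidate hits need independent validation.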
• Used to survey macro and micro environments
  – Microbial communities (soil/gut)
  – Tumors
  – Plant communities
  – Coral reef ecosystems
• Previous techniques coupled mtDNA or ribosomal Sanger sequencing with BLAST analysis
  – Limited by number of sequenced species
  – Can determine who, but not what is going on
• NGS approaches now being used to determine exactly what organisms are present and how they interact
  – Can get expression data and link it back to community groups
  – Survey community diversity
Metagenomics
• Absolutely the largest roadblock for next generation sequencing
• Terabytes of data are useless if we can’t efficiently analyze them
• How long should data be kept?
  – Depends on application
    • Human diagnostic sequencing?
    • Research sequencing?
• Where should data be kept and processed?
  – Local or cloud (Amazon, etc.)?
  – Cost of infrastructure vs. cost of cloud service
  – Security issues
• Future
  – Cloud based solutions will become more attractive
Data