Genomics and Personalized Care in Health Systems Lecture 10. High Throughput Technologies Leming Zhou, PhD School of Health and Rehabilitation Sciences Department of Health Information Management


Post on 14-Jan-2016


Page 1: Genomics and Personalized Care in Health Systems Lecture 10. High Throughput Technologies

Genomics and Personalized Care in Health Systems

Lecture 10. High Throughput Technologies

Leming Zhou, PhD

School of Health and Rehabilitation Sciences

Department of Health Information Management

Page 2: Outline
• Polymerase Chain Reaction (PCR)
• Genome Sequencing
• Microarray
• Pathway Analysis

Page 3: Polymerase Chain Reaction (PCR)

Page 4: Polymerase Chain Reaction (PCR)
• A technique that generates a large number of copies of a particular DNA sequence from an extremely small sample
• Procedure:
  – Choose one particular sequence, the target sequence
  – Mix the sample, primers, DNA polymerase, and nucleotides to build new DNA strands
  – Apply repeated cycles of heating, cooling, and reheating to the mixture
• The number of copies of the target in the mixture grows exponentially with the number of cycles
• Primer selection is critical. Primers should be at least 15-20 bases long to ensure specificity.
• If you are unsure of the exact sequence, you can use a mixture of primers (varying at the third codon position)
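The exponential growth and the primer-mixture idea on this slide can be illustrated with a minimal sketch. The `efficiency` parameter and both function names are assumptions made for illustration; the slide itself only states that copies grow exponentially and that primers may vary at the third codon position. The IUPAC table is the standard nucleotide ambiguity code.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes, used to write a primer mixture
# as a single "degenerate" primer string.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Expected number of target copies after `cycles` PCR cycles.

    `efficiency` is the fraction of templates copied in each cycle
    (1.0 means perfect doubling). It is an illustrative assumption.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


def expand_degenerate_primer(primer):
    """List every concrete primer described by a degenerate (IUPAC) primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, "cycles:", pcr_copies(1, n), "copies from one template")
    # Asp-Ala written with a variable third codon position: GAY GCN
    print(expand_degenerate_primer("GAYGCN"))
```

With perfect doubling, a single template yields about a thousand copies after 10 cycles and about a billion after 30, which is why a trace sample is enough.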

Page 5: PCR
[Figure: double-stranded DNA target with primers. Primers are complementary to opposite ends of the target sequence.]

Page 6: PCR

Page 7: PCR Applications
• Making a lot of protein
  – Use RT-PCR (“reverse transcriptase” PCR) to create cDNA from mRNA, so that the introns are already removed, and then insert it into bacteria to clone the gene, e.g. to make proteins for X-ray crystallography.
• Medical diagnosis
  – Detect HIV nucleic acid long before AIDS symptoms arise
  – Rapid tuberculosis test
• Forensics
  – Detect trace amounts of DNA at a crime scene
• …

Page 8: Genome Sequencing

Page 9: DNA Sequencing
• The process of determining the order of the nucleotide bases along a DNA strand
• In 1977 two separate methods for sequencing DNA were developed:
  – Chain termination method (Sanger et al.)
  – Chemical degradation method (Maxam and Gilbert)
  – Both methods were equally popular to begin with but, for many reasons, the chain termination method later became the more commonly used one
• The chain termination method is based on the principle that single-stranded DNA molecules that differ in length by just a single nucleotide can be separated from one another using polyacrylamide gel electrophoresis

Page 10: Chain Termination Method
• Idea: if we know the distance of each type of base from a known origin, then it is possible to deduce the sequence of the DNA.
• For example, if we knew that there was an:
  – A at positions 2, 3, 11, 13, ...; G at positions 1, 12, ...; C at positions 6, 7, 8, 10, 15, ...; T at positions 4, 5, 9, 14, ...
  then we can reconstruct the sequence
• Obtaining this information is conceptually simple. The idea is to terminate a growing DNA chain at a known base (A, G, C or T) and at a known location in the DNA
• In practice, chain termination is caused by including a small amount of a single dideoxynucleotide in the mixture of all four normal nucleotides (e.g. dATP, dTTP, dCTP, dGTP and ddATP). The small amount of ddATP causes chain termination wherever it is incorporated into the DNA.
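The reconstruction step in the example above can be written as a short sketch. It assumes the positions listed on the slide are the complete set for the first 15 bases (the slide trails off with "..."), and the function names are illustrative only.

```python
# Positions of each base as listed on the slide (1-based, first 15 bases only).
SLIDE_POSITIONS = {
    "A": [2, 3, 11, 13],
    "G": [1, 12],
    "C": [6, 7, 8, 10, 15],
    "T": [4, 5, 9, 14],
}


def reconstruct_sequence(positions_by_base):
    """Rebuild a sequence from the 1-based positions at which each base occurs."""
    length = max(p for positions in positions_by_base.values() for p in positions)
    sequence = [None] * length
    for base, positions in positions_by_base.items():
        for p in positions:
            if sequence[p - 1] is not None:
                raise ValueError(f"position {p} assigned to two bases")
            sequence[p - 1] = base
    if None in sequence:
        raise ValueError("some positions have no base assigned")
    return "".join(sequence)


def termination_lengths(sequence, base):
    """Fragment lengths expected when the dideoxy form of `base` ends the chain.

    Each occurrence of `base` gives one possible stopping point, so the
    fragment lengths are simply the 1-based positions of that base.
    """
    return [i + 1 for i, b in enumerate(sequence) if b == base]


if __name__ == "__main__":
    seq = reconstruct_sequence(SLIDE_POSITIONS)
    print(seq)
    print("ddATP fragment lengths:", termination_lengths(seq, "A"))
```

Running one reaction per dideoxynucleotide and sorting the fragments by length on a gel gives exactly the four position lists the reconstruction needs.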

Page 11: Automatic DNA sequencing

Page 12: Whole Genome Shotgun Sequencing

Page 13: Metrics for Evaluating Sequencing Methods
• Throughput
  – Number of high quality bases per unit time
  – Difficulty of sample preparation
  – Number of independent samples run in parallel (multiplexing)
• Yield
  – Number of useful reads per sample
  – Read length
• Cost
  – Per run and per base; equipment; reagents; infrastructure; labor; analysis
• The goal of all new sequencing technologies is to increase throughput and yield while reducing cost

Page 14: Sanger Sequencing
• Radiolabeled dideoxyNTPs
• 800 bp reads
• Low throughput (several kb/gel)

Page 15: Next Generation Sequencing
• Increasing sequencing production
  – Massive parallelization
  – Reduction in per-base cost
  – Eliminate need for huge infrastructure
  – Millions of reads (>1 Gb sequence per run)
• Technologies
  – 454
  – SOLiD
  – Illumina
  – …
• Challenges
  – Read length
  – Quality
  – Data analysis

Page 16: 454
• Throughput & Yield
  – 1 million 400 bp reads per 10-hour run
  – >8 samples/run (more with barcoding)
• Cost
  – Machine: $500k; reagents ~$8k/run
• Issues
  – High indel rate in homopolymers
  – Longer reads, but fewer of them, than other systems

Page 17: Other Short Read Technologies
• Illumina
  – Sequencing by synthesis
  – 100 million 36-75 bp reads/run
  – $6500 in reagent cost/run
  – 3-6 day run time
• SOLiD
  – Sequencing by ligation
  – ~400 million 35-50 bp reads/run
  – ~$5000 in reagent cost/run
  – 3-6 day run time
• Helicos
  – Sequencing by synthesis
  – No amplification
  – 750 million reads/run
  – $18k run cost
  – 8 day run time
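The yield and cost metrics can be compared with simple arithmetic. The sketch below uses only the Illumina and SOLiD figures quoted on this slide (Helicos is left out because no read length is given); the function names and the `PLATFORMS` table are illustrative assumptions.

```python
def bases_per_run(reads, read_length_bp):
    """Total bases produced in one run."""
    return reads * read_length_bp


def reagent_cost_per_megabase(reagent_cost_usd, reads, read_length_bp):
    """Reagent cost in US dollars per million bases sequenced."""
    megabases = bases_per_run(reads, read_length_bp) / 1_000_000
    return reagent_cost_usd / megabases


# Figures as quoted on the slide; read lengths are (shortest, longest).
PLATFORMS = {
    "Illumina": {"reads": 100_000_000, "read_lengths": (36, 75), "reagent_cost": 6500},
    "SOLiD": {"reads": 400_000_000, "read_lengths": (35, 50), "reagent_cost": 5000},
}

if __name__ == "__main__":
    for name, p in PLATFORMS.items():
        for length in p["read_lengths"]:
            gb = bases_per_run(p["reads"], length) / 1e9
            cost = reagent_cost_per_megabase(p["reagent_cost"], p["reads"], length)
            print(f"{name}, {length} bp reads: {gb:.1f} Gb/run, ${cost:.2f} per Mb")
```

These numbers cover reagents only; equipment, labor, and analysis costs from the metrics slide are not included.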

Page 18: Third-Generation Sequencing
• Extremely high-throughput sequencing at very low cost
• Pacific Biosciences
  – Sequence in real time with fluorescent NTPs
  – Rate limited by processivity of polymerase
  – Very long reads (>10 kb)
  – Not well parallelized (few reads)
• Nanopore sequencing
  – Sequencing by exonuclease cleavage of native DNA
  – Bases are read as they pass through a modified nanopore
    • Base-specific change in current

Page 19: Genome Sequencing Videos
• Wash U Genome Center
  – http://www.nslc.wustl.edu/elgin/genomics/gscmaterials.html
  – Sanger Technology Tour Videos
  – Next Generation Technology Tour Videos: http://gep.wustl.edu/curriculum/course_materials_WU/introduction_to_genomics/nextgen_video_tour
• Other videos
  – PCR: http://www.youtube.com/watch?v=eEcy9k_KsDI
  – Sanger: http://www.youtube.com/watch?v=aPN8LP4YxPo
  – SOLiD: http://www.youtube.com/watch?v=nlvyF8bFDwM
  – Solexa: http://www.youtube.com/watch?v=77r5p8IBwJk
  – Helicos: http://www.youtube.com/watch?v=TboL7wODBj4

Page 20: DNA Sequence Assembly

Page 21: Outline
• Basic concepts in sequence assembly
  – Whole-genome shotgun methods
• Sources of error in assemblies
  – Repeats
  – Polymorphism
  – Sequencing errors
• Alignment and assembly of next-generation sequencing data
  – Tiling reads onto a reference vs. de novo assemblies

Page 22: Whole Genome Assembly
• Multiple copies of the genome are broken into pieces
• Both ends of every piece are read.
• The length (and orientation) of each piece form constraints.
• Reads: 500-1000 bp
• Quality array for each position.
• Reconstruct the genome from reads and constraints.
• Issues: both ends of a read are usually low quality, chimeric reads, repetitive regions.

Page 23: DNA Sequencing Data Set
• Millions of reads, some of them low quality
• Millions of constraints, such as paired ends and quality values
• After removing repeats, if two reads overlap by a large enough amount, merge them
• A contig is an ordered and oriented list of overlapping reads.
• A scaffold is an ordered and oriented list of contigs.
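The "merge reads that overlap enough" rule can be sketched as a greedy overlap merge. This is a toy illustration under strong assumptions: reads are error-free, all come from the same strand, no read is contained inside another, and there are no repeats. The `min_overlap` parameter and function names are hypothetical, and real assemblers also use quality values and paired-end constraints.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if the longest such overlap is shorter than `min_overlap`.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if size > 0 and left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the list of contigs left when no pair overlaps by at least
    `min_overlap` bases.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["GAATTCC", "TTCCCTCA", "CTCAGATC"]
    print(greedy_assemble(reads))
```

Repeats break this greedy rule because two unrelated reads can share a long overlap, which is one reason the slide mentions removing repeats before merging.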

Page 24: Scaffolds

Page 25: Sequence Assembly: Basic Approach
• Generate reads
• Find overlapping reads
• Assemble reads into contigs
• Join contigs into scaffolds using mate pairs
• Join scaffolds into “finished” sequence

Page 26: Alignment and Assembly with Short Reads
• Map to reference genome
  – Many tools
• De novo assembly
  – Much harder
  – Reference-guided assembly (MOSAIK)
  – “True” de novo assembly (Velvet)
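Mapping short reads to a reference can be illustrated with a toy seed-and-extend sketch. This is not how MOSAIK or any other named tool works internally; it is only a minimal illustration that assumes substitutions only (no indels), a seed taken from the start of the read, and hypothetical parameter names `k` and `max_mismatches`.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return 0-based reference positions where `read` aligns.

    The first k bases of the read are looked up exactly (the seed); the rest
    of the read is then compared base by base, allowing substitutions only.
    """
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(start)
    return sorted(hits)


if __name__ == "__main__":
    reference = "GAATTCCCTCAGATC"
    index = build_index(reference, k=4)
    print(map_read("CCCTCA", reference, index, k=4))  # exact match
    print(map_read("CCCTGA", reference, index, k=4))  # one substitution
```

A read that maps to several positions signals a repeat, which is the same problem that makes de novo assembly much harder.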

Page 27: Many DNA Assembly Systems
• PHRAP
• CAP
• Euler
• Celera Assembler
• Arachne
• LSA

Page 28: Microarray Technique

Page 29: Microarrays
• Used to study gene expression levels in cells.
• Cells can differ dramatically in the amounts of various proteins that they synthesize, e.g. due to different cell types or different external/internal conditions.
• In fact, in higher organisms only a fraction of the genes in a cell are expressed at a given time, and that subset depends on the cell type.
• With microarrays it is possible to study the expression levels of tens of thousands of genes simultaneously.

Page 30: Microarray Technology
• A microarray is a glass slide with spots of DNA on it; each spot is a probe (or target). Thousands of probes can fit on a single slide. The slides can be spotted by robots.
• The DNA is single-stranded cDNA and may consist of an entire gene or part of one
• If the microarray is exposed to a solution containing mRNA, then the mRNA molecules will bind to those probes to which they are complementary
• The genes you can study with a microarray depend on the collection of probes on it.
• There are a number of commercial manufacturers, e.g. Affymetrix

Page 31: Microarray Probes
Single-stranded cDNA sequences

Page 32: Microarray Experiments
• Start with two cell types, e.g. “healthy” and “diseased”.
• Isolate mRNA from each cell type and generate cDNA with fluorescent dyes attached, e.g. green for healthy and red for diseased.
• Mix the cDNA samples and incubate with the microarray.
• After incubation the cDNA in the samples has had a chance to bind (hybridize) with the probes on the chip.
• The chip is read by a scanner that uses lasers to excite the fluorescent tags; the intensity levels of the dyes are recorded for each probe gene and stored in a computer.

Page 33: The Colors of a Microarray
• Green: control DNA, where either DNA or cDNA derived from normal tissue is hybridized to the target DNA
• Red: sample DNA, where either DNA or cDNA derived from diseased tissue is hybridized to the target DNA
• Yellow: a combination of control and sample DNA, where both hybridized equally to the target DNA
• Black: areas where neither the control nor the sample DNA hybridized to the target DNA
• The location and intensity of a color can tell us whether the gene, or mutation, is present in the control DNA, the sample DNA, or both
• It may also provide an estimate of the expression level of the gene(s) in the sample and control DNA
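The color rules above can be sketched as a small classifier on the two channel intensities of one spot. The `background` and `fold` thresholds are assumptions chosen only for illustration; the slides do not give numeric cutoffs.

```python
import math


def spot_color(green, red, background=100.0, fold=2.0):
    """Classify one spot from its green (control) and red (sample) intensities.

    `background`: intensity below which a channel counts as "no hybridization".
    `fold`: ratio at or above which one channel is said to dominate.
    Both are hypothetical values, not taken from the lecture.
    """
    if green < background and red < background:
        return "black"   # neither control nor sample hybridized
    log_ratio = math.log2((red + 1.0) / (green + 1.0))  # +1 avoids dividing by zero
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "red"     # sample (diseased) dominates
    if log_ratio <= -threshold:
        return "green"   # control (normal) dominates
    return "yellow"      # both hybridized about equally


if __name__ == "__main__":
    for g, r in [(50, 40), (200, 1600), (1600, 200), (1000, 1100)]:
        print(g, r, spot_color(g, r))
```

The log ratio is the quantity usually carried forward into analysis, because a two-fold increase and a two-fold decrease become symmetric values (+1 and -1).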

Page 34: Microarray Data Representation
• Microarray data is often arranged in an n x m matrix M, with rows for n genes and columns for m biological samples in which gene expression has been monitored.
  – m_ij is the expression level of gene i in sample j.
  – A row e_i is the gene expression pattern of gene i over all the samples.
  – A column s_j is the expression level of all genes in sample j and is called the sample expression pattern
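A minimal sketch of this representation, using plain nested lists. The gene names, sample names, and expression values are made-up placeholders for illustration, not real data, and Python indices are 0-based whereas the slide's i and j are 1-based.

```python
# Hypothetical labels and values, for illustration only.
genes = ["geneA", "geneB", "geneC"]           # n = 3 rows
samples = ["normal1", "normal2", "tumor1", "tumor2"]  # m = 4 columns

M = [
    [2.1, 2.0, 5.9, 6.2],   # expression of geneA in each sample
    [7.4, 7.1, 7.3, 7.6],   # geneB
    [4.0, 4.2, 1.1, 0.9],   # geneC
]


def gene_pattern(matrix, i):
    """Row e_i: expression pattern of gene i over all samples."""
    return list(matrix[i])


def sample_pattern(matrix, j):
    """Column s_j: expression level of all genes in sample j."""
    return [row[j] for row in matrix]


if __name__ == "__main__":
    print(genes[0], gene_pattern(M, 0))
    print(samples[2], sample_pattern(M, 2))
    print("m_ij for gene 2, sample 3 (1-based):", M[1][2])
```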

Page 35: Microarray Data Analysis
• Gene chips allow the simultaneous monitoring of the expression level of thousands of genes. Many statistical and computational methods are used to analyze this data
  – Statistical hypothesis tests for differential expression analysis
  – Principal component analysis and other methods for visualizing high-dimensional microarray data
  – Cluster analysis for grouping together genes or samples with similar expression patterns
    • Different clustering algorithms may be used, e.g. hierarchical with different metrics, or k-means, k-medians.
  – Hidden Markov models, neural networks and other classifiers for predictively classifying sample expression patterns as one of several types (e.g. diseased vs. normal)

Page 36: What Do We Use Microarray Data For?
• Genes with unusual expression levels in a sample
• Genes whose expression levels vary across samples
  – This can be used to compare normal and diseased tissues, or diseased tissue before and after treatment.
• Samples that have similar expression patterns
  – This can also be used to compare normal and diseased tissues, or diseased tissue before and after treatment.
• Tissues that might be diseased
  – We can take the gene expression pattern of a sample and compare it to library expression patterns that indicate diseased or non-diseased tissue.

Page 37: Statistical Methods Can Help
• Data Pre-processing
  – Normalization: rescaling data from different microarrays so that they can be compared
  – Centering: subtracting the mean and dividing by the standard deviation.
• Data Visualization
  – Principal component analysis and multidimensional scaling are two useful techniques for reducing multidimensional data to two or three dimensions, which allows us to visualize it.
• Cluster Analysis
  – By associating genes with similar expression patterns, we might be able to draw conclusions about their functional expression.
• Statistical Inference
  – This is the formulation and statistical testing of a hypothesis and an alternative hypothesis.
• Classifiers for the Data
  – We can construct classes from data, such as diseased vs. non-diseased tissue. We can build a model that fits known data for the different classes. This can then be used to classify previously unclassified data.
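Two of the steps listed above, centering and a hypothesis test for differential expression, can be sketched with the standard library. Welch's t statistic is used here as one common choice; the slides do not say which test the course uses, and the function names are illustrative.

```python
import math
import statistics


def standardize(values):
    """Center a gene's values: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0:
        raise ValueError("cannot standardize a constant expression pattern")
    return [(v - mean) / sd for v in values]


def welch_t_statistic(group_a, group_b):
    """Welch's t statistic for a difference in mean expression between two groups.

    A large absolute value suggests differential expression; turning it into a
    p-value additionally needs the t distribution, which is not shown here.
    """
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    if standard_error == 0:
        raise ValueError("both groups are constant")
    return (mean_a - mean_b) / standard_error


if __name__ == "__main__":
    print(standardize([1.0, 2.0, 3.0]))
    print(welch_t_statistic([4.0, 5.0, 6.0], [1.0, 2.0, 3.0]))
```

Because thousands of genes are tested at once, per-gene results normally need a multiple-testing correction before they are interpreted.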

Page 38: Measuring Dissimilarity of Expression Data
• We might want to compare two or more gene or sample expression patterns
• This might be used to differentiate between diseased and normal cells or to find out the genetic similarity of tissues.
• To do this we need a distance metric or a dissimilarity measure.

Page 39: Example Distance Metric
• Euclidean distance: this is the most common distance measure.
• It should not be used if either
  – Not all components of the vectors being compared have equal weight.
  – There is missing data.
• Preprocessing the data can often alleviate these problems.
• We can also use the normalized Euclidean distance
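Both distances can be written in a few lines. The slide does not define "normalized Euclidean distance"; the sketch assumes the common definition in which each squared difference is divided by the variance of that component, so that components on different scales contribute comparably.

```python
import math


def euclidean(x, y):
    """Ordinary Euclidean distance between two expression patterns."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def normalized_euclidean(x, y, variances):
    """Euclidean distance with each component scaled by its variance.

    `variances[i]` is the variance of component i across the data set
    (assumed definition; see the lead-in above).
    """
    if not (len(x) == len(y) == len(variances)):
        raise ValueError("patterns and variances must have the same length")
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))
    print(normalized_euclidean([0, 0], [3, 4], variances=[9, 16]))
```

Neither function handles missing values, which matches the warning on the slide: missing data has to be imputed or dropped during preprocessing first.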

Page 40: Cluster Analysis of Microarray Data
• Hierarchical clustering: start with each data point in its own singleton cluster.
  – Find the two clusters that are closest together. Combine these to form a new cluster.
  – Compute the distance from all other clusters to the new cluster using some form of averaging.
  – Find the two closest clusters and repeat.
• K-means clustering: partitions the data into k clusters and finds the mean of each cluster.
  – Usually, the number of clusters k is fixed in advance. To choose k, something must be known about the data. There might be a range of possible k values.
  – To decide which is best, optimize a quantity that maximizes cluster tightness, i.e. minimizes distances between points in a cluster
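Both procedures can be sketched in plain Python. The hierarchical version uses average linkage as its "form of averaging", and the k-means version starts from the first k points so that the result is reproducible; both choices are assumptions made for illustration, since the slide does not fix them.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(points, num_clusters=1):
    """Agglomerative clustering; returns (clusters, merge history) as index lists."""
    clusters = [[i] for i in range(len(points))]   # every point starts alone
    merges = []
    while len(clusters) > max(num_clusters, 1):
        # Find the two closest clusters ...
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        # ... and combine them into a new cluster.
        merged = sorted(clusters[a] + clusters[b])
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges


def k_means(points, k, iterations=100):
    """Basic k-means; returns (cluster index per point, cluster means)."""
    centers = [list(p) for p in points[:k]]        # naive start: first k points
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break                                  # converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                            # keep old center if cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


if __name__ == "__main__":
    pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
    print(hierarchical_clustering(pts, num_clusters=2))
    print(k_means(pts, k=2))
```

To compare candidate values of k, one could compute the sum of squared distances from each point to its cluster mean for each k and look for the point where further increases stop helping.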

Page 41: Challenges in Microarray Analysis
• Different platforms
  – Illumina, Affymetrix, Agilent, …
• Many file types, many data formats
• Need to learn the platform-dependent methods and software required
• Analysis
  – How to get started?
  – Which methods? Which software?
    • Many freely available tools. Some commercial
  – How to interpret results

Page 42: Public Databases
• Many sources for public data: labs, consortia, government
• Publications require that data files, including raw files, be made public
• GEO
  – http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress
  – http://www.ebi.ac.uk/arrayexpress/#ae-main[0]

Page 43: Data Analysis
• Class discovery
• Class comparison
• Class prediction
• Biological annotation
• Pathway analysis

Page 44: Hierarchical Clustering
• Eisen Cluster and Treeview
  – http://rana.lbl.gov/EisenSoftware.htm
• Import data
• Filter
  – To filter or not to filter, %P calls, SD, etc.
• Adjust data
  – Log transform, center, normalize
• Clustering
  – Cluster arrays or genes
  – Computationally intensive
  – Choose distance metric
• .cdt file created
  – Open with Treeview
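The filter and adjust steps of this workflow can be sketched independently of the Cluster program. The code below does not reproduce that program's behavior or file formats; it only illustrates filtering genes by standard deviation, log-transforming, and centering each gene, with a hypothetical `min_sd` cutoff and made-up values.

```python
import math
import statistics


def filter_by_sd(matrix, gene_names, min_sd):
    """Keep only genes whose expression varies enough across arrays."""
    kept = [
        (name, row) for name, row in zip(gene_names, matrix)
        if statistics.stdev(row) >= min_sd
    ]
    return [name for name, _ in kept], [row for _, row in kept]


def log2_transform(matrix):
    """Log-transform positive intensities so fold changes become differences."""
    return [[math.log2(v) for v in row] for row in matrix]


def center_rows(matrix):
    """Subtract each gene's mean so every row is centered on zero."""
    centered = []
    for row in matrix:
        mean = statistics.mean(row)
        centered.append([v - mean for v in row])
    return centered


if __name__ == "__main__":
    # Made-up intensities: one varying gene and one flat gene.
    names = ["varying_gene", "flat_gene"]
    raw = [[1.0, 2.0, 4.0, 8.0], [4.0, 4.0, 4.0, 4.0]]
    names, kept = filter_by_sd(raw, names, min_sd=1.0)
    print(names, center_rows(log2_transform(kept)))
```

The adjusted matrix is what a distance metric and clustering step would then operate on.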

Page 45: Cluster from Microarray Data

Page 46: Experimental Design
• Sample size
  – How many samples in test and control
• Replicates
  – Technical vs. biological
    • Biological replicates are more important for more heterogeneous samples
    • Need replicates for statistical analysis
• All experimental steps from sample acquisition to hybridization
  – Microarray experiments are very expensive, so plan experiments carefully

Page 47: Video on YouTube
• DNA Microarray
  – http://www.youtube.com/watch?v=VNsThMNjKhM&

Page 48: Pathway Analysis

Page 49: KEGG
• Kyoto Encyclopedia of Genes and Genomes (KEGG): http://www.genome.jp/kegg/pathway.html

Page 50: Biological Pathways
http://www.sabiosciences.com/

Page 51: Microarray Data Analysis
• Statistical packages
• Literature findings
• Tools: Ingenuity IPA, GeneGO Metacore, BioBase ExPlain
• Biology

Page 52: Microarray Processed Data

Page 53: Ingenuity IPA
• Search and Explore
  – Genes, proteins, diseases and chemicals
  – Connect genes
  – Build pathways
  – Explore pathways
• Analyze dataset
  – Interpret high-throughput data in the context of biological processes, pathways and networks

Page 54: IPA Analysis

Page 55: Interaction Network Maps

Page 56: IPA Analysis

Page 57: Pathway Map: p53