
BCB 444/544 F07, ISU, Dobbs. Lecture #37: Clustering (11/26/07)

Slide 1: BCB 444/544 Lecture 37
Brief Review: Microarrays
Clustering & Classification Algorithms
#37_Nov26
Thanks to: Doina Caragea, KSU; Dan Nettleton, ISU

Slide 2: Required Reading (before lecture)
Mon Nov 26 - Lecture 37: Clustering & Classification Algorithms (Chp 18 Functional Genomics)
Wed Nov 28 - Lecture 38: Proteomics & Protein Interactions (Chp 19 Proteomics)
Thurs Nov 29 - Lab 12: R Statistical Computing & Graphics (Garrett Dancik)
Fri Nov 30 - Lecture 39: Systems Biology (& a bit of Metabolomics & Synthetic Biology)

Slide 3: Assignments & Announcements
Mon Nov 26 - HW#6 Due (sometime before 5 PM Mon Nov 26)
Mon Dec 3 - BCB 544 Project Reports Due (but no class!)
ALL BCB 444 & 544 students are REQUIRED to attend ALL project presentations next week!!!
Tentative Schedule:
- Wed Dec 5: #1: Xiong & Devin (~20'); #2: Tonia (10-15')
- Fri Dec 7: #3: Kendra & Drew (~20'); #4: Addie (10-15')
Thurs Dec 6 - Optional Review Session for Final Exam
Mon Dec 10 - BCB 444/544 Final Exam (9: :45AM)
Will include:
- 40 pts In Class: New material (since Exam 2)
- 20 pts In Class: Comprehensive
- 40 pts In Lab Practical (Comprehensive)

Slide 4: Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
Nov 29 Thurs - Baker Center Seminar, 2:10 in Howe Hall Auditorium. Greg Voth, Univ. of Utah: Multiscale Challenge for Biomolecular Systems: A Systematic Approach
Nov 29 Thurs - BBMB Seminar, 4:10 in 1414 MBB. Sue Gibson, Univ. of Minnesota: How do soluble sugar levels help regulate plant development, carbon partitioning and gene expression?
Nov 30 Fri - BCB Faculty Seminar, 2:10 in 102 ScI. Shashi Gadia, ComS, ISU: Harnessing the Potential of XML
Nov 30 Fri - GDCB Seminar, 4:10 in 1414 MBB. John Abrams, Univ Texas Southwestern Medical Center: Dying Like Flies: Programmed & Unprogrammed Cell Death

Slide 5: Chp 18 Functional Genomics
SECTION V GENOMICS & PROTEOMICS
Xiong: Chp 18 Functional Genomics
- Sequence-based Approaches
- Microarray-based Approaches
- Comparison of SAGE & DNA Microarrays

Slide 6: Transcriptome Analysis
Transcriptome = complete collection of all RNAs in a cell at a given time
High-throughput analysis of RNA expression:
- Microarrays ("Gene Chips"): most popular
- Other related methods: SAGE = Serial Analysis of Gene Expression; MPSS = Massively Parallel Signature Sequencing

Slide 7: Microarray Analysis
Which RNAs are detected?
- mRNAs (& pre-RNAs)
- alternatively spliced mRNAs
- rRNAs, tRNAs
- miRNAs, siRNAs, other regulatory RNAs
2 Major Types of DNA Microarrays:
- cDNA = "spotted" = low density, glass slides = Southern blot on a slide
- oligo = "DNA chip" = high density, photolithography; "Affy" chip; computationally designed
Both types can be made here, in ISU facilities

Slide 8: "Guilt by Association" - Similar expression patterns suggest potential functions for novel proteins (Copyright 2006 A. Malcolm Campbell)
TF is induced 2X & is known to activate genes G1 and G2, both of which are induced 6X. G3 is induced 6X, too. Is it regulated by TF?
Clustering of gene expression patterns (with known genes) suggests potential functions for unknown genes - additional experiments are required to test these hypothesized functions.

Slide 9: Gene Expression Pattern Clusters: for several thousand genes!! (Copyright 2006 A. Malcolm Campbell)
- Each row represents a different gene
- Each column represents a different time point
- Green indicates repression (decrease in RNA)
- Red indicates induction (increase in RNA)
Genes have been clustered so they are near other genes with similar expression patterns. Notice that the genes at the bottom were repressed for the first few time points.

Slide 10: ISU Microarray Researchers & Facilities
Microarray Facilities:
- Center for Plant Genomics (ISU PSI) - Pat Schnable, in Carver Co-Lab
- GeneChip Facility (ISU Biotech & PSI) - Steve Whitham, in MBB
Research Labs:
- Pat Schnable (Agron/GDCB) - Facilities for cDNA microarrays
- Steve Whitham (PlPath) - Facilities for oligo microarrays
Google "microarrays" from ISU website>>> Lots more:
- Jo Anne Powell-Coffman, GDCB: genes induced under oxidative stress
- Roger Wise, Rico Caldo, Plant Pathology: interaction between multiple isolates of powdery mildew and multiple genotypes of barley
- Chris Tuggle, Animal Science: genes controlling mammalian embryo development

Slide 11: ISU Microarray Design & Analysis
Experimental Design is critical
ISU Course: Stat 416/516X (Nettleton), Statistical Design & Analysis of Microarray Experiments
- Dan Nettleton (Stat) - Experimental design & statistical analyses
- Hui-Hsien Chou (Com S) - "Picky" software for designing oligos
- Di Cook (Stat) - "exploRase" software for high-dimensional data analysis & visualization for systems biology
Tools from Statistics & Machine Learning are needed
ISU Experts: Dan Nettleton & Di Cook, Stat; Vasant Honavar, Com S
- Statistics: ANOVA (Analysis of Variance); R Statistics package
- ML: Clustering & Classification Algorithms; WEKA package
- GEPAS
Many additional resources & tools available online
ISU has several Microarray Analysis Suites

Slide 12: Gene Expression Analysis (Doina Caragea)

Slide 13: Microarray Analysis - Questions (Doina Caragea)
- How do hierarchical clustering algorithms work?
- How do we measure the distance between two clusters? (similarity criteria)
- What are good clusters?

Slide 14: Data Analysis Considerations (Doina Caragea)
- Normalization
- Combining results from replicates
- Identifying differentially expressed genes
- Dealing with missing values
- Static vs. time series

Slide 15: Pattern Recognition in Microarray Analysis (Doina Caragea)
Clustering (unsupervised learning): uses primary data to group measurements, with no information from other sources
Classification (supervised learning): uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Slide 16: Two Views of same Microarray Experiment (Doina Caragea)
Data points are genes
- Represented by expression levels across different samples/experiments/conditions (i.e., features = samples)
- Goal: categorize genes
Data points are samples (e.g., patients)
- Represented by expression levels of different genes (i.e., features = genes)
- Goal: categorize samples

Slide 17: Two Ways to View Microarray Data (Doina Caragea)

Slide 18: Data Points are Genes (Doina Caragea)

Slide 19: Data Points are Samples (Doina Caragea)

Slide 20: Clustering: Unsupervised Learning Task 1 (Doina Caragea)
Given: a set of microarray results in which gene expression levels are measured under different experimental conditions
Do: Cluster the genes, where a gene is described by its expression levels under different conditions
Outcome: Groups genes into clusters, where expression of all members of a cluster tends to go up or down together

Slide 21: Example: Groups of Genes are Clustered (Doina Caragea)
[Figure: heat map; axes: Genes, Experiments (Samples); Green = up-regulated, Red = down-regulated]

Slide 22: Visualizing Expression Patterns for Different Clusters (Doina Caragea)
[Figure: two panels, Gene Cluster 1 (size=20) and Gene Cluster 2 (size=43); x-axis: Time (10-minute intervals); y-axis: Normalized expression; from Sharan & Shamir, 2000]

Slide 23: Clustering: Unsupervised Learning Task 2 (Doina Caragea)
Given: a set of microarray results in which experimental samples correspond to different patients
Do: Cluster the experiments
Outcome: Groups samples according to similarities in gene expression profiles

Slide 24: Examples (Doina Caragea)
- Cluster samples from mice subjected to a variety of toxic compounds
- Cluster samples from cancer patients to discover different subtypes of a cancer
- Cluster samples taken at different timepoints

Slide 25: Supervision: Add Class Values (Doina Caragea)

Slide 26: Classification: Supervised Learning Task (Doina Caragea)
Given: a set of microarray experiments, each done with mRNA from a different patient (but from same cell type from every patient). Patient's expression values for each gene constitute the features, and patient's disease constitutes the class
Do: Learn a model that accurately predicts class based on features
Outcome: Predict class value of a patient based on expression levels of his/her genes

Slide 27: Methods for Clustering (Doina Caragea)
- Hierarchical Clustering
- K-Means
- Self Organizing Maps (in lab, won't discuss in lecture)
- many others.

Slide 28: Clustering Metrics (Doina Caragea)
A key issue in clustering is determining which similarity / distance metric to use.
Often, the metric has a bigger effect on the results than the actual clustering algorithm used!
When choosing the metric, we should take into account our assumptions about the data and the goal of the clustering.

Slide 29: Distance Metrics for 2 n-Dimensional Vectors (e.g., for a series of expression measurements) (Doina Caragea)
- Euclidean distance
- Correlation coefficient, where E(x) is the expected value of x
- Other metrics are also used

Slide 30: Measuring Quality of Clusters (Doina Caragea)
- Compare INTRA-cluster distances with INTER-cluster distances. Good clusters should show a big difference.
- Compare computed clusters with known clusters (if there are any) to see how closely they match. Good clusters will contain all known members and no wrong members.

Slide 31: INTRA- vs INTER-Cluster Distances (Doina Caragea)
[Figure: two panels labeled "Good!" and "Bad!"]

Slide 32: How to Determine Distances? (Doina Caragea)
Intra-cluster distance: Min/Max/Avg of the distance between
- all pairs of points in the cluster, OR
- the centroid and all points in the cluster
Inter-cluster distance:
- Single link: distance between the two most similar members
- Complete link: distance between the two least similar members
- Average link: average distance of all pairs
- Centroid distance
What is the centroid? The "average" of all points of X. The centroid of a finite set of points can be computed as the arithmetic mean of each coordinate of the points. (Wikipedia)
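
The metrics and linkage criteria named on the slides above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names (`euclidean`, `pearson`, `centroid`, `linkage_distance`) and the sample points are assumptions chosen for the example, and the correlation is the standard Pearson sample formula.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient (undefined if either vector is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def centroid(points):
    """Arithmetic mean of each coordinate over a list of points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]

def linkage_distance(cluster_a, cluster_b, method="average", dist=euclidean):
    """Inter-cluster distance under the single, complete, average or centroid criterion."""
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":      # two most similar (closest) members
        return min(pairwise)
    if method == "complete":    # two least similar (farthest) members
        return max(pairwise)
    if method == "average":     # average over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    if method == "centroid":    # distance between the two centroids
        return dist(centroid(cluster_a), centroid(cluster_b))
    raise ValueError(f"unknown method: {method}")

# Hypothetical 2-D points, used only to exercise the functions.
cluster_a = [[0, 0], [0, 2]]
cluster_b = [[3, 0], [3, 2]]
for method in ("single", "complete", "average", "centroid"):
    print(method, linkage_distance(cluster_a, cluster_b, method))
```

Passing a different `dist` function (for example `lambda x, y: 1 - pearson(x, y)`) is one common way to turn a correlation into a dissimilarity; that choice is also an assumption here, not something the slides specify.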

Slide 33: Similarity Criterion: Single Link (Doina Caragea)
Cluster similarity = similarity of the two most similar members
Potentially long and skinny clusters

Slide 34: Similarity Criterion: Complete Link (Doina Caragea)
Cluster similarity = similarity of the two least similar members
Tight clusters

Slide 35: Similarity Criterion: Average Link (Doina Caragea)
Cluster similarity = average similarity of all pairs
This is perhaps the most widely used similarity criterion

Slide 36: Hierarchical Clustering* (Doina Caragea)
Probably the most popular clustering algorithm for microarray analysis
First presented in this context by Eisen et al. in 1998
Nodes = genes or groups of genes
Agglomerative (bottom up):
0. Initially each item is a cluster
1. Compute distance matrix
2. Find two closest nodes (most similar clusters)
3. Merge them
4. Compute distances from merged node to all others
5. Repeat until all nodes are merged into a single node
*This method was illustrated in Lecture 36, Tables 6.1-MM6.4

Slide 37: Hierarchical Clustering Example: Using Single Link Criterion to Iteratively Combine Data Points (Doina Caragea)

Slide 38: (Copyright: Russ Altman)

Slide 39: Hierarchical Clustering: Strengths & Weaknesses (Doina Caragea)
- Easy to understand & implement
- Can decide how big to make clusters by choosing cut level of hierarchy
- Can be sensitive to bad data
- Can have problems interpreting tree
- Can have local minima
Bottom-up is the most commonly used method
Can also perform top-down, which requires splitting a large group successively

Slide 40: K-Means Clustering (Model-based) (Doina Caragea)
Computationally attractive!
1. Choose random points (cluster centers or centroids) in k dimensions
2. Compute distance from each data point to centroids
3. Assign each data point to closest centroid
4. Compute new cluster centroid as average of points assigned to cluster
5. Loop to (2); stop when cluster centroids do not move very much
[Figure: K = 2 with two features, f1 (x-coordinate) & f2 (y-coordinate); labels: Initial Centroid A, Initial Centroid B, 2nd Centroid A, 2nd Centroid B]

Slide 41: K-Means Clustering Example, for k=2 (Doina Caragea)
For simplicity, assume k=2 & objects are 1-dimensional (numerical difference is used as distance)
Steps in K-means clustering:
0. Objects: 1, 2, 5, 6, 7
1. Randomly select 5 and 6 as centers (centroids)
2. Calculate distance from points to centroids & assign points to clusters: {1,2,5} & {6,7}
3. Compute new cluster centroids: C1 = 8/3 = 2.7, C2 = 13/2 = 6.5
4. Calculate distance from points to new centroids & assign data points to new clusters: {1,2} & {5,6,7}
5. Compute new cluster centroids: C1 = 1.5, C2 = 6
No change? Converged! => Final clusters = {1,2} & {5,6,7}

Slide 42: K Means Clustering for k=2: A more realistic example (From S. Mooney)
[Figure sequence: Pick seeds -> Assign clusters -> Compute centroids -> Re-assign clusters -> Compute centroids -> Re-assign clusters -> Converged!]

Slide 43: K-Means Clustering: Strengths & Weaknesses (Doina Caragea)
- Fast, O(N)
- Hard to know which K to choose: try several and assess cluster quality
- Hard to know where to seed the clusters
- Results can change drastically with different initial choices for centroids, as shown in the example
Example Illustrating Sensitivity to Seeds [Figure: points A-F]
In the above, if we start with B and E as centroids, K-means will converge to {A,B,C} and {D,E,F}. If we start with D and F, it will converge to {A,B,D,E} and {C,F}.

Slide 44: Choice of K? (Doina Caragea)
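
One way to explore this question, following the earlier advice to try several values of K and assess cluster quality, is sketched below in Python. It is a minimal sketch, not the lecture's code: `kmeans_1d` and `within_cluster_ss` are hypothetical helper names, the seeds are fixed rather than random so the run is repeatable, and the within-cluster sum of squares is just one possible quality score. The data are the five 1-dimensional objects from the k=2 example.

```python
def kmeans_1d(points, seeds, max_iter=100):
    """K-means on 1-D data; distance is the absolute numerical difference."""
    centroids = list(seeds)
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its closest centroid (ties go to the first one).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its points (keep it if the cluster is empty).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change: converged
            break
        centroids = new_centroids
    return clusters, centroids

def within_cluster_ss(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum((p - c) ** 2 for cluster, c in zip(clusters, centroids) for p in cluster)

objects = [1, 2, 5, 6, 7]

# The worked example: seeds 5 and 6 converge to {1,2} & {5,6,7}.
clusters, centroids = kmeans_1d(objects, seeds=[5, 6])
print(clusters, centroids)

# Try several K (seed choices here are arbitrary assumptions) and compare scores.
for seeds in ([1], [5, 6], [1, 5, 7]):
    cl, ce = kmeans_1d(objects, seeds)
    print(len(seeds), cl, within_cluster_ss(cl, ce))
```

The score always shrinks as K grows, so it cannot pick K by itself; it only shows where adding another cluster stops helping much.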
Helpful to have additional information to aid evaluation of clusters

Slide 45: Hierarchical Clustering vs K-Means (Doina Caragea)
- Running time: Hierarchical Clustering is slower; K-Means is faster
- Assumptions: requires distance metric
- Parameters: Hierarchical Clustering: none; K-Means: K (number of clusters)
- Clusters: Hierarchical Clustering: subjective (only a tree is returned); K-Means: exactly K clusters

Slide 46: Clustering vs Classification (Doina Caragea)
Clustering (unsupervised learning): uses primary data to group measurements, with no information from other sources
Classification (supervised learning): uses known groups of interest (from other sources) to learn features associated with these groups in primary data and create rules for associating data with groups of interest

Slide 47: Compare in Graphical Representation (Doina Caragea)
[Figure: two panels, Classification and Clustering]
Apply external labels: RED group & BLUE group

Slide 48: Tradeoffs (Doina Caragea)
Clustering is not biased by previous knowledge, but therefore needs a stronger signal to discover clusters
Classification uses previous knowledge, so it can detect a weaker signal, but may be biased by WRONG previous knowledge

Slide 49: Methods for Classification (Doina Caragea)
- K-nearest neighbors
- Linear Models
- Logistic Regression
- Naive Bayes
- Decision Trees
- Support Vector Machines

Slide 50: K-Nearest Neighbor (KNN) (Doina Caragea)
Idea: Use k closest neighbors to label new data points (e.g., for k = 4)

Slide 51: Basic KNN Algorithm (Doina Caragea)
INPUT:
- Set of data with labels (training data)
- K
- Set of data needing labels
- Distance metric
1. For each unlabeled data point, compute distance to all labeled data
2. Sort distances, determine closest K neighbors (smallest distances)
3. Use majority voting to predict label of unlabeled data point.

Slide 52: SLIDES FOLLOWING THIS ONE WERE NOT SHOWN IN LECTURE
Some of this is material I discussed or wrote on blackboard
It is provided here for your information & for future reference
It will not be covered on the Final Exam!

Slide 53: Microarray Technology
Details re: 2 types of arrays
- cDNA Slides
- Short-oligonucleotide Chips
A few words about microarray terminology:
- "Probes" refers to cDNAs or DNA oligos attached to slide or chip
- "Target" refers to labeled mRNA or cRNA in solution, which is hybridized to probes attached to slide or chip
Note: this is the opposite of the terminology used in discussing Southern blots, etc., in which the target is DNA attached to a solid matrix & the probe is labeled RNA or cDNA in solution, which is hybridized to targets attached to the matrix

Slide 54: cDNA Microarrays (Dan Nettleton, ISU Statistics 416/516X)
Glass slides or similar supports containing cDNA sequences that serve as probes for measuring mRNA levels in target samples
cDNAs are arrayed on each slide in a grid of spots.
Each spot contains thousands of copies of a sequence that matches a segment of a gene's coding sequence.
A sequence and its complement are present in the same spot.
Different spots typically represent different genes, but some genes may be represented by multiple spots

Slide 55: cDNA Microarray Probes (Dan Nettleton, ISU Statistics 416/516X)
Expressed Sequence Tags (ESTs) commonly serve as probes on cDNA microarrays.
ESTs are small pieces of cDNA sequence (usually 200 to 500 nucleotides long) that have been reverse-transcribed from mRNA
[Figure: mRNA (AAAAAAAAA...A), cDNA (TTTTTTTTTT...T), EST]

Slide 56: (Dan Nettleton, ISU Statistics 416/516X)
[Figure: cDNA microarray slide 1 and cDNA microarray slide 2, with example probe sequences TTCCAG and GATATG]
Each spot contains many copies of a sequence along with its complement (not shown).
[Figure labels on both slides: spot for gene 201, spot for gene 576; sequences TTCCAG, GATATG]

Slide 57: Spotting cDNA Probes on Microarrays (Dan Nettleton, ISU Statistics 416/516X)
- Solutions containing probes are transferred from a plate to a microarray slide by a robotic arrayer.
- The robot picks up a small amount of solution containing a probe by dipping a pin into a well on a plate.
- The robot then deposits a small drop of the solution on the microarray slide by touching the pin onto the slide.
- The pin is washed and the process is repeated for a different probe.
- Most arrayers use several pins so that multiple probes are spotted simultaneously on a slide.
- Most arrayers print multiple slides together so that probes are deposited on several slides prior to washing.

Slide 58: Spotting Probes on the Microarray (Dan Nettleton, ISU Statistics 416/516X)
[Figure: 8 X 4 print head; microarray slide; plate with wells holding probes in solution]
All spots of the same color are made at the same time.
All spots in the same sector are made by the same pin.

Slide 59: cDNA Microarrays to Measure mRNA Levels (Dan Nettleton, ISU Statistics 416/516X)
- RNA is extracted from a target sample of interest.
- mRNAs are reverse transcribed into cDNA.
- The resulting cDNAs are labeled with a fluorescent dye and are incubated with the microarray slide.
- Dyed cDNA sequences hybridize to complementary probes spotted on the array.
- A laser excites the dye and a scanner records an image of the slide.
- The image is quantified to obtain measures of fluorescence intensity for each pixel.
- Pixel values are processed to obtain measures of mRNA abundance for each probe on the array.

Slide 60: cDNA Microarrays to Measure mRNA Levels (cont.) (Dan Nettleton, ISU Statistics 416/516X)
Usually two samples, dyed with different dyes, are hybridized to a single slide.
The dyes fluoresce at different wavelengths, so it is possible to get separate images for each dye.
Images from the scanner are black and white, but it is typical to display Cy3 images as green and Cy5 images as red.
It is common to superimpose the two images, using yellow to indicate a mixture of green and red.

Slide 61: Problems with cDNA Microarrays: Difficult to Make Meaningful Comparisons between Genes (Dan Nettleton, ISU Statistics 416/516X)
Measures of mRNA levels are affected by several factors that are partly or completely confounded with genes (e.g., EST source plate, EST well, print pin, slide position, length of mRNA sequence, base composition of mRNA sequence, specificity of probe sequence, etc.).
Within-gene comparisons of multiple cell types or across multiple treatment conditions are much more meaningful.

Slide 62: cDNA Microarrays to Measure mRNA Levels: Step 1: Prepare Microarray Slide & Sample mRNAs (Dan Nettleton, ISU Statistics 416/516X)
[Figure: Microarray Slide Spots (Probes): ACCTG...G, TTCTG...A, GGCTT...C, ATCTA...A, ACGGG...T, CGATA...G; Unknown mRNA Sequences (Target): Sample 1, Sample 2]

Slide 63: cDNA Microarrays to Measure mRNA Levels: Step 2: Convert mRNA to cDNA & label with Fluorescent Dyes (Dan Nettleton, ISU Statistics 416/516X)
[Figure: the same probe spots; Sample 1 and Sample 2 before and after labeling]

Slide 64:
[Figure: the same probe spots; Sample 1, Sample 2]
cDNA Microarrays to Measure mRNA Levels: Step 3: Mix Labeled cDNA and Hybridize to Slide (Dan Nettleton, ISU Statistics 416/516X)

Slide 65: cDNA Microarrays to Measure mRNA Levels: Step 5: Excite Dye with Laser, Scan & Quantify Signals (Dan Nettleton, ISU Statistics 416/516X)
[Figure: the same probe spots with quantified signals; Sample 1, Sample 2]

Slide 66: Pros/Cons of Spotted cDNA Arrays (Dan Nettleton, ISU Statistics 416/516X)
- Many sources of variation in the manufacture of these arrays: print tips, lab, etc.
- Contamination
- Uneven distribution
- Flexible: can put any cDNA on slide

Slide 67: DNA Oligonucleotide Chips (Dan Nettleton, ISU Statistics 416/516X)
An oligonucleotide microarray is a microarray whose probes consist of synthetically created DNA oligonucleotides.
Probe sequences are chosen to have good and relatively uniform hybridization characteristics.
A probe is chosen to match a portion of its target mRNA transcript that is unique to that sequence.
Oligo probes can distinguish among multiple mRNA transcripts with similar sequences.

Slide 71: Simplified Example (Dan Nettleton, ISU Statistics 416/516X)
[Figure: gene 1 and gene 2; shared green regions indicate high degree of sequence similarity throughout much of the transcript]
ATTACTAAGCATAGATTGCCGTATA = oligo probe for gene 1
GCGTATGGCATGCCCGGTAAACTGG = oligo probe for gene 2
...

Slide 72: Oligo Microarray Fabrication (Dan Nettleton, ISU Statistics 416/516X)
Oligos can be synthesized and stored in solution for spotting, as is done with cDNA microarrays.
Oligo sequences can be synthesized on a slide or chip using various commercial technologies.
In one approach, sequences are synthesized on a slide using ink-jet technology similar to that used in color printers. Separate cartridges for the four bases (A, C, G, T) are used to build nucleotides on a slide.
Affymetrix uses a photolithographic approach.

Slide 73: Affymetrix GeneChips (Dan Nettleton, ISU Statistics 416/516X)
Affymetrix (www.affymetrix.com) manufactures GeneChips, oligonucleotide arrays.
Each gene (or sequence of interest or feature) is represented by multiple short (25-nucleotide) oligo probes.
Some GeneChips include probes for around 60,000 genes.
mRNA that has been extracted from a biological sample can be labeled (dyed) and hybridized to a GeneChip in a manner similar to that described for cDNA microarrays.
Only one sample is hybridized to each GeneChip, rather than two as in the case of cDNA microarrays.

Slide 74: Affymetrix Probe Sets (Dan Nettleton, ISU Statistics 416/516X)
- A probe set is used to measure mRNA levels of a single gene.
- Each probe set consists of multiple probe cells.
- Each probe cell contains millions of copies of one oligo.
- Each oligo is intended to be 25 nucleotides in length.
- Probe cells in a probe set are arranged in probe pairs.
- Each probe pair contains a perfect match (PM) probe cell and a mismatch (MM) probe cell.
- A PM oligo perfectly matches part of a gene sequence.
- A MM oligo is identical to a PM oligo except that the middle nucleotide (13th of 25) is intentionally replaced by its complementary nucleotide.
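
The PM/MM relationship described above can be made concrete with a short Python sketch. It is only an illustration of the stated rule (swap the 13th of 25 bases for its complement); the function name `mismatch_probe` is hypothetical, and the input reuses the 25-base example oligo shown earlier for gene 1 purely as sample data, not as a real Affymetrix probe.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm_oligo):
    """Return the MM oligo: the PM oligo with its middle (13th of 25) base complemented."""
    if len(pm_oligo) != 25:
        raise ValueError("expected a 25-nucleotide oligo")
    middle = 12  # zero-based index of the 13th base
    return pm_oligo[:middle] + COMPLEMENT[pm_oligo[middle]] + pm_oligo[middle + 1:]

pm = "ATTACTAAGCATAGATTGCCGTATA"
mm = mismatch_probe(pm)
print(pm)
print(mm)
# The two oligos differ at exactly one position, the 13th base.
print([i + 1 for i, (a, b) in enumerate(zip(pm, mm)) if a != b])
```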

Slide 75: Affymetrix GeneChips (Dan Nettleton, ISU Statistics 416/516X)
PM - 25 bases complementary to gene
MM - Middle base is different
[Figure: spaced DNA probe pairs along the mRNA reference sequence (5' to 3'); labels: Probe Set, Probe Pair, PM Probe Cell, MM Probe Cell]
mRNA reference sequence: TGTGATGGTGGGAATGGGTCAGAAGGGACTCCTATGTGGGTGACGAGGCC
Perfect match oligo: TTACCCAGTCTTCCCTGAGGATACAC
Mismatch oligo: TTACCCAGTCTTGCCTGAGGATACAC

Slide 76: Different Probe Pairs Represent Different Parts of the Same Gene (Dan Nettleton, ISU Statistics 416/516X)
[Figure: probe pairs positioned along the gene sequence]
Probes are selected to be specific to the target gene and have good hybridization characteristics.

Slide 77: Obtaining Labeled Target for Affy Chips (Dan Nettleton, ISU Statistics 416/516X)
1. RNA -> single-stranded cDNA
2. Single-stranded cDNA -> double-stranded cDNA
3. Double-stranded cDNA -> labeled single-stranded cRNA complementary to coding sequence
Number of copies of each sequence gets amplified in conversion to cRNA.

Slide 78: Pros/Cons GeneChip Arrays (Dan Nettleton, ISU Statistics 416/516X)
- Consistent manufacture -> good standardization
- Comparable across experiments
- Design is time-consuming, good for large sets of chips
- Can only see what is on the chip

Slide 79: Affymetrix Data Processing Pipeline (Dan Nettleton, ISU Statistics 416/516X)
MicroArray Suite or other analysis software:
- Experiment preparation: *.exp file
- Image of the scanned probe array: *.dat file
- Probe Cell Intensity file: *.cel file
- Analysis output: *.chp file

Slide 80: Reminder: Why do microarray experiments? (Doina Caragea)
Compare two (or more) conditions to identify differentially expressed genes
- Control/treatment
- Disease/normal
Exploratory analysis
- What genes are expressed in response to drought stress?
- What gene expression changes occur during normal retinal development?
Diagnostic & prognostic tool development:
- Can we predict certain conditions (breast cancer vs normal)?
- Can we identify patterns of gene expression that predict a patient's response to treatment/drug?

Slide 81: Differential Gene Expression (Doina Caragea)
Are there significant differences in expression level between the conditions?
Analysis of Variance (ANOVA)
[Figure: Mutant 1 (Inoculated, Control) and Mutant 2 (Inoculated, Control)]

Slide 82: Exploratory Analysis (Doina Caragea)
Find patterns in data to see what genes are expressed under different conditions
Analysis includes clustering methods
Used when little or no prior knowledge exists about the problem

Slide 83: Classification (Doina Caragea)
Learn characteristic patterns from a training set and evaluate with a test set.
- Classify tumor types based on expression patterns
- Predict disease susceptibility, stages, etc.

Slide 84: Microarray data analysis (Doina Caragea)
Preprocessing
- normalization
- scatter plots
Inferential statistics
- t-test
- ANOVA
Exploratory (descriptive) statistics
- distances
- clustering
- principal components analysis (PCA)

Slide 85: Pre-processing (Page 191) (Doina Caragea)
The main goal of data preprocessing is to remove any systematic bias in the data as completely as possible, while preserving variation in gene expression that occurs because of biologically relevant changes in transcription.
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
- different labeling efficiencies of Cy3, Cy5
- uneven spotting of DNA onto an array surface
- variations in RNA purity or quantity
- variations in washing efficiency
- variations in scanning efficiency

Slide 86: Inferential statistics (Page 199) (Doina Caragea)
Inferential statistics are used to make inferences about a population from a sample.
Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: "There is no difference in signal intensity for the gene expression measurements in normal and diseased samples." The alternative hypothesis is that there is a difference.
We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to p < ...

Slide 87: Descriptive statistics (Page 203) (Doina Caragea)
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples.
Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation

Slide 88: Limitations of Microarrays (Doina Caragea)
- Link between proteins and expressed RNA not always clear
- Difficult to compare between microarray platforms
- Only see what is on the microarray
- Gene finding is still an art
- Other coding regions, "dark matter" on genome. But now microarrays for these are being developed, too!
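
As a closing sketch, the "arrange the data in a matrix, then apply a distance metric" recipe from the descriptive statistics slide can be combined with the agglomerative steps listed on the hierarchical clustering slide (each item starts as its own cluster; the two closest clusters are merged; repeat until one remains). The Python below is a naive illustration under stated assumptions, not the lecture's or Eisen et al.'s implementation: `agglomerate` is a hypothetical name, the data matrix is made up, and distances are recomputed from scratch at every step for clarity rather than speed.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(rows, linkage=min):
    """Bottom-up clustering of the rows of a data matrix.

    linkage=min gives the single-link criterion, linkage=max the complete-link one.
    Returns the list of merges as (cluster_a, cluster_b, distance), in merge order,
    where clusters are lists of row indices.
    """
    clusters = [[i] for i in range(len(rows))]  # step 0: each item is a cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two closest clusters under the chosen linkage.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    euclidean(rows[i], rows[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Made-up matrix: five "genes" (rows) measured under two conditions (columns).
expression = [[0, 0], [0, 1], [4, 0], [4, 1.5], [10, 0]]
for left, right, dist in agglomerate(expression):
    print(left, "+", right, "at distance", dist)
```

Reading the merge list from the top and stopping once the merge distance exceeds a chosen threshold corresponds to cutting the tree at that level, which is how the cluster size is controlled in practice.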