![Page 1: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/1.jpg)
DNA Chips and Their Analysis
Comp. Genomics: Lecture 13
![Page 2: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/2.jpg)
What is a DNA Microarray?
• An experiment on the order of 10k elements
• A way to explore the function of a gene
• A snapshot of the expression levels of an entire genome under given test conditions
![Page 3: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/3.jpg)
Some Microarray Terminology
• Probe: ssDNA printed on the solid substrate (nylon or glass). These are the genes we are going to be testing
• Target: cDNA which has been labeled and is to be washed over the probe
![Page 4: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/4.jpg)
Microarray Fabrication
• Deposition of DNA fragments
– Deposition of PCR-amplified cDNA clones
– Printing of already synthesized oligonucleotides
• In situ synthesis
– Photolithography
– Ink Jet Printing
– Electrochemical Synthesis
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 5: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/5.jpg)
cDNA Microarrays and Oligonucleotide Probes
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
cDNA Arrays
• Long sequences
• Spot unknown sequences
• More variability
Oligonucleotide Arrays
• Short sequences
• Spot known sequences
• More reliable data
![Page 6: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/6.jpg)
In Situ Synthesis
• Photochemically synthesized on the chip
• Reduces noise caused by PCR, cloning, and spotting
• As previously mentioned, three kinds of in situ synthesis
– Photolithography
– Ink Jet Printing
– Electrochemical Synthesis
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 7: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/7.jpg)
Photolithography
• Similar to process used to build VLSI circuits
• Photolithographic masks are used to add each base
• If base is present, there will be a hole in the corresponding mask
• Can create high density arrays, but sequence length is limited
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
[Figure: photodeprotection through a mask, shown for base C]
![Page 8: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/8.jpg)
Ink Jet Printing
• Four cartridges are loaded with the four nucleotides: A, G, C, T
• As the printer head moves across the array, the nucleotides are deposited where they are needed
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 9: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/9.jpg)
Electrochemical Synthesis
• Electrodes are embedded in the substrate to manage individual reaction sites
• Electrodes are activated in necessary positions in a predetermined sequence that allows the sequences to be constructed base by base
• Solutions containing specific bases are washed over the substrate while the electrodes are activated
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 10: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/10.jpg)
Color Coding
• Tables are difficult to read
• Data is presented with a color scale
• Coding scheme:
– Green = repressed (less mRNA) gene in experiment
– Red = induced (more mRNA) gene in experiment
– Black = no change (1:1 ratio)
• Or
– Green = control condition (e.g. aerobic)
– Red = experimental condition (e.g. anaerobic)
• Only the ratio is used
Campbell & Heyer, 2003
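The first coding scheme can be sketched as a mapping from an expression ratio to an RGB color. This is only an illustration: the function name `ratio_to_rgb` and the saturation cutoff `max_log2` are assumptions made for this example and do not come from the lecture.

```python
import math


def ratio_to_rgb(ratio, max_log2=3.0):
    """Map an experiment/control ratio to an (R, G, B) triple in 0..255.

    max_log2 is an assumed cutoff: ratios of 2**max_log2 or more
    (or 2**-max_log2 or less) are drawn at full brightness.
    """
    log_ratio = math.log2(ratio)
    intensity = round(255 * min(abs(log_ratio), max_log2) / max_log2)
    if log_ratio > 0:
        return (intensity, 0, 0)   # induced: red
    if log_ratio < 0:
        return (0, intensity, 0)   # repressed: green
    return (0, 0, 0)               # 1:1 ratio: black


print(ratio_to_rgb(1))      # (0, 0, 0)
print(ratio_to_rgb(8))      # (255, 0, 0)
print(ratio_to_rgb(0.125))  # (0, 255, 0)
```

Working on the log of the ratio makes an 8-fold induction and an 8-fold repression equally bright, which a linear scale on the raw ratio would not.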
![Page 11: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/11.jpg)
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
![Page 12: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/12.jpg)
Application of Microarrays
• We only know the function of about 20% of the 30,000 genes in the Human Genome
– Gene exploration
– Faster and better
• Can be used for DNA computing
http://www.gene-chips.com/sample1.html
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 13: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/13.jpg)
A Data Mining Problem
• On a given Microarray we test on the order of 10k elements at a time
• Data is obtained faster than it can be processed
• We need some ways to work through this large data set and make sense of the data
![Page 14: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/14.jpg)
Example data: fold change (ratios)
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 1 8 12 16 12 8
Gene D 1 3 4 4 3 2
Gene E 1 4 8 8 8 8
Gene F 1 1 1 0.25 0.25 0.1
Gene G 1 2 3 4 3 2
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene I 1 4 8 4 1 0.5
Gene J 1 2 1 2 1 2
Gene K 1 1 1 1 3 3
Gene L 1 2 3 4 3 2
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Campbell & Heyer, 2003
![Page 15: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/15.jpg)
Example data: log2 transformation
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene C 0 3 3.58 4 3.58 3
Gene D 0 1.58 2 2 1.58 1
Gene E 0 2 3 3 3 3
Gene F 0 0 0 -2 -2 -3.32
Gene G 0 1 1.58 2 1.58 1
Gene H 0 -1 -1.60 -2 -1.60 -1
Gene I 0 2 3 2 0 -1
Gene J 0 1 0 1 0 1
Gene K 0 0 0 0 1.58 1.58
Gene L 0 1 1.58 2 1.58 1
Gene M 0 -1.60 -2 -2 -1.60 -1
Gene N 0 -3 -3.59 -4 -3.59 -3
Campbell & Heyer, 2003
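The transformation from the fold-change table to the log2 table can be reproduced with a few lines of Python. The sketch below uses three of the genes from the table; the dictionary layout and the helper name `log2_transform` are assumptions made for illustration.

```python
import math

# Fold-change ratios for three of the genes in the table above.
fold_changes = {
    "Gene C": [1, 8, 12, 16, 12, 8],
    "Gene F": [1, 1, 1, 0.25, 0.25, 0.1],
    "Gene N": [1, 0.125, 0.0833, 0.0625, 0.0833, 0.125],
}


def log2_transform(ratios):
    """Return log2 of each ratio, rounded to two decimals as in the table."""
    return [round(math.log2(r), 2) for r in ratios]


log_ratios = {gene: log2_transform(values) for gene, values in fold_changes.items()}
for gene, values in log_ratios.items():
    print(gene, values)
```

After the transformation, induction and repression are symmetric around zero: Gene C peaks at +4 (16-fold up) while Gene N bottoms out at -4 (16-fold down).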
![Page 16: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/16.jpg)
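Pearson Correlation Coefficient, r (values in the interval [-1, 1])
• Gene expression over time is a vector, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)
• Given two vectors X and Y that contain N elements, we calculate r as follows:
$$r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}$$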
Cho & Won, 2003
![Page 17: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/17.jpg)
Pearson Correlation Coefficient, r (cont.)
• X = Gene C = (0, 3.00, 3.58, 4, 3.58, 3)
• Y = Gene D = (0, 1.58, 2.00, 2, 1.58, 1)
• ∑XY = (0)(0)+(3)(1.58)+(3.58)(2)+(4)(2)+(3.58)(1.58)+(3)(1) = 28.5564
• ∑X = 3+3.58+4+3.58+3 = 17.16
• ∑X² = 3²+3.58²+4²+3.58²+3² = 59.6328
• ∑Y = 1.58+2+2+1.58+1 = 8.16
• ∑Y² = 1.58²+2²+2²+1.58²+1² = 13.9928
• N = 6
• ∑XY – ∑X∑Y/N = 28.5564 – (17.16)(8.16)/6 = 5.2188
• ∑X² – (∑X)²/N = 59.6328 – (17.16)²/6 = 10.5552
• ∑Y² – (∑Y)²/N = 13.9928 – (8.16)²/6 = 2.8952
• r = 5.2188 / sqrt((10.5552)(2.8952)) = 0.944
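The arithmetic of this worked example can be checked with a short Python sketch that follows the same sums step by step. The function name `pearson_r` is an assumption chosen for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation via the sum formula used in the worked example."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


gene_c = [0, 3.00, 3.58, 4, 3.58, 3]
gene_d = [0, 1.58, 2.00, 2, 1.58, 1]
print(round(pearson_r(gene_c, gene_d), 3))  # 0.944
```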
![Page 18: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/18.jpg)
Example data: Pearson correlation coefficient
Gene C Gene D Gene E Gene F Gene G Gene H Gene I Gene J Gene K Gene L Gene M Gene N
Gene C 1 0.94 0.96 -0.40 0.95 -0.95 0.41 0.36 0.23 0.95 -0.94 -1
Gene D 0.94 1 0.84 -0.10 0.94 -0.94 0.68 0.24 -0.07 0.94 -1 -0.94
Gene E 0.96 0.84 1 -0.57 0.89 -0.89 0.21 0.30 0.43 0.89 -0.84 -0.96
Gene F -0.40 -0.10 -0.57 1 -0.35 0.35 0.60 -0.43 -0.79 -0.35 0.10 0.40
Gene G 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene H -0.95 -0.94 -0.89 0.35 -1 1 -0.48 -0.21 -0.11 -1 0.94 0.95
Gene I 0.41 0.68 0.21 0.60 0.48 -0.48 1 0 -0.75 0.48 -0.68 -0.41
Gene J 0.36 0.24 0.30 -0.43 0.22 -0.21 0 1 0 0.22 -0.24 -0.36
Gene K 0.23 -0.07 0.43 -0.79 0.11 -0.11 -0.75 0 1 0.11 0.07 -0.23
Gene L 0.95 0.94 0.89 -0.35 1 -1 0.48 0.22 0.11 1 -0.94 -0.95
Gene M -0.94 -1 -0.84 0.10 -0.94 0.94 -0.68 -0.24 0.07 -0.94 1 0.94
Gene N -1 -0.94 -0.96 0.40 -0.95 0.95 -0.41 -0.36 -0.23 -0.95 0.94 1
Campbell & Heyer, 2003
![Page 19: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/19.jpg)
Example: Reorganization of data
Campbell & Heyer, 2003
Name 0 hours 2 hours 4 hours 6 hours 8 hours 10 hours
Gene M 1 0.33 0.25 0.25 0.33 0.5
Gene N 1 0.125 0.0833 0.0625 0.0833 0.125
Gene H 1 0.5 0.33 0.25 0.33 0.5
Gene K 1 1 1 1 3 3
Gene J 1 2 1 2 1 2
Gene E 1 4 8 8 8 8
Gene C 1 8 12 16 12 8
Gene L 1 2 3 4 3 2
Gene G 1 2 3 4 3 2
Gene D 1 3 4 4 3 2
Gene I 1 4 8 4 1 0.5
Gene F 1 1 1 0.25 0.25 0.1
![Page 20: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/20.jpg)
Pearson Rank Correlation Coefficient (Spearman's rank correlation)
• Replace each entry xi by its rank in vector x.
• Then compute Pearson correlation coefficients of rank vectors.
• Example: X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01), Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1)
• Ranks(X) = (1,2,4,6,5,3)
• Ranks(Y) = (1,3,5,6,4,2)
• Ties should be taken care of: (1) they are rare; (2) break them at random (small effect)
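The two steps (rank, then correlate) can be sketched in Python as follows. The helper names `ranks`, `pearson_r` and `rank_correlation` are assumptions for this example, and ties are broken by position here rather than at random, which is a simplification.

```python
import math


def ranks(values):
    """Rank of each entry (1 = smallest). Ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_correlation(x, y):
    """Pearson correlation of the rank vectors."""
    return pearson_r(ranks(x), ranks(y))


gene_c = [0, 3.00, 3.41, 4, 3.58, 3.01]
gene_d = [0, 1.51, 2.00, 2.32, 1.58, 1]
print(ranks(gene_c))  # [1, 2, 4, 6, 5, 3]
print(ranks(gene_d))  # [1, 3, 5, 6, 4, 2]
print(round(rank_correlation(gene_c, gene_d), 3))  # 0.886
```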
![Page 21: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/21.jpg)
Grouping and Reduction
• Grouping: discovers patterns in the data from a microarray
• Reduction: reduces the complexity of the data by removing redundant probes (genes), leaving a smaller set to be used in subsequent assays
![Page 22: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/22.jpg)
Unsupervised Grouping: Clustering
• Pattern discovery via grouping similarly expressed genes together
• Techniques most often used
– k-Means Clustering
– Hierarchical Clustering
– Biclustering
– Additional methods: Self-Organizing Maps (SOMs), plaid models, singular value decomposition (SVD), order preserving submatrices (OPSM), ...
![Page 23: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/23.jpg)
Clustering Overview
• Different similarity measures
– Pearson Correlation Coefficient
– Cosine Coefficient
– Euclidean Distance
– Information Gain
– Mutual Information
– Signal to noise ratio
– Simple Matching for Nominals
![Page 24: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/24.jpg)
Clustering Overview (cont.)
• Different Clustering Methods
– Unsupervised
  • k-means Clustering
  • Hierarchical Clustering
  • Self-organizing map
– Supervised
  • Support vector machine
  • Ensemble classifier
Data Mining
![Page 25: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/25.jpg)
Clustering Limitations
• Any data can be clustered, therefore we must be careful what conclusions we draw from our results
• Clustering is often randomized and can produce different results on different runs over the same data
![Page 26: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/26.jpg)
K-means Clustering
• Given a set of m data points in n-dimensional space and an integer k
• We want to find the set of k points in n-dimensional space that minimizes the mean squared Euclidean distance from each data point to its nearest center
• No exact polynomial-time algorithms are known for this problem (no wonder, it is NP-hard!)
“A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
![Page 27: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/27.jpg)
K-means Heuristic (Lloyd’s Algorithm)
• Has been shown to converge to a locally optimal solution
• But can converge to a solution arbitrarily bad compared to the optimal solution
• “K-means-type algorithms: A generalized convergence theorem and characterization of local optimality” by Selim and Ismail
• “A Local Search Approximation Algorithm for k-Means Clustering” by Kanungo et al.
[Figure: example with K=3 showing the data points, the optimal centers and the heuristic centers]
![Page 28: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/28.jpg)
Euclidean Distance
$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Now to find the distance between two points, say the origin O and the point A = (3,4):
$$d_E(O, A) = \sqrt{3^2 + 4^2} = 5$$
Simple and Fast! Remember this when we consider the complexity!
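As a quick check of the formula, here is a minimal Python sketch of the Euclidean distance for points of any dimension. The function name `euclidean_distance` is an assumption made for this example.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The cost is one pass over the n coordinates, which is why the distance is described as simple and fast.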
![Page 29: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/29.jpg)
Finding a Centroid
We use the following equation to find the n dimensional centroid point (center of mass) amid k (n dimensional) points:
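$$CP(x_1, x_2, \ldots, x_k) = \left( \frac{\sum_{i=1}^{k} x_{i,1}}{k}, \frac{\sum_{i=1}^{k} x_{i,2}}{k}, \ldots, \frac{\sum_{i=1}^{k} x_{i,n}}{k} \right)$$
where $x_{i,j}$ is the j-th coordinate of point $x_i$.
Example: Let’s find the centroid of three 2D points, say: (2,4) (5,2) (8,9)
$$CP = \left( \frac{2+5+8}{3}, \frac{4+2+9}{3} \right) = (5, 5)$$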
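The centroid is simply the coordinate-wise mean, which can be sketched in Python as below. The function name `centroid` is an assumption for this example.

```python
def centroid(points):
    """Coordinate-wise mean of k points, each an n-dimensional tuple."""
    k = len(points)
    # zip(*points) groups the first coordinates, the second coordinates, etc.
    return tuple(sum(coords) / k for coords in zip(*points))


print(centroid([(2, 4), (5, 2), (8, 9)]))  # (5.0, 5.0)
```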
![Page 30: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/30.jpg)
K-means Iterative Heuristic
1. Choose k initial center points “randomly”
2. Cluster data using Euclidean distance (or other distance metric)
3. Calculate new center points for each cluster, using only points within the cluster
4. Re-cluster all data using the new center points (this step could cause some data points to be placed in a different cluster)
5. Repeat steps 3 & 4 until no data points are moved from one cluster to another (stabilization), or until some other convergence criterion is met
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
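The iterative heuristic can be sketched in plain Python as follows. This is a minimal illustration, not the implementation from the cited book: the function name `k_means`, the optional `initial_centers` argument, the `seed` and the `max_iter` cap are all assumptions added so that the example is reproducible and always terminates.

```python
import math
import random


def k_means(points, k, initial_centers=None, seed=0, max_iter=100):
    """Lloyd-style k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Initial centers: given explicitly, or drawn at random from the data.
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # (Re-)cluster: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # stabilization: no point changed cluster
        assignment = new_assignment
        # Recompute each center from the points currently in its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, 2, initial_centers=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Drawing the initial centers from the data points themselves avoids starting with a center that attracts no points at all.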
![Page 31: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/31.jpg)
An example with 2 clusters
1. We Pick 2 centers at random
2. We cluster our data around these center points
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 32: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/32.jpg)
K-means example with k=2
3. We recalculate centers based on our current clusters
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 33: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/33.jpg)
K-means example with k=2
4. We re-cluster our data around our new center points
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 34: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/34.jpg)
K-means example with k=2
5. We repeat the last two steps until no more data points are moved into a different cluster
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 35: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/35.jpg)
Choosing k
• Run algorithm on data with several different values of k
• Use prior knowledge about the characteristics of your test (e.g. cancerous vs. non-cancerous tissues, in case the experiments are being clustered)
![Page 36: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/36.jpg)
Cluster Quality
• Since any data can be clustered, how do we know our clusters are meaningful?
– The size (diameter) of the cluster vs. the inter-cluster distance
– Distance between the members of a cluster and the cluster’s center
– Diameter of the smallest sphere containing the cluster
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 37: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/37.jpg)
Cluster Quality Continued
[Figure: clusters with diameter = 5 shown at distance = 20 and at distance = 5]
Quality of cluster assessed by ratio of distance to nearest cluster and cluster diameter
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
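One way to turn this ratio into code is sketched below. The slide does not say how the distance between clusters is measured, so the sketch assumes centroid-to-centroid distance, and takes the diameter as the largest pairwise distance within the cluster; the function names are likewise assumptions.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest pairwise distance between members of the cluster."""
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))


def centroid(cluster):
    """Coordinate-wise mean of the cluster members."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def quality(cluster, other_clusters):
    """Distance to the nearest other cluster divided by the cluster diameter.

    Assumption: inter-cluster distance is measured between centroids.
    """
    center = centroid(cluster)
    nearest = min(math.dist(center, centroid(o)) for o in other_clusters)
    return nearest / diameter(cluster)


a = [(0, 0), (3, 4)]        # diameter 5
far = [(20, 0), (23, 4)]    # centroid 20 away from a
near = [(5, 0), (8, 4)]     # centroid 5 away from a
print(quality(a, [far]))    # ratio close to 4: well separated
print(quality(a, [near]))   # ratio close to 1: poorly separated
```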
![Page 38: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/38.jpg)
Cluster Quality Continued
Quality can be assessed simply by looking at the diameter of a cluster (but is the diameter alone enough?)
A cluster can be formed by the heuristic even when there is no similarity between clustered patterns. This occurs because the algorithm forces k clusters to be created.
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 39: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/39.jpg)
Characteristics of k-means Clustering
• The random selection of initial center points creates the following properties
– Non-determinism
– May produce clusters without patterns
• One solution is to choose the centers randomly from existing patterns
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 40: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/40.jpg)
Heuristic’s Complexity
• Linear in the number of data points, N
• Can be shown to have run time cN, where c does not depend on N, but rather on the number of clusters, k
• Each distance computation is linear in the dimension n, so one iteration takes O(N·k·n) time
• The heuristic is efficient
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 41: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/41.jpg)
Hierarchical Clustering
- a different clustering paradigm
Figure Reproduced From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 42: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/42.jpg)
Hierarchical Clustering (cont.)
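| | Gene C | Gene D | Gene E | Gene F | Gene G | Gene H | Gene I | Gene J | Gene K | Gene L | Gene M | Gene N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gene C | | 0.94 | 0.96 | -0.40 | 0.95 | -0.95 | 0.41 | 0.36 | 0.23 | 0.95 | -0.94 | -1 |
| Gene D | | | 0.84 | -0.10 | 0.94 | -0.94 | 0.68 | 0.24 | -0.07 | 0.94 | -1 | -0.94 |
| Gene E | | | | -0.57 | 0.89 | -0.89 | 0.21 | 0.30 | 0.43 | 0.89 | -0.84 | -0.96 |
| Gene F | | | | | -0.35 | 0.35 | 0.60 | -0.43 | -0.79 | -0.35 | 0.10 | 0.40 |
| Gene G | | | | | | -1 | 0.48 | 0.22 | 0.11 | 1 | -0.94 | -0.95 |
| Gene H | | | | | | | -0.48 | -0.21 | -0.11 | -1 | 0.94 | 0.95 |
| Gene I | | | | | | | | 0 | -0.75 | 0.48 | -0.68 | -0.41 |
| Gene J | | | | | | | | | 0 | 0.22 | -0.24 | -0.36 |
| Gene K | | | | | | | | | | 0.11 | 0.07 | -0.23 |
| Gene L | | | | | | | | | | | -0.94 | -0.95 |
| Gene M | | | | | | | | | | | | 0.94 |
| Gene N | | | | | | | | | | | | |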
Campbell & Heyer, 2003
![Page 43: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/43.jpg)
Hierarchical Clustering (cont.)
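[Figure: genes F, C, G, D and E before clustering]

| | Gene C | Gene D | Gene E | Gene F | Gene G |
|---|---|---|---|---|---|
| Gene C | | 0.94 | 0.96 | -0.40 | 0.95 |
| Gene D | | | 0.84 | -0.10 | 0.94 |
| Gene E | | | | -0.57 | 0.89 |
| Gene F | | | | | -0.35 |
| Gene G | | | | | |

[Figure: C and E are joined as cluster 1]

| | 1 | Gene D | Gene F | Gene G |
|---|---|---|---|---|
| 1 | | 0.89 | -0.485 | 0.92 |
| Gene D | | | -0.10 | 0.94 |
| Gene F | | | | -0.35 |
| Gene G | | | | |

Average “similarity” of cluster 1 to
• Gene D: (0.94+0.84)/2 = 0.89
• Gene F: (-0.40+(-0.57))/2 = -0.485
• Gene G: (0.95+0.89)/2 = 0.92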
![Page 44: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/44.jpg)
Hierarchical Clustering (cont.)
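[Figure: F, G and D alongside cluster 1 (C, E)]

| | 1 | Gene D | Gene F | Gene G |
|---|---|---|---|---|
| 1 | | 0.89 | -0.485 | 0.92 |
| Gene D | | | -0.10 | 0.94 |
| Gene F | | | | -0.35 |
| Gene G | | | | |

[Figure: G and D are joined as cluster 2]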
![Page 45: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/45.jpg)
Hierarchical Clustering (cont.)
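[Figure: F alongside cluster 1 (C, E) and cluster 2 (G, D)]

| | 1 | 2 | Gene F |
|---|---|---|---|
| 1 | | 0.905 | -0.485 |
| 2 | | | -0.225 |
| Gene F | | | |

[Figure: clusters 1 and 2 are joined as cluster 3]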
![Page 46: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/46.jpg)
Hierarchical Clustering (cont.)
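[Figure: F alongside cluster 3, which joins cluster 1 (C, E) and cluster 2 (G, D)]

| | 3 | Gene F |
|---|---|---|
| 3 | | -0.355 |
| Gene F | | |

[Figure: cluster 3 and F are joined as cluster 4]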
![Page 47: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/47.jpg)
Hierarchical Clustering (cont.)
[Figure: final dendrogram over F, C, E, G and D with clusters 1 to 4]
Algorithm looks familiar?
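The merge sequence of this example can be reproduced with a small average-linkage sketch in Python. It is an illustration under stated assumptions: the function name `average_linkage`, the `frozenset`-keyed similarity dictionary and the choice to average over all member pairs are made for this example; for the five genes used here that average gives the same numbers as the slides (0.96, 0.94, 0.905, -0.355).

```python
from itertools import combinations

# Pairwise Pearson similarities for genes C, D, E, F and G (from the table).
similarity = {
    frozenset("CD"): 0.94, frozenset("CE"): 0.96, frozenset("CF"): -0.40,
    frozenset("CG"): 0.95, frozenset("DE"): 0.84, frozenset("DF"): -0.10,
    frozenset("DG"): 0.94, frozenset("EF"): -0.57, frozenset("EG"): 0.89,
    frozenset("FG"): -0.35,
}


def average_linkage(names, sim):
    """Repeatedly merge the two most similar clusters; return the merge list."""
    clusters = [(name,) for name in names]
    merges = []

    def cluster_similarity(a, b):
        # Average similarity over all pairs with one member in each cluster.
        pairs = [(x, y) for x in a for y in b]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: cluster_similarity(*pair))
        merges.append((a, b, cluster_similarity(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = average_linkage("CDEFG", similarity)
for left, right, score in merges:
    print(left, right, round(score, 3))
```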
![Page 48: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/48.jpg)
Clustering of entire yeast genome
Campbell & Heyer, 2003
![Page 49: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/49.jpg)
Hierarchical Clustering: Yeast Gene Expression Data
Eisen et al., 1998
![Page 50: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/50.jpg)
A SOFM Example With Yeast
“Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation” by Tamayo et al.
![Page 51: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/51.jpg)
SOM Description
• Each unit of the SOM has a weighted connection to all inputs
• As the algorithm progresses, neighboring units are grouped by similarity
[Figure: SOM with an input layer and an output layer]
From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
![Page 52: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/52.jpg)
An Example Using Color
Each color in the map is associated with a weight
From http://davis.wpi.edu/~matt/courses/soms/
![Page 53: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/53.jpg)
Cluster Analysis of Microarray Expression Data Matrices
Application of cluster analysis techniques in the elucidation of gene expression data
![Page 54: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/54.jpg)
Function of Genes
The features of a living organism are governed principally by its genes.
If we want to fully understand living systems we must know the function of each gene.
Once we know a gene’s sequence we can design experiments to find its function:
The Classical Approach of Assigning a Function to a Gene
[Figure: Gene X is deleted]
Conclusion: Gene X = left eye gene.
("A fly without legs is deaf")
However, this approach is too slow to handle all the gene sequence information we have today (HGSP).
![Page 55: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/55.jpg)
Microarray Analysis
Microarray analysis allows the monitoring of the activities of many genes over many different conditions.
Experiments are carried out on a physical matrix like the one below:
[Figure: physical matrix with genes G1–G7 as rows and conditions C1–C7 as columns; color scale: Low, Zero, High]
To facilitate computational analysis, the physical matrix, which may contain thousands of genes, is converted into a numerical matrix using image analysis equipment.
1.55 1.05 0.5 2.5 1.75 0.25 0.1
1.7 0.3 2.4 2.9 1.5 0.5 1.0
1.55 1.05 0.5 2.5 1.75 0.25 0.1
1.7 0.3 2.4 1.5 0.5 1.0
1.55 0.5 2.5 1.75 0.25 0.1
0.3 2.4 2.9 1.5 0.5 1.0
1.55 1.05 0.5 2.5 1.75 0.25 0.1
[Axes: Conditions (columns), Genes (rows)]
Possible inference:
If Gene X’s activity (expression) is affected by Condition Y (Extreme Heat), then Gene X may be involved in protecting the cellular components from extreme heat.
Each Gene has its corresponding Expression Profile for a set of conditions.
This Expression Profile may be thought of as a feature profile for that gene for that set of conditions (A condition feature profile).
![Page 56: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/56.jpg)
Cluster Analysis
• Cluster analysis is an unsupervised procedure which involves grouping of objects based on their similarity in feature space.
• In the gene expression context, genes are grouped based on the similarity of their condition feature profile.
• Cluster analysis was first applied to gene expression data from Brewer’s Yeast (Saccharomyces cerevisiae) by Eisen et al. (1998).
Two general conclusions can be drawn from these clusters:
• Genes clustered together may be related within a biological module/system.
• If there are genes of known function within a cluster, these may help to classify this biological module/system.
[Figure: clustering of the numerical matrix (Genes × Conditions) into clusters A, B and C, drawn over axes X, Y and Z]
Clusters A, B and C represent groups of related genes.
![Page 57: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/57.jpg)
From Data to Biological Hypothesis
Cluster C with four genes may represent System C.
Relating these genes aids in the elucidation of System C.
[Figure labels: Gene Expression Microarray (Genes 1–7, Conditions A–Z); Cluster Set (clusters A, B, C over axes X, Y); System C: External Stimulus (Condition X), Regulator Protein, Toxin, DNA (Gene a, Gene b, Gene c, Gene d), Gene Expression, Toxin Pump, Cell Membrane]
![Page 58: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/58.jpg)
Some Drawbacks of Clustering Biological Data
1. Clustering works well over small numbers of conditions but a typical microarray may have hundreds of experimental conditions. A global clustering may not offer sufficient resolution with so many features.
2. As with other clustering applications, it may be difficult to cluster noisy expression data.
3. Biological systems tend to be inter-related and may share numerous factors (genes). Clustering enforces partitions which may not accurately represent these relationships.
4. Clustering genes over all conditions only finds the strongest signals in the dataset as a whole. More ‘local’ signals within the data matrix may be missed.
[Figure: clusters A, B and C over axes X, Y and Z, annotated with “Inter-related (3)” and “Local Signals (4)”; these may represent a more complex system]
![Page 59: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/59.jpg)
How do we better model more complex systems?
• One technique that allows detection of all signals in the data is biclustering.
• Instead of clustering genes over all conditions, biclustering clusters genes with respect to subsets of conditions.
This enables better representation of:
- interrelated clusters (genes may belong to more than one bicluster).
- local signals (genes correlated over only a few conditions).
- noisy data (allows erratic genes to belong to no cluster).
![Page 60: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/60.jpg)
Biclustering
• Technique first described by J.A. Hartigan in 1972 and termed ‘Direct Clustering’.
• First introduced to microarray expression data by Cheng and Church (2000)
[Figure: data matrix of Genes 1–9 (rows) over conditions A–H (columns). Clustering over all conditions groups Genes 1, 4 and 9; biclustering groups Genes 1, 4, 6, 7 and 9 over conditions B, E and F]
Clustering misses the local signal {(B,E,F),(1,4,6,7,9)} present over a subset of conditions.
Biclustering discovers local coherences over a subset of conditions.
![Page 61: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/61.jpg)
Approaches to Biclustering Microarray Gene Expression
• First applied to gene expression data by Cheng and Church (2000).
– Used a sub-matrix scoring technique to locate biclusters.
• Tanay et al. (2002)
– Modelled the expression data on bipartite graphs and used graph techniques to find ‘complete graphs’ or biclusters.
• Lazzeroni and Owen
– Used matrix reordering to represent different ‘layers’ of signals (biclusters); ‘Plaid Models’ represent multiple signals within data.
• Ben-Dor et al. (2002)
– “Biclusters” depending on order relations (OPSM).
![Page 62: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/62.jpg)
Bipartite Graph Modelling
• First proposed in: “Discovering statistically significant biclusters in gene expression data”, Tanay et al., Bioinformatics 2002
Within the graph modelling paradigm biclusters are equivalent to complete bipartite sub-graphs.
Tanay and colleagues used probabilistic models to determine the least probable sub-graphs (those showing most order and consequently most surprising) to identify biclusters.
[Figure: data matrix M (genes 1–7 × conditions A–F) and the corresponding bipartite graph G between genes and conditions. The sub-matrix of genes 1, 4, 6 and conditions A, D is a bicluster and corresponds to the complete bipartite sub-graph H]
![Page 63: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/63.jpg)
The Cheng and Church Approach
The core element in this approach is a scoring scheme used to prioritise sub-matrices.
This score is based on the concept of the residue of an entry in a matrix.
In the matrix (I,J), the residue of element $a_{ij}$ is given by:
$$R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$
where $a_{iJ}$ is the average of row i, $a_{Ij}$ is the average of column j, and $a_{IJ}$ is the average of the whole matrix.
In words, the residue of an entry is the value of the entry minus the row average, minus the column average, plus the average value in the matrix.
This score gives an idea of how well the value $a_{ij}$ fits into the data in the surrounding matrix.
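As a minimal sketch of the residue definition (entry minus row average, minus column average, plus matrix average), the function below computes the residue of every entry of a small matrix. The function name `residue_matrix` and the 3 x 3 example data are assumptions made for illustration. In the example, the rows differ only by a constant offset except for one outlier in the bottom-right corner, which ends up with the largest residue.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def residue_matrix(matrix):
    """Residue of every entry: a_ij - row average - column average + matrix average."""
    row_means = [mean(row) for row in matrix]
    col_means = [mean(col) for col in zip(*matrix)]
    overall = mean(v for row in matrix for v in row)
    return [
        [a - row_means[i] - col_means[j] + overall for j, a in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Hypothetical data: additive pattern with a single outlier (11.0 instead of 5.0).
EXAMPLE = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 11.0],
]
RESIDUES = residue_matrix(EXAMPLE)
for row in RESIDUES:
    print([round(r, 3) for r in row])
```

By construction the residues in each row and each column sum to zero, so a single outlier shows up both as a large residue at its own position and as smaller residues spread across the rest of the matrix.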
![Page 64: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/64.jpg)
The Cheng and Church Approach (2)
The mean squared residue score (H) for a matrix (I,J) is then calculated as:
$$H(I,J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$
This global H score indicates how well the data fit together within that matrix: whether they show some coherence or are random.
A low H score means that the entries of the matrix are correlated.
- A score of H(I,J) = 0 would mean that the data in the matrix fluctuate in unison, i.e. the sub-matrix is a bicluster.
A high H value signifies that the data are uncorrelated.
- A matrix of random values spread uniformly over the range [a,b] has an expected H score of $(b-a)^2/12$. For the range [0,800], H(I,J) = $800^2/12 \approx 53{,}333$.
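The two extremes can be checked numerically with a short sketch. The function name `mean_squared_residue`, the matrix sizes and the random seed are assumptions chosen for illustration. The first matrix is built so that every row is the first row plus a constant, which should give H = 0. The second is filled with values drawn uniformly from [0, 800], for which H should land near the variance of that distribution, $(b-a)^2/12 \approx 53{,}333$.

```python
import random


def mean_squared_residue(matrix):
    """H(I,J): mean of the squared residues over all entries of the matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(matrix):
        for j, a in enumerate(row):
            total += (a - row_means[i] - col_means[j] + overall) ** 2
    return total / (n_rows * n_cols)


# Coherent case: each row equals the first row plus a constant shift.
coherent = [[base + shift for base in (10, 40, 25, 70)] for shift in (0, 5, 100)]
h_coherent = mean_squared_residue(coherent)

# Random case: hypothetical 50 x 50 matrix, uniform on [0, 800], fixed seed.
rng = random.Random(0)
noisy = [[rng.uniform(0, 800) for _ in range(50)] for _ in range(50)]
h_random = mean_squared_residue(noisy)

print("coherent:", h_coherent)
print("random:  ", h_random, "vs (b-a)^2/12 =", 800 ** 2 / 12)
```

The random score comes out slightly below $(b-a)^2/12$ on average, because subtracting row and column averages removes a small share of the variance; the effect shrinks as the matrix grows.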
![Page 65: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/65.jpg)
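Worked example of H score:
$$R(a_{ij}) = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$

| | Col 1 | Col 2 | Col 3 | Row Avg. |
|---|---|---|---|---|
| Row 1 | 1 | 2 | 3 | 2 |
| Row 2 | 4 | 5 | 6 | 5 |
| Row 3 | 7 | 8 | 9 | 8 |
| Row 4 | 10 | 11 | 12 | 11 |
| Col Avg. | 5.5 | 6.5 | 7.5 | |

Matrix (M) Avg. = 6.5
R(1) = 1 - 2 - 5.5 + 6.5 = 0
R(2) = 2 - 2 - 6.5 + 6.5 = 0
...
R(12) = 12 - 11 - 7.5 + 6.5 = 0
H(M) = (0 x 12)/12 = 0
If 5 were replaced with 3, the score would change to:
H(M2) = 2/12 ≈ 0.17
If the matrix were reshuffled randomly, the score would rise towards the value expected for uncorrelated data, roughly:
H(M3) ≈ (12 - 1)²/12 ≈ 10.08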
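The worked example can be reproduced with a few lines of code. This is a sketch only: the names `h_score` and `shuffled_copy`, the seed and the number of shuffles are assumptions. With exact column averages of 5.5, 6.5 and 7.5, every residue of the 4 x 3 matrix is exactly 0. Replacing the 5 with a 3 gives a sum of squared residues of 2, hence H = 2/12. For random reshuffles the score varies from shuffle to shuffle, and its average is somewhat lower than the (12 - 1)²/12 ≈ 10.08 estimate, because in a matrix this small the row and column averages absorb part of the spread.

```python
import random


def h_score(matrix):
    """Mean squared residue of a rectangular matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    overall = sum(sum(row) for row in matrix) / (n_rows * n_cols)
    squared = [
        (a - row_means[i] - col_means[j] + overall) ** 2
        for i, row in enumerate(matrix)
        for j, a in enumerate(row)
    ]
    return sum(squared) / len(squared)


def shuffled_copy(matrix, rng):
    """Matrix of the same shape with its entries randomly permuted."""
    flat = [v for row in matrix for v in row]
    rng.shuffle(flat)
    width = len(matrix[0])
    return [flat[k:k + width] for k in range(0, len(flat), width)]


M = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
M2 = [row[:] for row in M]
M2[1][1] = 3  # replace the 5 with a 3

rng = random.Random(1)
shuffled_scores = [h_score(shuffled_copy(M, rng)) for _ in range(2000)]
avg_shuffled = sum(shuffled_scores) / len(shuffled_scores)

print(h_score(M))    # 0.0: each row is the previous row plus 3
print(h_score(M2))   # about 0.17 (= 2/12)
print(avg_shuffled)  # average over 2000 random reshuffles
```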
![Page 66: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/66.jpg)
The Cheng and Church Approach: Node Deletion Biclustering Algorithm
To find all possible biclusters in an expression matrix, every sub-matrix would have to be tested using the H score, which is not feasible for matrices of realistic size, so a greedy search is used instead.
In a node deletion algorithm, all columns and rows are tested for deletion. If removing a row or column decreases the H score of the matrix, then it is removed.
This continues until it is not possible to decrease the H score further. The resulting low-H, coherent sub-matrix (bicluster) is then returned.
The process then masks the located bicluster by inserting random numbers in its place, and reiterates.
[Figure: successive rounds of Node Deletion; regions marked R are masked with random numbers.]
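The node deletion loop can be sketched as follows. This is an illustrative simplification, not a full implementation: it assumes a stopping threshold `delta` (used in Cheng and Church's paper but not mentioned on the slide), removes at each step the row or column with the largest mean squared residue, and leaves out the masking and re-iteration steps. A threshold is needed in practice because a single row always has H = 0, so deleting until the score cannot drop further can end in a trivial sub-matrix. The function names and the data are hypothetical.

```python
def h_and_contributions(matrix, rows, cols):
    """Return H of the sub-matrix plus the mean squared residue of each row and column."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)
    sq = {
        (i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
        for i in rows
        for j in cols
    }
    h = sum(sq.values()) / (len(rows) * len(cols))
    row_score = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_score = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, row_score, col_score


def single_node_deletion(matrix, delta):
    """Greedily drop the worst-fitting row or column until H <= delta."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while True:
        h, row_score, col_score = h_and_contributions(matrix, rows, cols)
        # Stop at the threshold, or when the sub-matrix cannot shrink below 2 x 2.
        if h <= delta or (len(rows) <= 2 and len(cols) <= 2):
            return rows, cols, h
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        drop_row = row_score[worst_row] >= col_score[worst_col]
        if (drop_row and len(rows) > 2) or len(cols) <= 2:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)


# Hypothetical data: rows 0, 1, 2 and 4 follow one additive pattern; row 3 does not.
DATA = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 6],
    [24, 5, 26, 7],
    [5, 6, 7, 8],
]
kept_rows, kept_cols, final_h = single_node_deletion(DATA, delta=0.5)
print(kept_rows, kept_cols, final_h)
```

In this example the incoherent row has by far the largest mean squared residue, so it is removed first and the remaining sub-matrix already satisfies the threshold.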
![Page 67: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/67.jpg)
The Cheng and Church Approach:
Some results on lymphoma data (4026 × 96), given as (no. of genes, no. of conditions) per bicluster:
(4, 96)   (10, 29)   (11, 25)
(103, 25)   (127, 13)   (13, 21)
(10, 57)   (2, 96)   (25, 12)
(9, 51)   (3, 96)   (2, 96)
![Page 68: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/68.jpg)
Conclusions:
• High throughput Functional Genomics (Microarrays) requires Data Mining Applications.
• Biclustering resolves Expression Data more effectively than single dimensional Cluster Analysis.
• The Cheng and Church approach offers a good base for future work.
Future Research/Questions:
• Implement a simple H score program to facilitate study of the H score concept.
• Are there other alternative scorings which would better apply to gene expression data?
• Have unbiclustered genes any significance? Horizontally transferred genes?
• Implement full scale biclustering program and look at better adaptation to expression data sets and the biological context.
![Page 69: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/69.jpg)
Support Vector Machines (cont.)
• Convex hull of points is the tightest enclosing polygon
• Maximum margin hyperplane
• Instances closest to hyperplane are called support vectors
• Support vectors define maximum margin hyperplane uniquely
[Figure: support vectors]
Witten & Frank, 2000
![Page 70: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/70.jpg)
References
• Basic microarray analysis: grouping and feature reduction, by Soumya Raychaudhuri, Patrick D. Sutphin, Jeffrey T. Chang and Russ B. Altman; Trends in Biotechnology Vol. 19 No. 5, May 2001
• Self Organizing Maps, Tom Germano, http://davis.wpi.edu/~matt/courses/soms
• “Data Analysis Tools for DNA Microarrays” by Sorin Draghici; Chapman & Hall/CRC 2003
• Self-Organizing-Feature-Maps versus Statistical Clustering Methods: A Benchmark, by A. Ultsch, C. Vetter; FG Neuroinformatik & Künstliche Intelligenz Research Report 0994
![Page 71: DNA Chips and Their Analysis Comp. Genomics: Lecture 13](https://reader035.vdocuments.site/reader035/viewer/2022062314/56649ed85503460f94be647b/html5/thumbnails/71.jpg)
References
• Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation, by Tamayo et al.
• A Local Search Approximation Algorithm for k-Means Clustering, by Kanungo et al.
• K-means-type algorithms: A generalized convergence theorem and characterization of local optimality, by Selim and Ismail