mutual information mathematical biology seminar 23.5.2005
Post on 19-Dec-2015
Mutual Information
Mathematical Biology Seminar 23.5.2005
1. Information Theory
Information and uncertainty are terms which describe any process that selects one or more objects from a set of objects.
Information Theory
Symbols A, B, C: Uncertainty = 3 symbols
Symbols 1, 2: Uncertainty = 2 symbols
Symbols A1, A2, B1, B2, C1, C2: Uncertainty = 6 symbols
Uncertainty = $\log(M)$, where $M$ is the number of symbols
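As a small illustration of why the logarithm is used, the sketch below checks that the uncertainties of the two independent symbol sets add up to the uncertainty of the combined set. The base-2 logarithm (bits) and the function name `uncertainty` are assumptions made for this example; the slide does not fix the base.

```python
from math import log2, isclose

def uncertainty(num_symbols):
    """Uncertainty log2(M), in bits, for M equally likely symbols."""
    return log2(num_symbols)

letters = ["A", "B", "C"]
digits = ["1", "2"]
# Combining the two devices gives A1, A2, B1, B2, C1, C2
combined = [a + d for a in letters for d in digits]

# log(3) + log(2) = log(6): uncertainties of independent choices add up
separate = uncertainty(len(letters)) + uncertainty(len(digits))
together = uncertainty(len(combined))
print(separate, together, isclose(separate, together))
```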
Information Theory
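Surprisal: $U_i = -\log_2 P_i$
$P_i \to 0 \Rightarrow U_i \to \infty$ (very surprised)
$P_i \to 1 \Rightarrow U_i \to 0$ (not surprised)
For $M$ equally likely symbols, $P = 1/M$: $\log(M) = -\log(M^{-1}) = -\log\left(\frac{1}{M}\right) = -\log(P)$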
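The surprisal $-\log_2 P_i$ can be sketched in a few lines. The function name `surprisal` is hypothetical; the code only illustrates that rare outcomes are highly surprising and certain outcomes are not.

```python
from math import log2

def surprisal(p):
    """Surprisal -log2(p), in bits, of an outcome with probability p in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    # log2(1 / p) equals -log2(p) and avoids printing -0.0 for p = 1
    return log2(1 / p)

for p in (1.0, 0.5, 0.01):
    print(p, surprisal(p))
```

An outcome with probability 1 carries no surprise, while the surprisal grows without bound as the probability approaches 0.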
Entropy (self information)
$X$ – a discrete random variable
$p(x)$ – probability distribution
$H(X) = -\sum_{x \in X} p(x) \log p(x)$
A measure of the uncertainty (information) of a discrete random variable.
How certain we are of the outcome.
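A minimal sketch of the entropy $H(X) = -\sum_x p(x) \log p(x)$, assuming base-2 logarithms and a distribution given as a plain list of probabilities (both choices are assumptions for this example):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H(X) = -sum p(x) log2 p(x), in bits."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p(x) = 0 contribute nothing (0 * log 0 is taken as 0)
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: most uncertain two-outcome case
print(entropy([0.9, 0.1]))   # biased coin: outcome is easier to predict
print(entropy([1.0, 0.0]))   # certain outcome: no uncertainty
```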
Entropy – properties:
Maximum entropy – a uniform distribution
$H(X) \ge 0$
$H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)} = E\left[\log_2 \frac{1}{p(x)}\right]$
Joint Entropy
A measure of the joint uncertainty of X and Y.
$H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y)$
$H(X,Y) \le H(X) + H(Y)$
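The following sketch computes the joint entropy from a small joint probability table and compares it with $H(X) + H(Y)$. The table layout `joint[x][y] = p(x, y)` and the two example tables are assumptions chosen for illustration.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def joint_entropy(joint):
    """H(X,Y) from a joint table joint[x][y] = p(x, y)."""
    return entropy([p for row in joint for p in row])

def marginals(joint):
    px = [sum(row) for row in joint]          # p(x): sum over y
    py = [sum(col) for col in zip(*joint)]    # p(y): sum over x
    return px, py

independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.4, 0.1], [0.1, 0.4]]
for joint in (independent, dependent):
    px, py = marginals(joint)
    print(joint_entropy(joint), entropy(px) + entropy(py))
```

For the independent table the two numbers coincide; for the dependent table the joint entropy is strictly smaller than the sum of the marginal entropies.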
Conditional Entropy
A measure of the uncertainty remaining in Y when X is known.
$H(Y|X) = \sum_{x \in X} p(x) H(Y|X=x)$
$= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log p(y|x)$
$= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(y|x) = -E\left[\log p(Y|X)\right]$
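As a hedged check of the definition, the sketch below computes $H(Y|X) = \sum_x p(x) H(Y|X=x)$ directly and compares it with $H(X,Y) - H(X)$, which must agree by the chain rule. The joint table and helper names are assumptions for this example.

```python
from math import log2, isclose

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(Y|X) = sum_x p(x) H(Y|X=x) for joint[x][y] = p(x, y)."""
    total = 0.0
    for row in joint:
        px = sum(row)
        if px > 0:
            # Normalising the row gives the conditional distribution p(y|x)
            total += px * entropy([pxy / px for pxy in row])
    return total

joint = [[0.4, 0.1], [0.1, 0.4]]
h_xy = entropy([p for row in joint for p in row])
h_x = entropy([sum(row) for row in joint])
print(conditional_entropy(joint), h_xy - h_x)
```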
Mutual Information
It is the reduction of uncertainty of one variable due to knowing about the other, or the amount of information one variable contains about the other.
$MI(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$
where $H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$
Normalized MI: $\overline{MI}(X,Y) = \frac{MI(X,Y)}{\max\{H(X), H(Y)\}}$
Mutual Information – properties:
$MI(X,Y) \ge 0$
$MI(X,Y) = 0$ only when $X, Y$ are independent: $H(X|Y) = H(X)$.
$MI(X,X) = H(X) - H(X|X) = H(X)$: entropy is the self-information.
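A minimal sketch of $MI(X,Y) = H(X) + H(Y) - H(X,Y)$ and of the normalized form $MI(X,Y) / \max\{H(X), H(Y)\}$, computed from a joint probability table. The table layout, the function names, and the convention of returning 0 when both entropies vanish are assumptions for this example.

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y) for joint[x][y] = p(x, y)."""
    h_x = entropy([sum(row) for row in joint])
    h_y = entropy([sum(col) for col in zip(*joint)])
    h_xy = entropy([p for row in joint for p in row])
    return h_x + h_y - h_xy

def normalized_mi(joint):
    """MI(X,Y) / max{H(X), H(Y)}; taken as 0 here when both entropies are 0."""
    h_x = entropy([sum(row) for row in joint])
    h_y = entropy([sum(col) for col in zip(*joint)])
    denom = max(h_x, h_y)
    return mutual_information(joint) / denom if denom > 0 else 0.0

independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]      # Y is a copy of X
dependent = [[0.4, 0.1], [0.1, 0.4]]
for joint in (independent, identical, dependent):
    print(mutual_information(joint), normalized_mi(joint))
```

The independent table gives zero MI, and the table where Y copies X gives $MI(X,X) = H(X)$, matching the properties listed on the slide.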
2. Applications:
• Clustering algorithms
• Clustering quality
Clustering algorithms
Motivation: MI’s capability to measure a general dependence among random variables. Use MI as a similarity measure.
Minimize the statistical correlation among clusters, in contrast to distance-based algorithms, which minimize the total variance within different clusters.
Clustering algorithms
Two methods:
1. Mutual-information – MI, PMI
2. Combined mutual-information and distance-based – MIK, MIF
MI – mutual information minimization
Grouping property: $MI(X,Y,Z) = MI(X,Y) + MI((X,Y),Z)$
1. Compute a proximity matrix based on pairwise mutual informations; assign n clusters such that each cluster contains exactly one object;
2. find the two closest clusters i and j;
3. create a new cluster (ij) by combining i and j;
4. delete the lines/columns with indices i and j from the proximity matrix, and add one line/column containing the proximities between cluster (ij) and all other clusters;
5. if the number of clusters is still > 2, go to (2); else join the two clusters and stop.
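The five steps can be sketched as a small agglomerative loop. This is only an illustration under several assumptions that the slide does not state: the data are discrete, MI is estimated by simple frequency counts, "closest" means largest MI, and the proximity of a merged cluster (ij) to another cluster is the MI between their joint variables (in the spirit of the grouping property). For brevity the proximities are recomputed each round instead of updating a matrix; all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(samples):
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def joint(columns, members):
    """Joint variable of a cluster: one tuple of member values per sample."""
    return list(zip(*(columns[m] for m in members)))

def mi_hierarchical(columns):
    clusters = [(i,) for i in range(len(columns))]   # step 1: one object per cluster
    merges = []
    while len(clusters) > 1:
        # step 2: the two closest clusters are those sharing the most information
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: mutual_information(
                joint(columns, pair[0]), joint(columns, pair[1])
            ),
        )
        # steps 3-4: replace clusters a and b by the combined cluster (ab)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges

columns = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],   # copy of variable 0
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],   # copy of variable 2
]
merges = mi_hierarchical(columns)
print(merges)
```

In this toy data, variables 0 and 1 are merged with each other, as are 2 and 3, before the two resulting clusters are joined in the last step.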
PMI – threshold based on pairwise mutual information
Mutual-information-based distance: $d(X,Y) = 1 - \overline{MI}(X,Y)$
1. Start with the first gene and group with it the genes that have the smallest mutual-information-based distance to it.
Repeat until no gene can be added without surpassing the threshold.
Then start with the second gene and repeat the same procedure (all genes are available).
Repeat for all genes.
2. The largest candidate cluster is selected.
3. Repeat 1 and 2 until K clusters are obtained.
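The procedure above can be sketched as follows. The slide does not say exactly what must stay below the threshold; this sketch assumes it is the largest distance between the new gene and any gene already in the candidate cluster. The function name `pmi_clusters` and the toy distance matrix are hypothetical, and the matrix entries stand in for $d(X,Y) = 1 - \overline{MI}(X,Y)$.

```python
def pmi_clusters(dist, threshold, k):
    """Greedy threshold clustering on a symmetric distance matrix dist[i][j]."""
    remaining = set(range(len(dist)))
    clusters = []
    while remaining and len(clusters) < k:
        best = []
        for seed in sorted(remaining):            # step 1: try every gene as a seed
            candidate = [seed]
            pool = remaining - {seed}
            while pool:
                # Gene whose worst distance to the candidate cluster is smallest
                nxt = min(sorted(pool),
                          key=lambda g: max(dist[g][m] for m in candidate))
                if max(dist[nxt][m] for m in candidate) > threshold:
                    break                         # adding it would pass the threshold
                candidate.append(nxt)
                pool.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        clusters.append(sorted(best))             # step 2: keep the largest candidate
        remaining -= set(best)                    # step 3: repeat on the rest
    return clusters

# Toy distances: genes 0-2 are close to each other, as are genes 3-4
dist = [
    [0.0, 0.1, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.9, 0.2, 0.0],
]
result = pmi_clusters(dist, threshold=0.5, k=2)
print(result)
```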
PMI
Threshold:
1. Mean of the distances of all gene pairs
2. Choose empirically
Optimal solution – simulated annealing algorithm (optimization).
Cost function: $f(s) = \sum_{i,j} MI(X_i, X_j)$
$s^* = \arg\min_{s \in S} f(s)$
Combined methods
Euclidean distance – positive correlation.
Mutual information – nonlinear correlation.
A small data sample size → combined algorithms
MIF - combined metric of MI and fuzzy membership distance
The objective function:
$h(s) = \lambda \frac{1}{M} \sum_{i=1}^{N} \sum_{k=1}^{K} u_{i,k}^{2} \, \|y_i - c_k\|^{2} + (1 - \lambda) \frac{2}{K(K-1)} f(s)$
$0 \le \lambda \le 1$ – a weight factor
$\frac{1}{M}$, $\frac{2}{K(K-1)}$ – normalization constants
Performance on simulated data
8 clustering algorithms.
Measure of performance: percentage of points placed into correct clusters.
1. 4 variables: $x_1, x_2, x_3, x_4 \sim \mathrm{Ber}(p)$, $p = 0.5$, with $p(x_1, x_2, x_3, x_4) = p(x_1, x_2)\, p(x_3, x_4)$
The sample size (M) is changed.
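A hedged sketch of data with this structure: two independent pairs of Bernoulli(0.5) variables, dependent within each pair. The slide does not say how the variables inside a pair are coupled, so the coupling used here (the second variable is a copy of the first with each bit flipped with probability `flip`) is purely an assumption, as are the function names and the seed.

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def simulate(m, flip=0.1, seed=0):
    """M samples of (x1, x2, x3, x4) with p(x1,x2,x3,x4) = p(x1,x2) p(x3,x4)."""
    rng = random.Random(seed)

    def noisy_pair():
        a = [rng.randint(0, 1) for _ in range(m)]      # Ber(0.5)
        b = [v ^ (rng.random() < flip) for v in a]     # assumed coupling: noisy copy
        return a, b

    x1, x2 = noisy_pair()
    x3, x4 = noisy_pair()
    return x1, x2, x3, x4

# Estimated MI within a pair and across pairs, for growing sample size M
for m in (30, 300, 3000):
    x1, x2, x3, x4 = simulate(m)
    print(m, mutual_information(x1, x2), mutual_information(x1, x3))
```

With few samples the estimate across independent pairs is noticeably above zero, which is one way to see why MI-based methods benefit from larger sample sizes.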
Performance on simulated data
Result (1):
1. The MI method outperforms Fuzzy, K-means, linkage, biclustering, and PMI.
2. MIF – best clustering accuracy.
3. MIK has similar performance to MI.
4. MI-based clustering methods – more accurate as the sample size increases.
Performance on simulated data
2. Different number of genes (N), M = 30
The data are generated according to: $p(X_1, X_2, \ldots, X_k) = p(X_1)\, p(X_2) \cdots p(X_k)$
Results (2):
In addition to the previous results:
1. Performance degrades as the number of genes increases.
2. The degree of degradation depends on the distributions governing the data.
Experimental Analysis
Clustering genes based on similarity of their expression patterns in a limited set of experiments. Genes with similar expression patterns are more likely to have similar biological function (although this does not necessarily provide the best possible grouping).
Higher entropy for a gene means that its expression data are more randomly distributed.
The higher the MI between two genes, the more likely it is that they have a biological relationship.
Experimental Analysis
579 genes from 26 human glioma surgical tissue samples.
526 genes after filtering out genes with insufficient variability.
Glioma
Gliomas are tumors that can be found in various parts of the brain. They arise from the support cells of the brain, the glial cells.
[Figure: binary profiles clustered by Fuzzy K-means and by MIF]
Experimental Analysis
Results (Fuzzy vs. MIF):
Two small clusters were broken out from the Fuzzy clusters.
While the number of genes that changed cluster is small, the decrease in error is significant (from 2.013 to 1.084).
Experimental Analysis
Results:
The results are the same for MIK and Fuzzy.
Compared with MIF and MIK, MI and PMI give different results.
Applications:
• Clustering algorithms
• Clustering quality
Clustering quality
What choice of number of clusters generally yields the most information about gene function (where function is known)?
9 different algorithms, 2 databases, 4 data sets.
A table of 6300 genes × 2000 attributes.
A contingency table for each cluster-attribute pair.
Clustering quality
Calculate entropies: $H(C)$, $H(A_i)$, $H(C, A_i)$
and the total MI between the cluster result C and all the attributes as:
$MI(C, A_1, A_2, \ldots, A_{N_A}) = \sum_i MI(C, A_i) = N_A H(C) + \sum_i H(A_i) - \sum_i H(A_i, C)$
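A minimal sketch of the total MI, $N_A H(C) + \sum_i H(A_i) - \sum_i H(A_i, C)$, with entropies estimated from label counts. The toy cluster labels and the two binary attribute columns are invented for illustration and are not data from the seminar.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def total_mi(clusters, attributes):
    """N_A * H(C) + sum_i H(A_i) - sum_i H(A_i, C) over all attribute columns."""
    n_a = len(attributes)
    h_c = entropy(clusters)
    h_a = sum(entropy(a) for a in attributes)
    # Joint entropy H(A_i, C) from the (attribute value, cluster) pairs per gene
    h_ac = sum(entropy(list(zip(a, clusters))) for a in attributes)
    return n_a * h_c + h_a - h_ac

clusters = [0, 0, 1, 1, 2, 2, 3, 3]      # cluster label of each gene
attributes = [
    [1, 1, 1, 1, 0, 0, 0, 0],            # follows the clusters: informative
    [1, 0, 1, 0, 1, 0, 1, 0],            # independent of the clusters
]
score = total_mi(clusters, attributes)
print(score)
```

Only the first attribute contributes here: it shares 1 bit with the clustering, while the second attribute shares none.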
1. How does MI change?
Given: 3000 genes, 30 clusters
Perform random swaps: the cluster sizes are held fixed, but the degree of correlation within the clusters is slowly destroyed.
Results:
1. MI decreases
2. MI converges to a non-zero value
2. Score the partition
1. Compute MI for the clustered data – $MI_{real}$
2. Compute MI for clusterings obtained by random swaps – $MI_{random}$; repeat until a distribution of values is obtained.
3. Compute z-score: $z = \frac{MI_{real} - MI_{random}}{s_{random}}$, where $MI_{random}$ is the mean and $s_{random}$ the standard deviation of the MI values from the randomized clusterings.
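The scoring steps can be sketched as below. Instead of individual random swaps, this sketch fully shuffles the cluster labels, which also keeps every cluster size fixed; that substitution, the number of randomizations, the seed, and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter
from math import log2
from statistics import mean, stdev

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def total_mi(clusters, attributes):
    """Sum over attributes of MI(C, A_i) = H(C) + H(A_i) - H(A_i, C)."""
    h_c = entropy(clusters)
    return sum(h_c + entropy(a) - entropy(list(zip(a, clusters)))
               for a in attributes)

def z_score(clusters, attributes, n_random=200, seed=0):
    rng = random.Random(seed)
    mi_real = total_mi(clusters, attributes)          # step 1
    shuffled = list(clusters)
    mi_random = []
    for _ in range(n_random):                         # step 2
        rng.shuffle(shuffled)    # permuting labels keeps every cluster size
        mi_random.append(total_mi(shuffled, attributes))
    # step 3 (stdev would be 0 only if every randomized MI were identical)
    return (mi_real - mean(mi_random)) / stdev(mi_random)

clusters = [g // 20 for g in range(60)]               # 3 clusters of 20 genes
attributes = [
    list(clusters),                                   # attribute tied to the clusters
    [g % 2 for g in range(60)],                       # attribute unrelated to them
]
z = z_score(clusters, attributes)
print(z)
```

A clustering that is strongly related to the attributes yields an MI far above the randomized distribution, and therefore a large z-score.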
Large z-score → greater distance between the real and the random MI → clustering results more significantly related to gene function.
Results:
1. Low cluster numbers
2. Clustering algorithms which produce nonuniform cluster size distributions perform better.
Conclusion – Advantages(1):
Very simple and natural hierarchical clustering algorithm (as MI estimates improve, the results should improve as well).
Optimal results when the sample size is large.
MI is a proximity measure that also recognizes negatively and nonlinearly correlated data, so it is a more general way to model relationships between genes.
MI is not biased by outliers. Euclidean distance is more easily distorted when variables are not uniformly distributed.
Conclusion – Advantages(2):
Expression levels can be modeled to include measurement noise.
Conclusion - Disadvantages:
In general, it is not easy to estimate MI (for example, for continuous random variables).
Performance degrades substantially as the number of genes increases.
Conclusion
It is not so accurate to look at each condition as an independent observation. Each point is significant.
There are analyses on datasets which do not miss any non-linear correlations.
It is more accurate as a validation method.