Information Theoretic Clustering, Co-clustering and Matrix Approximations
Inderjit S. Dhillon University of Texas, Austin
Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha
Data Mining Seminar Series,
Mar 26, 2004
Clustering: Unsupervised Learning Grouping together of “similar” objects
Hard Clustering -- Each object belongs to a single cluster
Soft Clustering -- Each object is probabilistically assigned to clusters
Contingency Tables
Let X and Y be discrete random variables taking values in {1, 2, …, m} and {1, 2, …, n}.
p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data.
Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
Key Obstacles in Clustering Contingency Tables
High dimensionality, sparsity, noise; need for robust and scalable algorithms
Co-Clustering
Simultaneously:
• Cluster rows of p(X, Y) into k disjoint groups
• Cluster columns of p(X, Y) into l disjoint groups
Key goal is to exploit the "duality" between row and column clustering to overcome sparsity and noise
Co-clustering Example for Text Data
Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix
Co-clustering and Information Theory
View the "co-occurrence" matrix as a joint probability distribution over the row and column random variables X and Y.
We seek a "hard-clustering" of both rows and columns, into \hat{X} and \hat{Y}, such that the "information" in the compressed matrix is maximized.
Information Theory Concepts
Entropy of a random variable X with probability distribution p:
H(p) = -\sum_x p(x) \log p(x)
The Kullback-Leibler (KL) divergence or "relative entropy" between two probability distributions p and q:
KL(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
Mutual information between random variables X and Y:
I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
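To make these definitions concrete, here is a minimal NumPy sketch (not part of the talk; the function names are illustrative) that computes the three quantities for distributions given as arrays. Note that I(X; Y) is just KL(p(x, y) \| p(x) p(y)).

import numpy as np

def entropy(p):
    # H(p) = -sum_x p(x) log p(x), with 0 log 0 taken as 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(pxy):
    # I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # row marginals p(x)
    py = pxy.sum(axis=0, keepdims=True)   # column marginals p(y)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px * py)[nz]))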
"Optimal" Co-Clustering
Seek random variables \hat{X} and \hat{Y}, taking values in {1, 2, …, k} and {1, 2, …, l}, such that the mutual information I(\hat{X}; \hat{Y}) is maximized, where \hat{X} = R(X) is a function of X alone and \hat{Y} = C(Y) is a function of Y alone.
Related Work
Distributional Clustering: Pereira, Tishby & Lee (1993), Baker & McCallum (1998)
Information Bottleneck: Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002)
Probabilistic Latent Semantic Indexing: Hofmann (1999), Hofmann & Puzicha (1999)
Non-Negative Matrix Approximation: Lee & Seung (2000)
Information-Theoretic Co-clustering
Lemma: the "loss in mutual information" equals a KL divergence between the input distribution p and an approximation q to p:
I(X; Y) - I(\hat{X}; \hat{Y}) = KL( p(x, y) \| q(x, y) )
where q(x, y) = p(\hat{x}, \hat{y})\, p(x | \hat{x})\, p(y | \hat{y}) for x \in \hat{x}, y \in \hat{y}.
It can be shown that q(x, y) is a maximum entropy approximation subject to the cluster constraints, with
H(q(X, Y)) = H(\hat{X}, \hat{Y}) + H(X | \hat{X}) + H(Y | \hat{Y}).
Example:

p(x, y) =
.05 .05 .05  0   0   0
.05 .05 .05  0   0   0
 0   0   0  .05 .05 .05
 0   0   0  .05 .05 .05
.04 .04  0  .04 .04 .04
.04 .04 .04  0  .04 .04

With row clusters {x1, x2}, {x3, x4}, {x5, x6} and column clusters {y1, y2, y3}, {y4, y5, y6}:

p(x | \hat{x}) =
.5  0   0
.5  0   0
 0  .5  0
 0  .5  0
 0   0  .5
 0   0  .5

p(y | \hat{y}) =
.36 .36 .28  0   0   0
 0   0   0  .28 .36 .36

p(\hat{x}, \hat{y}) =
.3  0
 0  .3
.2  .2

q(x, y) = p(\hat{x}, \hat{y}) p(x | \hat{x}) p(y | \hat{y}) =
.054 .054 .042  0    0    0
.054 .054 .042  0    0    0
 0    0    0   .042 .054 .054
 0    0    0   .042 .054 .054
.036 .036 .028 .028 .036 .036
.036 .036 .028 .028 .036 .036

The number of parameters that determine q(x, y) is (m - k) + (kl - 1) + (n - l).
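Continuing the sketch, here is a small illustrative script (my construction, reusing the mutual_information and kl_divergence helpers above) that builds q(x, y) = p(\hat{x}, \hat{y}) p(x | \hat{x}) p(y | \hat{y}) for this example and checks numerically that the loss in mutual information equals KL(p \| q):

import numpy as np

def build_q(p, row_map, col_map, k, l):
    # q(x,y) = p(xhat,yhat) * p(x|xhat) * p(y|yhat), for x in xhat, y in yhat
    px, py = p.sum(axis=1), p.sum(axis=0)
    p_cc = np.zeros((k, l))                              # co-cluster joint p(xhat, yhat)
    for i in range(k):
        for j in range(l):
            p_cc[i, j] = p[np.ix_(row_map == i, col_map == j)].sum()
    p_x_given_xh = px / p_cc.sum(axis=1)[row_map]        # p(x|xhat) = p(x)/p(xhat)
    p_y_given_yh = py / p_cc.sum(axis=0)[col_map]        # p(y|yhat) = p(y)/p(yhat)
    q = p_cc[np.ix_(row_map, col_map)] * np.outer(p_x_given_xh, p_y_given_yh)
    return q, p_cc

p = np.array([[.05, .05, .05, 0,   0,   0  ],
              [.05, .05, .05, 0,   0,   0  ],
              [0,   0,   0,   .05, .05, .05],
              [0,   0,   0,   .05, .05, .05],
              [.04, .04, 0,   .04, .04, .04],
              [.04, .04, .04, 0,   .04, .04]])
row_map = np.array([0, 0, 1, 1, 2, 2])   # row clusters {x1,x2}, {x3,x4}, {x5,x6}
col_map = np.array([0, 0, 0, 1, 1, 1])   # column clusters {y1,y2,y3}, {y4,y5,y6}

q, p_cc = build_q(p, row_map, col_map, k=3, l=2)
loss = mutual_information(p) - mutual_information(p_cc)   # I(X;Y) - I(Xhat;Yhat)
print(np.round(q, 3))                                      # matches q(x,y) above
print(round(loss, 6), round(kl_divergence(p.ravel(), q.ravel()), 6))  # equal, per the lemma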
Decomposition Lemma
Question: how do we minimize KL( p(x, y) \| q(x, y) )? The following lemma reveals the answer:
KL( p(x, y) \| q(x, y) ) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x)\, KL( p(y | x) \| q(y | \hat{x}) )
where q(y | \hat{x}) = p(y | \hat{y})\, p(\hat{y} | \hat{x}) for y \in \hat{y}. Note that q(y | \hat{x}) may be thought of as the "prototype" of row cluster \hat{x}.
Similarly,
KL( p(x, y) \| q(x, y) ) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y)\, KL( p(x | y) \| q(x | \hat{y}) )
where q(x | \hat{y}) = p(x | \hat{x})\, p(\hat{x} | \hat{y}) for x \in \hat{x}.
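A short continuation of the sketch (again illustrative, reusing p, q, row_map and kl_divergence from above) that extracts the row-cluster prototypes q(y | \hat{x}) and checks the row-side decomposition numerically:

def row_prototypes(p, q, row_map, k):
    # q(y|xhat): rows of q that lie in the same row cluster share one conditional
    # distribution over y, so take it from any member row (q(x) equals p(x))
    qx = q.sum(axis=1)
    protos = np.zeros((k, p.shape[1]))
    for xh in range(k):
        member = np.flatnonzero(row_map == xh)[0]
        protos[xh] = q[member] / qx[member]
    return protos

px = p.sum(axis=1)
protos = row_prototypes(p, q, row_map, k=3)
row_side = sum(px[x] * kl_divergence(p[x] / px[x], protos[row_map[x]])
               for x in range(p.shape[0]))
print(round(row_side, 6), round(kl_divergence(p.ravel(), q.ravel()), 6))  # the two agree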
Co-Clustering Algorithm
[Step 1] Set i = 0. Start with an initial co-clustering (R_i, C_i). Compute the approximation q^{[i,i]}.
[Step 2] For every row x, assign it to the row cluster \hat{x} that minimizes KL( p(y | x) \| q^{[i,i]}(y | \hat{x}) ). This gives R_{i+1}.
[Step 3] We have (R_{i+1}, C_i). Compute q^{[i+1,i]}.
[Step 4] For every column y, assign it to the column cluster \hat{y} that minimizes KL( p(x | y) \| q^{[i+1,i]}(x | \hat{y}) ). This gives C_{i+1}.
[Step 5] We have (R_{i+1}, C_{i+1}). Compute q^{[i+1,i+1]}. Set i = i + 1 and iterate Steps 2-5 until convergence.
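A compact sketch of the alternating procedure in Steps 1-5 (an illustrative implementation under my own naming, not the authors' code): rows and then columns are re-assigned to the cluster whose prototype is closest in KL divergence, and the prototypes are recomputed after each pass.

import numpy as np

def itcc(p, k, l, n_iters=20, seed=0):
    # Information-theoretic co-clustering sketch: alternating row and column passes
    rng = np.random.default_rng(seed)
    m, n = p.shape
    row_map = rng.integers(0, k, size=m)     # initial row clustering R_0
    col_map = rng.integers(0, l, size=n)     # initial column clustering C_0
    px, py = p.sum(axis=1), p.sum(axis=0)
    eps = 1e-12

    def cocluster_joint(row_map, col_map):
        # p(xhat, yhat) for the current co-clustering
        p_cc = np.zeros((k, l))
        for i in range(k):
            for j in range(l):
                p_cc[i, j] = p[np.ix_(row_map == i, col_map == j)].sum()
        return p_cc

    for _ in range(n_iters):
        # row pass: assign each row x to argmin_xhat KL( p(y|x) || q(y|xhat) )
        p_cc = cocluster_joint(row_map, col_map)
        p_y_given_yh = py / np.maximum(p_cc.sum(axis=0)[col_map], eps)
        q_y_given_xh = (p_cc / np.maximum(p_cc.sum(axis=1, keepdims=True), eps))[:, col_map] * p_y_given_yh
        p_y_given_x = p / np.maximum(px[:, None], eps)
        # minimizing KL is equivalent to maximizing sum_y p(y|x) log q(y|xhat)
        row_map = np.argmax(p_y_given_x @ np.log(q_y_given_xh + eps).T, axis=1)

        # column pass: assign each column y to argmin_yhat KL( p(x|y) || q(x|yhat) )
        p_cc = cocluster_joint(row_map, col_map)
        p_x_given_xh = px / np.maximum(p_cc.sum(axis=1)[row_map], eps)
        q_x_given_yh = (p_cc / np.maximum(p_cc.sum(axis=0, keepdims=True), eps)).T[:, row_map] * p_x_given_xh
        p_x_given_y = p.T / np.maximum(py[:, None], eps)
        col_map = np.argmax(p_x_given_y @ np.log(q_x_given_yh + eps).T, axis=1)
    return row_map, col_map

On the 6x6 example above, itcc(p, k=3, l=2) typically recovers the row clusters {x1, x2}, {x3, x4}, {x5, x6} and the column clusters {y1, y2, y3}, {y4, y5, y6} up to relabeling, although, since only a local minimum is guaranteed, a poor random start can converge elsewhere.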
Properties of Co-clustering Algorithm
• Main Theorem: co-clustering "monotonically" decreases the loss in mutual information
• Co-clustering converges to a local minimum
• Can be generalized to multi-dimensional contingency tables
• q can be viewed as a "low complexity" non-negative matrix approximation
• q preserves the marginals of p, and the co-cluster statistics
• Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality
• Computationally economical
Example (algorithm in action): starting from a poor initial co-clustering of the same p(x, y), each pass of row and column re-assignment produces a better approximation q(x, y); the corresponding p(x | \hat{x}), p(y | \hat{y}) and p(\hat{x}, \hat{y}) are recomputed at every step. The successive approximations q(x, y) are:

Initial co-clustering:
.029 .029 .019 .022 .024 .024
.036 .036 .014 .028 .018 .018
.018 .018 .028 .014 .036 .036
.018 .018 .028 .014 .036 .036
.039 .039 .025 .030 .032 .032
.039 .039 .025 .030 .032 .032

After the next pass:
.036 .036 .014 .028 .018 .018
.036 .036 .014 .028 .018 .018
.019 .019 .026 .015 .034 .034
.019 .019 .026 .015 .034 .034
.043 .043 .022 .033 .028 .028
.025 .025 .035 .020 .046 .046

After a further pass:
.054 .054 .042  0    0    0
.054 .054 .042  0    0    0
.013 .013 .010 .031 .041 .041
.013 .013 .010 .031 .041 .041
.028 .028 .022 .033 .043 .043
.017 .017 .013 .042 .054 .054

At convergence, q(x, y) equals the approximation shown earlier:
.054 .054 .042  0    0    0
.054 .054 .042  0    0    0
 0    0    0   .042 .054 .054
 0    0    0   .042 .054 .054
.036 .036 .028 .028 .036 .036
.036 .036 .028 .028 .036 .036
Applications -- Text Classification
Assigning class labels to text documents; training and testing phases.
[Diagram: a document collection, grouped into classes (Class-1 … Class-m), forms the training data; a classifier learns from the training data and assigns a class to each new document.]
Feature Clustering (dimensionality reduction)
Feature Selection:
[Diagram: document → bag-of-words → vector of words (1 … m)]
• Select the "best" words
• Throw away the rest
• Frequency based pruning
• Information criterion based pruning
Feature Clustering:
[Diagram: document → bag-of-words → vector of words (1 … m) → word clusters (Cluster#1 … Cluster#k)]
• Do not throw away words
• Cluster words instead
• Use clusters as features
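To make the contrast concrete, here is a small illustrative sketch (toy data and my own function names) of turning a document-by-word count matrix into either selected-word features or word-cluster features:

import numpy as np

def select_features(counts, keep_idx):
    # feature selection: keep counts only for the selected word indices
    return counts[:, keep_idx]

def cluster_features(counts, word_cluster):
    # feature clustering: word_cluster[w] is the cluster of word w; the new
    # feature j of a document is the total count of its words in cluster j
    n_clusters = word_cluster.max() + 1
    reduced = np.zeros((counts.shape[0], n_clusters))
    for w, c in enumerate(word_cluster):
        reduced[:, c] += counts[:, w]          # no word is thrown away
    return reduced

# toy example: 3 documents, 6 words, 2 word clusters
counts = np.array([[2, 0, 1, 0, 3, 0],
                   [0, 1, 0, 2, 0, 1],
                   [1, 1, 1, 1, 1, 1]])
word_cluster = np.array([0, 1, 0, 1, 0, 1])
print(cluster_features(counts, word_cluster))  # 3 documents x 2 cluster features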
Experiments
Data sets:
• 20 Newsgroups data: 20 classes, 20000 documents
• Classic3 data set: 3 classes (cisi, med and cran), 3893 documents
• Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14538 words; available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
Implementation details: the Bow toolkit was used for indexing, co-clustering, clustering and classifying
Results (20Ng)
Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split
Divisive clustering beats feature selection algorithms by a large margin
The effect is more significant when fewer features are used
Results (Dmoz)
Classification accuracy on Dmoz data with a 1/3-2/3 test-train split
Divisive Clustering is better when fewer features are used
Note the contrasting behavior of Naïve Bayes and SVMs
Results (Dmoz)
Naïve Bayes on Dmoz data with only 2% training data
Note that Divisive Clustering achieves a higher maximum than IG, a significant 13% increase
Divisive Clustering performs better than IG when less training data is available
Hierarchical Classification
[Hierarchy: Science → Math (Number Theory, Logic), Physics (Mechanics, Quantum Theory), Social Science (Economics, Archeology)]
• A flat classifier builds a classifier over the leaf classes in the above hierarchy
• A hierarchical classifier builds a classifier at each internal node of the hierarchy
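A minimal sketch of the routing idea (hypothetical class and function names; any classifier with sklearn-style fit/predict could be plugged in): one classifier is trained per internal node, and a test document descends the hierarchy until it reaches a leaf.

class HierarchicalClassifier:
    # One classifier per internal node; a document is routed from the root to a
    # leaf by choosing a child at each internal node (illustrative sketch).
    def __init__(self, tree, make_classifier):
        self.tree = tree                      # internal node -> list of children
        self.make_classifier = make_classifier
        self.models = {}

    def _leaves_under(self, node):
        if node not in self.tree:             # a leaf class
            return [node]
        return [leaf for child in self.tree[node] for leaf in self._leaves_under(child)]

    def fit(self, X, leaf_labels, node="Science"):
        if node not in self.tree:
            return
        # relabel each training document by the child subtree containing its leaf class
        child_of = {leaf: child for child in self.tree[node]
                    for leaf in self._leaves_under(child)}
        idx = [i for i, y in enumerate(leaf_labels) if y in child_of]
        model = self.make_classifier()
        model.fit(X[idx], [child_of[leaf_labels[i]] for i in idx])
        self.models[node] = model
        for child in self.tree[node]:
            self.fit(X, leaf_labels, node=child)

    def predict_one(self, x, node="Science"):
        while node in self.tree:               # descend until a leaf is reached
            node = self.models[node].predict(x.reshape(1, -1))[0]
        return node

# hierarchy from the slide
tree = {"Science": ["Math", "Physics", "Social Science"],
        "Math": ["Number Theory", "Logic"],
        "Physics": ["Mechanics", "Quantum Theory"],
        "Social Science": ["Economics", "Archeology"]}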
Results (Dmoz)
[Plot: % accuracy vs. number of features (5 to 10000) on Dmoz data for Hierarchical, Flat(DC) and Flat(IG) classifiers]
• Hierarchical classifier: Naïve Bayes at each node
• Hierarchical classifier: 64.54% accuracy at just 10 features (Flat achieves 64.04% accuracy at 1000 features)
• Hierarchical classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers
Anecdotal Evidence
Top few words sorted in clusters obtained by the divisive and agglomerative approaches on 20 Newsgroups data:
Cluster 10, Divisive Clustering (rec.sport.hockey): team, game, play, hockey, season, boston, chicago, pit, van, nhl
Cluster 9, Divisive Clustering (rec.sport.baseball): hit, runs, baseball, base, ball, greg, morris, ted, pitcher, hitting
Cluster 12, Agglomerative Clustering (rec.sport.hockey and rec.sport.baseball): team, detroit, hockey, pitching, games, hitter, players, rangers, baseball, nyi, league, morris, player, blues, nhl, shots, pit, vancouver, buffalo, ens
Co-Clustering Results (CLASSIC3)
[Table: confusion matrices on the CLASSIC3 collection for 1-D Clustering (accuracy 0.821) vs. Co-Clustering (accuracy 0.9835)]
Results – Binary (subset of 20Ng data)
[Table: confusion matrices for 1-D clustering vs. co-clustering on the Binary and Binary_subject subsets; Binary_subject (0.946, 0.648), Binary (0.852, 0.67)]
Precision – 20Ng data

Dataset          Co-clustering  1D-clustering  IB-Double / IDC
Binary           0.98           0.64           0.70
Binary_Subject   0.96           0.67           0.85
Multi5           0.87           0.34           0.5
Multi5_Subject   0.89           0.37           0.88
Multi10          0.56           0.17           0.35
Multi10_Subject  0.54           0.19           0.55
Results: Sparsity (Binary_subject data)
Results (Monotonicity)
Conclusions
• An information-theoretic approach to clustering, co-clustering and matrix approximation
• Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality
• The theoretical approach has the potential to extend to other problems: multi-dimensional co-clustering, MDL to choose the number of co-clusters, generalized co-clustering via Bregman divergences
More Information
Email: inderjit@cs.utexas.edu
Papers are available at: http://www.cs.utexas.edu/users/inderjit
• "Divisive Information-Theoretic Feature Clustering for Text Classification", Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD, 2002).
• "Information-Theoretic Co-clustering", Dhillon, Mallela & Modha, KDD, 2003.
• "Clustering with Bregman Divergences", Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004.
• "A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation", Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004.