TRANSCRIPT
Page 1
A Soft Subspace Clustering Method for Text Data Using a Probability-Based Feature
Weighting Scheme
Abdul Wahid, Xiaoying Gao, Peter Andreae
Victoria University of Wellington
New Zealand
Soft subspace clustering
• Clustering normally uses all features
• Text data has too many features
• Subspace clustering uses subsets of features (subspaces)
• Soft: a feature has a weight in each subspace
Research questions
• What are the subspaces?
• How to define the weights (mapping features to subspaces)?
• LDA (Latent Dirichlet Allocation)
  – Topic modelling
  – Automatically detects topics
• Solution
  – Topics as subspaces
  – Weight: a word's probability in each topic
LDA: example by Edwin Chen
• Suppose you have the following set of sentences, and you want two topics:
• I like to eat broccoli and bananas.
• I ate a banana and spinach smoothie for breakfast.
• Chinchillas and kittens are cute.
• My sister adopted a kitten yesterday.
• Look at this cute hamster munching on a piece of broccoli.
LDA example by Edwin Chen
• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B
• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
Apply LDA
• Gibbs sampling
• Generate two matrices:
  – Document-topic matrix (θ)
  – Topic-term matrix (φ)
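As an illustration of obtaining the two matrices, the sketch below uses scikit-learn's LDA on the example sentences. Note this is an assumption for demonstration only: scikit-learn fits LDA with variational Bayes rather than the Gibbs sampling named on the slide, and the stop-word setting is a choice, not part of the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Edwin Chen's example sentences from the previous slide.
docs = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

# Bag-of-words document-term matrix.
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA with two topics (variational Bayes, not Gibbs sampling).
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# theta: document-topic matrix, one probability row per document.
theta = lda.fit_transform(X)

# phi: topic-term matrix, normalised so each topic row sums to 1.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

With real corpora the topic rows of `phi` supply the per-word weights used later in the clustering step.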
[Pipeline diagram: Documents → Preprocessing → LDA (Gibbs sampling) → θ (document-topic) and φ (topic-term) matrices → Assign initial clusters / Assign weights → Refine clusters]
Our DWKM algorithm
• K-means based algorithm
• Use LDA to get the two matrices
• Use the document-topic matrix to initialise the clusters
• Repeat
  – Calculate the centroid of each cluster
  – Assign each document to the nearest centroid
    • The distance measure is weighted by the topic-term matrix
• Until convergence
New distance measure
Weights: word probability in a topic
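As a concrete worked instance of this idea, here is a hand-computed example under the common soft-subspace form, where each squared coordinate difference is weighted by the word's probability in the cluster's topic. The three-word vocabulary and all the numbers are made up for illustration, not taken from the paper.

```python
import numpy as np

# Vocabulary: [broccoli, kitten, cute]; one document, one cluster centroid.
x = np.array([0.8, 0.1, 0.1])      # document's term vector
c = np.array([0.6, 0.2, 0.2])      # centroid of a hypothetical "food" cluster
w = np.array([0.30, 0.01, 0.02])   # word probabilities in the food topic (phi row)

# Weighted squared Euclidean distance: words that are rare in the topic
# (kitten, cute) contribute almost nothing to the distance.
d = np.sum(w * (x - c) ** 2)
```

Here `d = 0.30·0.04 + 0.01·0.01 + 0.02·0.01 = 0.0123`, driven almost entirely by the topic-relevant word.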
Subspace clustering: common approach vs our new approach

[Diagram] Common approach (hard and soft subspace clustering): randomly assign documents to clusters and randomly assign feature weights, then iteratively refine the clusters and the feature weights.

[Diagram] Our new approach: run LDA to extract semantic information, use it for initial cluster estimation and feature weighting, then refine the clusters using the feature weights.
Experiments
• Data sets
  – 4 synthetic data sets
  – 6 real data sets
• Evaluation metrics
  – Accuracy
  – F-measure
  – NMI (Normalized Mutual Information)
  – Entropy
• Compared with
  – K-means, LDA as a clustering method, FWKM, EWKM, FGKM
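NMI, one of the metrics above, is easy to compute with scikit-learn. The labels below are invented for illustration; the point is that NMI scores the grouping itself, ignoring which cluster id is which.

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # same grouping, permuted cluster ids

# NMI is invariant to label permutation: a perfect clustering scores 1.0.
nmi = normalized_mutual_info_score(true_labels, pred_labels)
```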
Results (synthetic data sets)

Dataset  Metric  K-means  LDA   FWKM  EWKM  FGKM  DWKM
SD1      Acc     0.65     0.66  0.77  0.69  0.82  0.87
SD1      F-M     0.63     0.65  0.73  0.59  0.75  0.81
SD2      Acc     0.63     0.68  0.76  0.72  0.87  0.92
SD2      F-M     0.64     0.69  0.75  0.63  0.82  0.88
SD3      Acc     0.62     0.64  0.67  0.70  0.94  0.94
SD3      F-M     0.62     0.63  0.64  0.59  0.91  0.92
SD4      Acc     0.60     0.61  0.61  0.69  0.91  0.93
SD4      F-M     0.59     0.60  0.60  0.58  0.88  0.90
Results
Conclusion
• A new soft subspace clustering algorithm
• A new distance measure
• Apply LDA to get semantic information
• Improved performance
Future work
• Non-parametric LDA model
  – No need to specify the number of topics
• Reduce computational complexity
• Use LDA to generate different candidate clustering solutions for clustering ensembles