TRANSCRIPT
Page 1
A Soft Subspace Clustering Method for Text Data Using a Probability-Based Feature
Weighting Scheme
Abdul Wahid, Xiaoying Gao, Peter Andreae
Victoria University of Wellington
New Zealand
Soft subspace clustering
• Clustering normally uses all features
• Text data has too many features
• Subspace clustering uses subsets of features (subspaces)
• Soft: a feature has a weight in each subspace
Research questions
• What are the subspaces?
• How to define the weights (mapping features to subspaces)?
• LDA (Latent Dirichlet Allocation)
  – Topic modelling
  – Automatically detects topics
• Solution
  – Topics as subspaces
  – Weight: a word's probability in each topic
LDA: example by Edwin Chen
• Suppose you have the following set of sentences, and you want two topics:
• I like to eat broccoli and bananas.
• I ate a banana and spinach smoothie for breakfast.
• Chinchillas and kittens are cute.
• My sister adopted a kitten yesterday.
• Look at this cute hamster munching on a piece of broccoli.
LDA example by Edwin Chen
• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B
• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
Apply LDA
• Gibbs sampling
• Generate two matrices:
  – Document-topic matrix (θ)
  – Topic-term matrix (φ)
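As an illustration of obtaining the two matrices, the sketch below uses scikit-learn's LDA on the example sentences. Note this is an assumption for demonstration only: scikit-learn fits LDA with variational Bayes rather than the Gibbs sampling named on the slide, and the stop-word setting is a choice, not part of the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Edwin Chen's example sentences from the previous slide.
docs = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

# Bag-of-words document-term matrix.
X = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA with two topics (variational Bayes, not Gibbs sampling).
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# theta: document-topic matrix, one probability row per document.
theta = lda.fit_transform(X)

# phi: topic-term matrix, normalised so each topic row sums to 1.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

With real corpora the topic rows of `phi` supply the per-word weights used later in the clustering step.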
[Pipeline diagram: Documents → Preprocessing → LDA (Gibbs sampling) → θ (document-topic) and φ (topic-term) matrices → Assign initial clusters / Assign weights → Refine clusters]
Our DWKM algorithm
• K-means based algorithm
• Use LDA to get the two matrices
• Use the document-topic matrix to initialise the clusters
• Repeat
  – Calculate the centroid of each cluster
  – Assign each document to the nearest centroid
    • The distance measure is weighted by the topic-term matrix
• Until convergence
New distance measure
Weights: word probability in a topic
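As a concrete worked instance of this idea, here is a hand-computed example under the common soft-subspace form, where each squared coordinate difference is weighted by the word's probability in the cluster's topic. The three-word vocabulary and all the numbers are made up for illustration, not taken from the paper.

```python
import numpy as np

# Vocabulary: [broccoli, kitten, cute]; one document, one cluster centroid.
x = np.array([0.8, 0.1, 0.1])      # document's term vector
c = np.array([0.6, 0.2, 0.2])      # centroid of a hypothetical "food" cluster
w = np.array([0.30, 0.01, 0.02])   # word probabilities in the food topic (phi row)

# Weighted squared Euclidean distance: words that are rare in the topic
# (kitten, cute) contribute almost nothing to the distance.
d = np.sum(w * (x - c) ** 2)
```

Here `d = 0.30·0.04 + 0.01·0.01 + 0.02·0.01 = 0.0123`, driven almost entirely by the topic-relevant word.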
Subspace clustering: common approach vs our new approach

[Diagram] Common approach (hard and soft subspace clustering): randomly assign documents to clusters and randomly assign feature weights, then iteratively refine the clusters and the feature weights.

[Diagram] Our new approach: run LDA to extract semantic information, use it for initial cluster estimation and feature weighting, then refine the clusters using the feature weights.
Experiments
• Data sets
  – 4 synthetic data sets
  – 6 real data sets
• Evaluation metrics
  – Accuracy
  – F-measure
  – NMI (Normalized Mutual Information)
  – Entropy
• Compared with
  – K-means, LDA as a clustering method, FWKM, EWKM, FGKM
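NMI, one of the metrics above, is easy to compute with scikit-learn. The labels below are invented for illustration; the point is that NMI scores the grouping itself, ignoring which cluster id is which.

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]   # same grouping, permuted cluster ids

# NMI is invariant to label permutation: a perfect clustering scores 1.0.
nmi = normalized_mutual_info_score(true_labels, pred_labels)
```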
Results (synthetic data sets)

Dataset  Metric  K-means  LDA   FWKM  EWKM  FGKM  DWKM
SD1      Acc     0.65     0.66  0.77  0.69  0.82  0.87
SD1      F-M     0.63     0.65  0.73  0.59  0.75  0.81
SD2      Acc     0.63     0.68  0.76  0.72  0.87  0.92
SD2      F-M     0.64     0.69  0.75  0.63  0.82  0.88
SD3      Acc     0.62     0.64  0.67  0.70  0.94  0.94
SD3      F-M     0.62     0.63  0.64  0.59  0.91  0.92
SD4      Acc     0.60     0.61  0.61  0.69  0.91  0.93
SD4      F-M     0.59     0.60  0.60  0.58  0.88  0.90
Results
Conclusion
• A new soft subspace clustering algorithm
• A new distance measure
• Apply LDA to get semantic information
• Improved performance
Future work
• Non-parametric LDA model
  – No need to specify the number of topics
• Reduce computational complexity
• Use LDA to generate different candidate clustering solutions for clustering ensembles