


A Soft Subspace Clustering Method for Text Data Using a Probability Based Feature Weighting Scheme

Abdul Wahid, Xiaoying Gao, Peter Andreae

Victoria University of Wellington

New Zealand


Soft subspace clustering

• Clustering normally uses all features
• Text data has too many features
• Subspace clustering uses subsets of features (subspaces)
• Soft: each feature has a weight in each subspace


Research questions

• What are the subspaces?
• How to define the weights?
  – Feature to subspace
• LDA (Latent Dirichlet Allocation)
  – Topic modelling
  – Automatically detects topics
• Solution
  – Topics as subspaces
  – Weight: word probability in each topic


LDA: example by Edwin Chen

• Suppose you have the following set of sentences, and you want two topics:
  – I like to eat broccoli and bananas.
  – I ate a banana and spinach smoothie for breakfast.
  – Chinchillas and kittens are cute.
  – My sister adopted a kitten yesterday.
  – Look at this cute hamster munching on a piece of broccoli.


LDA: example by Edwin Chen

• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B
• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
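The percentages in this example are the two kinds of distribution that LDA estimates: a mixture over topics for each document and a distribution over words for each topic. Writing θ for the document-topic shares and φ for the topic-word probabilities (the symbols and indices are assumed here for illustration), the probability of word w in document d is the mixture

```latex
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\, \phi_{k,w}
```

For sentence 5 and the word "broccoli", Topic A alone contributes 0.6 × 0.30 = 0.18; the share of "broccoli" under Topic B is not listed in the example, so the full sum cannot be completed from the slide.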


Apply LDA

• Gibbs sampling
• Generate two matrices
  – Topic-document matrix
  – Topic-term matrix
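To make the Gibbs sampling step concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is not the authors' implementation: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the hand-tokenised sentences are all assumptions chosen for illustration. It returns the two matrices as `theta` (documents × topics) and `phi` (topics × terms).

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (illustrative).

    Returns (theta, phi, vocab): theta[d][k] is the share of topic k in
    document d; phi[k][v] is the probability of vocab[v] under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), n_topics

    doc_topic = [[0] * K for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(K)]   # tokens of word v in topic k
    topic_total = [0] * K                      # tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(K)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the two matrices from the final counts.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    return theta, phi, vocab


# Hand-tokenised toy documents (an assumption; no preprocessing is shown here).
docs = [
    ["eat", "broccoli", "bananas"],
    ["banana", "spinach", "smoothie", "breakfast"],
    ["chinchillas", "kittens", "cute"],
    ["sister", "adopted", "kitten"],
    ["cute", "hamster", "munching", "broccoli"],
]
theta, phi, vocab = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` and each row of `phi` sums to 1. On a corpus this small the topics found depend heavily on the seed and the hyperparameters, so the output should not be expected to match any particular hand-made split.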


[Flow diagram: Documents → Preprocessing → LDA (Gibbs sampling) → 𝜃 and 𝜙; 𝜃 → Assign initial clusters; 𝜙 → Assign weights; both feed Refine clusters]


Our DWKM algorithm

• K-means based algorithm
• Use LDA to get two matrices
• Use the document-topic matrix to initialise the clusters
• Repeat
  – Calculate the centroid of each cluster
  – Assign each document to the nearest centroid
    • The distance measure is weighted by the term-topic matrix
• Until convergence
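The loop above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the function name `dwkm_sketch` is hypothetical, one cluster per LDA topic is assumed, each document starts in the cluster of its most probable topic, and the distance is taken to be a squared Euclidean distance in which term v is weighted by its probability under the cluster's topic.

```python
def dwkm_sketch(X, theta, phi, max_iters=100):
    """K-means style refinement with topic-weighted distances (illustrative).

    X     : document-term vectors, one list of numbers per document
    theta : document-topic matrix, theta[d][k]
    phi   : topic-term matrix, phi[k][v], used as per-cluster term weights
    """
    n_clusters = len(phi)
    n_terms = len(phi[0])

    # Initialise: each document joins the cluster of its most probable topic.
    labels = [max(range(n_clusters), key=lambda k: row[k]) for row in theta]
    centroids = [[0.0] * n_terms for _ in range(n_clusters)]

    def weighted_dist(x, c, w):
        # Assumed distance: squared differences weighted term by term.
        return sum(wv * (xv - cv) ** 2 for xv, cv, wv in zip(x, c, w))

    for _ in range(max_iters):
        # Calculate the centroid of each non-empty cluster.
        for k in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        # Assign each document to the nearest centroid under the weights.
        new_labels = [
            min(range(n_clusters),
                key=lambda k: weighted_dist(x, centroids[k], phi[k]))
            for x in X
        ]
        if new_labels == labels:  # convergence: no document changed cluster
            break
        labels = new_labels
    return labels, centroids


# Toy input (made up): four documents over four terms, two topics.
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.6, 0.4]]
phi = [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]
labels, centroids = dwkm_sketch(X, theta, phi)
```

In this toy input the last document starts in the first cluster because of its topic shares, and the weighted reassignment step moves it to the second cluster, where its term counts fit better.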


New distance measure

Weights: word probability in a topic
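The formula from this slide is not part of the transcript. Purely as an illustration of a distance weighted by word probabilities, and not as the authors' definition, one possible form is a weighted squared Euclidean distance between document x_i and centroid c_k, where φ_{k,v} is the probability of term v in the topic associated with cluster k:

```latex
d(x_i, c_k) = \sum_{v=1}^{V} \phi_{k,v}\,\bigl(x_{i,v} - c_{k,v}\bigr)^{2}
```

Under this assumed form, terms that are unlikely in a topic contribute little to the distance to that topic's cluster, which is what makes the subspace "soft".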


[Diagram: Subspace clustering splits into hard subspace clustering and soft subspace clustering. For soft subspace clustering, two approaches are compared.]

Common approach

• Randomly assign feature weights
• Randomly assign documents to clusters
• Refine clusters
• Refine feature weights

Our new approach

• LDA supplies semantic information
• Feature weighting
• Initial cluster estimation
• Refine clusters using feature weights


Experiments

• Data sets
  – 4 synthetic data sets
  – 6 real data sets
• Evaluation parameters
  – Accuracy
  – F-measure
  – NMI (Normalised Mutual Information)
  – Entropy
• Compared with
  – K-means, LDA as a clustering method, FWKM, EWKM, FGKM
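Two of the evaluation measures can be computed directly from the true class labels and the predicted cluster labels. The sketch below is a generic illustration, not the evaluation code used in the experiments: the function names are hypothetical, NMI is normalised by the arithmetic mean of the two entropies (one common choice among several), and entropy is taken as the size-weighted class entropy within each cluster.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    mi = sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy_of(true_labels) + entropy_of(pred_labels)) / 2
    # Convention assumed here: two trivial single-group labelings agree.
    return mi / denom if denom > 0 else 1.0


def cluster_entropy(true_labels, pred_labels):
    """Size-weighted average class entropy inside each cluster (lower is better)."""
    n = len(true_labels)
    total = 0.0
    for cluster, size in Counter(pred_labels).items():
        members = [t for t, p in zip(true_labels, pred_labels) if p == cluster]
        total += (size / n) * entropy_of(members)
    return total


truth = ["food", "food", "pets", "pets"]
perfect = [0, 0, 1, 1]      # clusters match the classes exactly
unrelated = [0, 1, 0, 1]    # clusters carry no information about the classes
```

A clustering that matches the classes gives NMI close to 1 and entropy close to 0, while a clustering independent of the classes gives NMI close to 0.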


Results

Acc = accuracy, F-M = F-measure

Dataset  Metric  K-means  LDA   FWKM  EWKM  FGKM  DWKM
SD1      Acc     0.65     0.66  0.77  0.69  0.82  0.87
SD1      F-M     0.63     0.65  0.73  0.59  0.75  0.81
SD2      Acc     0.63     0.68  0.76  0.72  0.87  0.92
SD2      F-M     0.64     0.69  0.75  0.63  0.82  0.88
SD3      Acc     0.62     0.64  0.67  0.70  0.94  0.94
SD3      F-M     0.62     0.63  0.64  0.59  0.91  0.92
SD4      Acc     0.60     0.61  0.61  0.69  0.91  0.93
SD4      F-M     0.59     0.60  0.60  0.58  0.88  0.90


Results


Conclusion

• A new soft subspace clustering algorithm
• A new distance measure
• Apply LDA to get semantic information
• Improved performance


Future work

• Non-parametric LDA model
  – No need to give the number of topics
• Reduce computational complexity
• Use LDA to generate different candidate clustering solutions for clustering ensembles
