Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group
Department of Computer Science and Technology
Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn
Presented at ICDM 2010, Sydney, Australia
Outline
• Introduction
• Problem Definition
• Lossless Term Filtering
• Lossy Term Filtering
• Experiments
• Conclusion
Introduction
• Term-document correlation matrix
– X: a co-occurrence matrix, where X_ij is the number of times that term t_i occurs in document d_j
– Applied in various text mining tasks
• text classification [Dasgupta et al., 2007]
• text clustering [Blei et al., 2003]
• information retrieval [Wei and Croft, 2006]
– Disadvantage
• high computational cost and intensive memory access
How can we overcome this disadvantage?
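For concreteness, here is a minimal sketch (not from the slides) of how such a term-document count matrix can be built; the toy corpus, the variable names, and the use of scikit-learn's CountVectorizer are illustrative assumptions:

```python
# A minimal sketch (not from the paper): building a term-document
# co-occurrence matrix X, where X[i, j] counts occurrences of term i
# in document j, using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["A A B", "A A B"]          # the toy corpus used later in the talk
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+")
X_docs_by_terms = vectorizer.fit_transform(docs)   # shape: |D| x |V|
X = X_docs_by_terms.T.toarray()                    # transpose to |V| x |D|

print(vectorizer.get_feature_names_out())  # ['A' 'B']
print(X)                                   # [[2 2], [1 1]]
```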
Introduction (cont'd)
• Three general methods:
– Improve the algorithm itself
– High-performance platforms
• multicore processors and distributed machines [Chu et al., 2006]
– Feature selection approaches [Yi, 2003]
• We follow this line.
Introduction (cont'd)
• Basic assumption
– Each term captures more or less information.
• Our goal
– A subspace of features (terms) with minimal information loss.
• Conventional solution:
– Step 1: Measure each individual term, or its dependency on a group of terms.
• e.g., information gain or mutual information.
– Step 2: Remove features with low importance scores or high redundancy scores.
Limitations
- Is there a generic definition of the information loss for a single term, and for a set of terms, under an arbitrary metric?
- What is the information loss of each individual document, and of a set of documents?
Why should we consider the information loss on documents?
Introduction (cont'd)
• Consider a simple example:
– Doc 1 = "A A B", Doc 2 = "A A B"; across the two documents, term A has occurrence vector (2, 2) and term B has occurrence vector (1, 1).
• Question: can A and B be substituted by a single term S without losing information?
– Considering only the information loss for terms:
• YES! If the distance is defined by a state-of-the-art pseudometric such as cosine similarity, the vectors (2, 2) and (1, 1) are parallel, so merging them costs nothing.
– Considering both terms and documents:
• NO! Both documents emphasize A more than B, so merging them is an information loss for the documents!
What we have done:
- Defined information loss for both terms and documents
- Formulated the Term Filtering with Bounded Error problem
- Developed efficient algorithms
Problem Definition - Superterm
• We want
– less information loss
– a smaller term space
• Idea: group terms into superterms!
• Superterm: a single merged term that represents a group of original terms; its entry X̃_kj gives the number of occurrences of superterm s_k in document d_j.
Problem Definition – Information loss
• How much information is lost during term merging?
– Measured with user-specified distance measures
• e.g., the Euclidean metric or cosine similarity.
– When a superterm substitutes a set of terms, we consider
• the information loss of the terms, and
• the information loss of the documents, measured on the transformed representation of each document after merging.
[Figure: Doc 1 and Doc 2 both contain "A A B". Occurrence table: A = (2, 2), B = (1, 1). Merging A and B into superterm S gives S = (1.5, 1.5), with the resulting term error and document error indicated.]
• The superterm's occurrence value can be chosen with different methods: winner-take-all, average-occurrence, etc.
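To make the toy example concrete, here is a small sketch (an assumed formulation for illustration, not the paper's exact loss definitions) comparing the term-side and document-side cosine distances after an average-occurrence merge:

```python
# Illustration only (assumed formulation, not the paper's exact definitions):
# merge terms A and B into superterm S by average occurrence and compare
# cosine distances on the term vectors and on the document vectors.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([2.0, 2.0])   # occurrences of A in (Doc 1, Doc 2)
B = np.array([1.0, 1.0])   # occurrences of B in (Doc 1, Doc 2)
S = (A + B) / 2            # average-occurrence superterm: (1.5, 1.5)

# Term-side loss: A and B are both parallel to S, so the cosine distance is ~0.
print(cosine_distance(A, S), cosine_distance(B, S))   # ~0.0 ~0.0

# Document-side loss: Doc 1 was (A=2, B=1); after merging it becomes (1.5, 1.5).
doc1_before = np.array([2.0, 1.0])
doc1_after = np.array([1.5, 1.5])
print(cosine_distance(doc1_before, doc1_after))       # > 0: the documents lose information
```

The term-side distances vanish because A and B are parallel, while the document-side distance is strictly positive, which is exactly the asymmetry the example is meant to expose.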
Problem Definition – TFBE
• Term Filtering with Bounded Error (TFBE)
– Minimize the number of superterms subject to the following constraints:
(1) every term is assigned to some superterm by the mapping function φ from terms to superterms (the superterms cover the vocabulary);
(2) for all terms, the information loss of the term is bounded by the user-specified error ε_W;
(3) for all documents, the information loss of the document is bounded by the user-specified error ε_D.
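Written out (a reconstruction from the constraints above; the vector notation x_w for terms and y_d for documents is an assumption of this sketch):

```latex
\begin{aligned}
\min_{\mathcal{S},\,\phi}\quad & |\mathcal{S}| \\
\text{s.t.}\quad & \bigcup_{s \in \mathcal{S}} \phi^{-1}(s) = V, \\
& d\big(\mathbf{x}_w,\ \tilde{\mathbf{x}}_{\phi(w)}\big) \le \epsilon_W \quad \text{for all } w \in V, \\
& d\big(\mathbf{y}_d,\ \tilde{\mathbf{y}}_d\big) \le \epsilon_D \quad \text{for all } d \in D.
\end{aligned}
```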
Lossless Term Filtering
• Special case: ε_W = 0 and ε_D = 0
– NO information loss for any term or document
• Theorem: the exact optimal solution of the problem is yielded by grouping terms with the same vector representation.
• Algorithm (see the sketch after this slide)
– Step 1: find "local" superterms for each document
• terms with the same number of occurrences within the document
– Step 2: add, split, or remove superterms from the global superterm set
• Is this case applicable in practice? YES! On the Baidu Baike dataset (1,531,215 documents), the vocabulary size drops from 1,522,576 to 714,392, a reduction of more than 50%.
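A minimal sketch of the lossless case, following the theorem directly (group terms whose document-occurrence vectors are identical) rather than the two-step local/global algorithm on the slide; the function name and toy matrix are illustrative:

```python
# Lossless term filtering sketch: group terms whose occurrence vectors are
# identical across all documents; each group becomes one superterm.
# This realizes the theorem directly (exact, zero error), not the paper's
# incremental local/global algorithm.
from collections import defaultdict
import numpy as np

def lossless_superterms(X, terms):
    """X: |V| x |D| term-document count matrix; terms: list of term strings."""
    groups = defaultdict(list)
    for i, term in enumerate(terms):
        groups[tuple(X[i])].append(term)      # identical rows share a key
    return list(groups.values())

X = np.array([[2, 2],    # A
              [1, 1],    # B
              [1, 1]])   # C  (same vector as B -> merged losslessly)
print(lossless_superterms(X, ["A", "B", "C"]))   # [['A'], ['B', 'C']]
```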
Lossy Term Filtering
• General case (ε_W > 0 and ε_D > 0): NP-hard!
• A greedy algorithm (a rough sketch follows this slide)
– Step 1: consider the constraint given by ε_W.
• Similar to the greedy set-cover algorithm
– Step 2: verify the validity of the candidates with ε_D.
• Computational complexity
– the number of iterations is much smaller than |V|
– the distance computations can be parallelized
• Further reduce the size of candidate superterms? Locality-sensitive hashing (LSH)
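A rough sketch of a greedy, set-cover-style grouping under a term-error bound with a document-error check; the concrete loss definitions (Euclidean distance, average-occurrence merge) and the seed-based candidate selection are assumptions of this sketch, not necessarily the authors' exact algorithm:

```python
# Rough sketch of a greedy term grouping with a term-error bound eps_w and a
# document-error check eps_d. Euclidean distance and average-occurrence
# merging are illustrative choices.
import numpy as np

def greedy_term_filtering(X, eps_w, eps_d):
    """X: |V| x |D| term-document matrix. Returns a list of term-index groups."""
    n_terms, _ = X.shape
    uncovered = set(range(n_terms))
    groups = []
    while uncovered:
        seed = next(iter(uncovered))
        # Step 1: candidate group = uncovered terms within eps_w of the seed.
        candidate = [i for i in uncovered
                     if np.linalg.norm(X[i] - X[seed]) <= eps_w]
        # Step 2: verify the document-side error under an average-occurrence merge.
        merged = X[candidate].mean(axis=0)
        X_after = X.astype(float).copy()
        X_after[candidate] = merged
        doc_errors = np.linalg.norm(X - X_after, axis=0)   # per-document error
        if np.all(doc_errors <= eps_d) and len(candidate) > 1:
            groups.append(candidate)
            uncovered -= set(candidate)
        else:
            groups.append([seed])          # keep the seed as its own superterm
            uncovered.discard(seed)
    return groups

X = np.array([[2, 2], [1, 1], [1, 1], [5, 0]], dtype=float)
print(greedy_term_filtering(X, eps_w=0.5, eps_d=1.0))   # e.g. [[0], [1, 2], [3]]
```

A closer match to the greedy-cover idea would, at each step, pick the candidate group that covers the most still-uncovered terms rather than an arbitrary seed.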
Efficient Candidate Superterm Generation
• LSH (locality-sensitive hashing)
• Basic idea: points that are close to each other in the original space receive the same hash value (with several randomized hash projections).
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
• Hash function for the Euclidean distance [Datar et al., 2004]: h(v) = ⌊(a · v + b) / r⌋, where
– a is a projection vector (drawn from the Gaussian distribution)
– r is the width of the buckets (assigned empirically)
– b is an offset drawn uniformly from [0, r)
• Now search near neighbors (within the error bound) in all the buckets! (A sketch follows this slide.)
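A minimal sketch of this Euclidean (p-stable) LSH scheme; the parameter values and the toy matrix are illustrative:

```python
# Minimal sketch of the Euclidean (p-stable) LSH of Datar et al. (2004):
# terms whose vectors are close tend to fall into the same bucket, so only
# within-bucket pairs need an exact distance check.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def lsh_buckets(X, n_projections=4, bucket_width=2.0):
    """X: |V| x |D| term vectors. Returns {hash key: [term indices]}."""
    d = X.shape[1]
    A = rng.normal(size=(n_projections, d))                  # Gaussian projection vectors a
    b = rng.uniform(0.0, bucket_width, size=n_projections)   # offsets in [0, r)
    keys = np.floor((X @ A.T + b) / bucket_width).astype(int)  # h(v) = floor((a.v + b) / r)
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    return buckets

X = np.array([[2, 2], [1, 1], [1.1, 0.9], [50, 0]], dtype=float)
for key, members in lsh_buckets(X).items():
    print(key, members)   # nearby terms (e.g., indices 1 and 2) tend to share a bucket
```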
Experiments
• Settings
– Datasets
• Academic: ArnetMiner (10,768 papers and 8,212 terms)
• 20-Newsgroups (18,774 postings and 61,188 terms)
– Baselines
• Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)
• Supervised method (only on classification): Chi-statistic (CHIMAX)
Experiments (cont'd)
• Filtering results with the Euclidean metric
– Findings
• As the error bounds increase, the term ratio decreases.
• With the same bound for terms, a larger bound for documents yields a lower term ratio.
• With the same bound for documents, a larger bound for terms yields a lower term ratio.
Experiments (cont'd)
• The information loss in terms of error
[Figures: average errors and maximum errors]
Experiments (cont'd)
• Efficiency of the parallel algorithm and LSH*
* Optimal result among different choices of the tuned parameter.
Experiments (cont'd)
• Applications
– Clustering (Arnet)
– Classification (20NG)
– Document retrieval (Arnet)
Conclusion
• Formally defined the term filtering problem and performed a theoretical investigation
• Developed efficient algorithms for the lossless and lossy settings
• Validated the approach through an extensive set of experiments on multiple real-world text mining applications
Thank you!
References
• Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
• Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06, pp. 281-288.
• Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007). Feature Selection Methods for Text Classification. In KDD '07, pp. 230-239.
• Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SCG '04, pp. 253-262.
• Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06, pp. 178-185.
• Yi, L. (2003). Web Page Cleaning for Web Mining through Feature Weighting. In IJCAI '03, pp. 43-50.