Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group
Department of Computer Science and Technology
Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn
Presented at ICDM 2010, Sydney, Australia
Outline
• Introduction
• Problem Definition
• Lossless Term Filtering
• Lossy Term Filtering
• Experiments
• Conclusion
Introduction
• Term-document correlation matrix
– X: a co-occurrence matrix, where X_ij is the number of times that term t_i occurs in document d_j
– Applied in various text mining tasks
• text classification [Dasgupta et al., 2007]
• text clustering [Blei et al., 2003]
• information retrieval [Wei and Croft, 2006]
– Disadvantage
• high computational cost and intensive memory access
How can we overcome this disadvantage?
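For concreteness, here is a minimal sketch (not from the slides) of how such a term-document count matrix can be built; the toy corpus, the variable names, and the use of scikit-learn's CountVectorizer are illustrative assumptions:

```python
# A minimal sketch (not from the paper): building a term-document
# co-occurrence matrix X, where X[i, j] counts occurrences of term i
# in document j, using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["A A B", "A A B"]          # the toy corpus used later in the talk
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+")
X_docs_by_terms = vectorizer.fit_transform(docs)   # shape: |D| x |V|
X = X_docs_by_terms.T.toarray()                    # transpose to |V| x |D|

print(vectorizer.get_feature_names_out())  # ['A' 'B']
print(X)                                   # [[2 2], [1 1]]
```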
Introduction (cont'd)
• Three general methods:
– Improve the algorithm itself
– High-performance platforms
• multicore processors and distributed machines [Chu et al., 2006]
– Feature selection approaches [Yi, 2003]
• We follow this line.
Introduction (cont'd)
• Basic assumption
– Each term captures more or less information.
• Our goal
– A subspace of features (terms) with minimal information loss.
• Conventional solution:
– Step 1: Measure each individual term, or its dependency on a group of terms.
• e.g., information gain or mutual information.
– Step 2: Remove features with low importance scores or high redundancy scores.
Limitations
- Is there a generic definition of the information loss for a single term, and for a set of terms, under an arbitrary metric?
- What is the information loss of each individual document, and of a set of documents?
Why should we consider the information loss on documents?
Introduction (cont'd)
• Consider a simple example:
– Doc 1 = "A A B", Doc 2 = "A A B"; across the two documents, term A has occurrence vector (2, 2) and term B has occurrence vector (1, 1).
• Question: can A and B be substituted by a single term S without losing information?
– Considering only the information loss for terms:
• YES! If the distance is defined by a state-of-the-art pseudometric such as cosine similarity, the vectors (2, 2) and (1, 1) are parallel, so merging them costs nothing.
– Considering both terms and documents:
• NO! Both documents emphasize A more than B, so merging them is an information loss for the documents!
What we have done:
- Defined information loss for both terms and documents
- Formulated the Term Filtering with Bounded Error problem
- Developed efficient algorithms
Problem Definition - Superterm
• We want
– less information loss
– a smaller term space
• Idea: group terms into superterms!
• Superterm: a single merged term that represents a group of original terms; its entry X̃_kj gives the number of occurrences of superterm s_k in document d_j.
Problem Definition – Information loss
• How much information is lost during term merging?
– Measured with user-specified distance measures
• e.g., the Euclidean metric or cosine similarity.
– When a superterm substitutes a set of terms, we consider
• the information loss of the terms, and
• the information loss of the documents, measured on the transformed representation of each document after merging.
[Figure: Doc 1 and Doc 2 both contain "A A B". Occurrence table: A = (2, 2), B = (1, 1). Merging A and B into superterm S gives S = (1.5, 1.5), with the resulting term error and document error indicated.]
• The superterm's occurrence value can be chosen with different methods: winner-take-all, average-occurrence, etc.
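To make the toy example concrete, here is a small sketch (an assumed formulation for illustration, not the paper's exact loss definitions) comparing the term-side and document-side cosine distances after an average-occurrence merge:

```python
# Illustration only (assumed formulation, not the paper's exact definitions):
# merge terms A and B into superterm S by average occurrence and compare
# cosine distances on the term vectors and on the document vectors.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([2.0, 2.0])   # occurrences of A in (Doc 1, Doc 2)
B = np.array([1.0, 1.0])   # occurrences of B in (Doc 1, Doc 2)
S = (A + B) / 2            # average-occurrence superterm: (1.5, 1.5)

# Term-side loss: A and B are both parallel to S, so the cosine distance is ~0.
print(cosine_distance(A, S), cosine_distance(B, S))   # ~0.0 ~0.0

# Document-side loss: Doc 1 was (A=2, B=1); after merging it becomes (1.5, 1.5).
doc1_before = np.array([2.0, 1.0])
doc1_after = np.array([1.5, 1.5])
print(cosine_distance(doc1_before, doc1_after))       # > 0: the documents lose information
```

The term-side distances vanish because A and B are parallel, while the document-side distance is strictly positive, which is exactly the asymmetry the example is meant to expose.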
Problem Definition – TFBE
• Term Filtering with Bounded Error (TFBE)
– Minimize the number of superterms subject to the following constraints:
(1) every term is assigned to some superterm by the mapping function φ from terms to superterms (the superterms cover the vocabulary);
(2) for all terms, the information loss of the term is bounded by the user-specified error ε_W;
(3) for all documents, the information loss of the document is bounded by the user-specified error ε_D.
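Written out (a reconstruction from the constraints above; the vector notation x_w for terms and y_d for documents is an assumption of this sketch):

```latex
\begin{aligned}
\min_{\mathcal{S},\,\phi}\quad & |\mathcal{S}| \\
\text{s.t.}\quad & \bigcup_{s \in \mathcal{S}} \phi^{-1}(s) = V, \\
& d\big(\mathbf{x}_w,\ \tilde{\mathbf{x}}_{\phi(w)}\big) \le \epsilon_W \quad \text{for all } w \in V, \\
& d\big(\mathbf{y}_d,\ \tilde{\mathbf{y}}_d\big) \le \epsilon_D \quad \text{for all } d \in D.
\end{aligned}
```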
Lossless Term Filtering
• Special case: ε_W = 0 and ε_D = 0
– NO information loss for any term or document
• Theorem: the exact optimal solution of the problem is yielded by grouping terms with the same vector representation.
• Algorithm (see the sketch after this slide)
– Step 1: find "local" superterms for each document
• terms with the same number of occurrences within the document
– Step 2: add, split, or remove superterms from the global superterm set
• Is this case applicable in practice? YES! On the Baidu Baike dataset (1,531,215 documents), the vocabulary size drops from 1,522,576 to 714,392, a reduction of more than 50%.
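A minimal sketch of the lossless case, following the theorem directly (group terms whose document-occurrence vectors are identical) rather than the two-step local/global algorithm on the slide; the function name and toy matrix are illustrative:

```python
# Lossless term filtering sketch: group terms whose occurrence vectors are
# identical across all documents; each group becomes one superterm.
# This realizes the theorem directly (exact, zero error), not the paper's
# incremental local/global algorithm.
from collections import defaultdict
import numpy as np

def lossless_superterms(X, terms):
    """X: |V| x |D| term-document count matrix; terms: list of term strings."""
    groups = defaultdict(list)
    for i, term in enumerate(terms):
        groups[tuple(X[i])].append(term)      # identical rows share a key
    return list(groups.values())

X = np.array([[2, 2],    # A
              [1, 1],    # B
              [1, 1]])   # C  (same vector as B -> merged losslessly)
print(lossless_superterms(X, ["A", "B", "C"]))   # [['A'], ['B', 'C']]
```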
Lossy Term Filtering
• General case (ε_W > 0 and ε_D > 0): NP-hard!
• A greedy algorithm (a rough sketch follows this slide)
– Step 1: consider the constraint given by ε_W.
• Similar to the greedy set-cover algorithm
– Step 2: verify the validity of the candidates with ε_D.
• Computational complexity
– the number of iterations is much smaller than |V|
– the distance computations can be parallelized
• Further reduce the size of candidate superterms? Locality-sensitive hashing (LSH)
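A rough sketch of a greedy, set-cover-style grouping under a term-error bound with a document-error check; the concrete loss definitions (Euclidean distance, average-occurrence merge) and the seed-based candidate selection are assumptions of this sketch, not necessarily the authors' exact algorithm:

```python
# Rough sketch of a greedy term grouping with a term-error bound eps_w and a
# document-error check eps_d. Euclidean distance and average-occurrence
# merging are illustrative choices.
import numpy as np

def greedy_term_filtering(X, eps_w, eps_d):
    """X: |V| x |D| term-document matrix. Returns a list of term-index groups."""
    n_terms, _ = X.shape
    uncovered = set(range(n_terms))
    groups = []
    while uncovered:
        seed = next(iter(uncovered))
        # Step 1: candidate group = uncovered terms within eps_w of the seed.
        candidate = [i for i in uncovered
                     if np.linalg.norm(X[i] - X[seed]) <= eps_w]
        # Step 2: verify the document-side error under an average-occurrence merge.
        merged = X[candidate].mean(axis=0)
        X_after = X.astype(float).copy()
        X_after[candidate] = merged
        doc_errors = np.linalg.norm(X - X_after, axis=0)   # per-document error
        if np.all(doc_errors <= eps_d) and len(candidate) > 1:
            groups.append(candidate)
            uncovered -= set(candidate)
        else:
            groups.append([seed])          # keep the seed as its own superterm
            uncovered.discard(seed)
    return groups

X = np.array([[2, 2], [1, 1], [1, 1], [5, 0]], dtype=float)
print(greedy_term_filtering(X, eps_w=0.5, eps_d=1.0))   # e.g. [[0], [1, 2], [3]]
```

A closer match to the greedy-cover idea would, at each step, pick the candidate group that covers the most still-uncovered terms rather than an arbitrary seed.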
Efficient Candidate Superterm Generation
• LSH (locality-sensitive hashing)
• Basic idea: points that are close to each other in the original space receive the same hash value (with several randomized hash projections).
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
• Hash function for the Euclidean distance [Datar et al., 2004]: h(v) = ⌊(a · v + b) / r⌋, where
– a is a projection vector (drawn from the Gaussian distribution)
– r is the width of the buckets (assigned empirically)
– b is an offset drawn uniformly from [0, r)
• Now search near neighbors (within the error bound) in all the buckets! (A sketch follows this slide.)
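A minimal sketch of this Euclidean (p-stable) LSH scheme; the parameter values and the toy matrix are illustrative:

```python
# Minimal sketch of the Euclidean (p-stable) LSH of Datar et al. (2004):
# terms whose vectors are close tend to fall into the same bucket, so only
# within-bucket pairs need an exact distance check.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def lsh_buckets(X, n_projections=4, bucket_width=2.0):
    """X: |V| x |D| term vectors. Returns {hash key: [term indices]}."""
    d = X.shape[1]
    A = rng.normal(size=(n_projections, d))                  # Gaussian projection vectors a
    b = rng.uniform(0.0, bucket_width, size=n_projections)   # offsets in [0, r)
    keys = np.floor((X @ A.T + b) / bucket_width).astype(int)  # h(v) = floor((a.v + b) / r)
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)
    return buckets

X = np.array([[2, 2], [1, 1], [1.1, 0.9], [50, 0]], dtype=float)
for key, members in lsh_buckets(X).items():
    print(key, members)   # nearby terms (e.g., indices 1 and 2) tend to share a bucket
```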
Experiments
• Settings
– Datasets
• Academic: ArnetMiner (10,768 papers and 8,212 terms)
• 20-Newsgroups (18,774 postings and 61,188 terms)
– Baselines
• Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)
• Supervised method (only on classification): Chi-statistic (CHIMAX)
Experiments (cont'd)
• Filtering results with the Euclidean metric
– Findings
• As the error bounds increase, the term ratio decreases.
• With the same bound for terms, a larger bound for documents yields a lower term ratio.
• With the same bound for documents, a larger bound for terms yields a lower term ratio.
Experiments (cont'd)
• The information loss in terms of error
[Figures: average errors and maximum errors]
Experiments (cont'd)
• Efficiency of the parallel algorithm and LSH*
* Optimal result among different choices of the tuned parameter.
Experiments (cont'd)
• Applications
– Clustering (Arnet)
– Classification (20NG)
– Document retrieval (Arnet)
Conclusion
• Formally defined the term filtering problem and performed a theoretical investigation
• Developed efficient algorithms for the lossless and lossy settings
• Validated the approach through an extensive set of experiments on multiple real-world text mining applications
Thank you!
References
• Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
• Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06, pp. 281-288.
• Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007). Feature Selection Methods for Text Classification. In KDD '07, pp. 230-239.
• Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SCG '04, pp. 253-262.
• Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06, pp. 178-185.
• Yi, L. (2003). Web Page Cleaning for Web Mining through Feature Weighting. In IJCAI '03, pp. 43-50.