Introduction Clustering Alignment
Doctoral Seminar: Multi-document clustering and alignment
Wim De Smet
March 23, 2007
Current goals
CLASS, WP7
1. Cluster documents according to topics.
2. Align text and video
Goal
Given news stories about different events, from several sources, cluster together the stories that report the same event.
Clustering
Typical clustering algorithms use a bag-of-words approach, operating on a document-by-word matrix:

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
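The block structure of this toy matrix is exactly what bag-of-words clustering exploits: documents with similar word profiles have high cosine similarity between their rows. A small NumPy sketch, using the matrix above (illustration only):

```python
import numpy as np

# The example document-by-word matrix from the slide: rows are documents,
# columns are words. Documents 1-3 use words 1-3, documents 4-5 use
# words 4-6, and the last two documents mix both word groups.
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

# Cosine similarity between document rows exposes the block structure:
# documents in the same block are near 1, documents in different blocks
# with disjoint vocabularies are exactly 0.
norms = np.linalg.norm(A, axis=1, keepdims=True)
sim = (A / norms) @ (A / norms).T
```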
Clustering
Document clustering according to word-similarity:

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
Clustering
Word clustering according to document-similarity:

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
Co-clustering
Purpose: simultaneously clustering words and documents, preserving information found in both clusterings.

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
Co-clustering
Purpose: simultaneously clustering words and documents, preserving information found in both clusterings.

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4

The co-clustering reduces A to a cluster-level matrix, one row per document cluster and one column per word cluster:

    0.5  0
    0    0.5
    0.4  0.4
Hierarchical Co-clustering
Hierarchical co-clustering:
1. Co-cluster documents and words.
2. For each cluster: if it contains too many documents, compute its sub-matrix.
3. Repeat step 1 on the sub-matrix.
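A minimal recursive driver for these steps (a sketch only; `co_cluster` stands in for any flat co-clustering routine, such as the spectral method on the following slides, and is assumed to return one `(doc_indices, word_indices)` pair per co-cluster):

```python
import numpy as np

def hierarchical_co_cluster(A, co_cluster, max_docs):
    """Steps 1-3 above: co-cluster, then re-run the flat routine on the
    sub-matrix of every cluster that still holds too many documents.
    Returns the leaf co-clusters as (doc indices, word indices) pairs,
    indices referring to the original matrix A."""
    leaves = []
    stack = [(A, np.arange(A.shape[0]), np.arange(A.shape[1]))]
    while stack:
        M, docs, words = stack.pop()
        for d, w in co_cluster(M):
            # Recurse only if the cluster is too large AND strictly smaller
            # than its parent (guards against infinite recursion when the
            # flat routine returns everything as one cluster).
            if len(d) > max_docs and len(d) < len(docs):
                stack.append((M[np.ix_(d, w)], docs[d], words[w]))
            else:
                leaves.append((docs[d], words[w]))
    return leaves
```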
Bipartite Spectral Graph Partitioning: motivation
View document-by-word matrix as bipartite graph
A =
                word1  word2  word3
    document1   a1,1   0      0
    document2   0      a2,2   a2,3
    document3   a3,1   a3,2   0
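In this bipartite view, documents and words form the two vertex sets, and A is the off-diagonal block of the graph's full adjacency matrix. A small sketch with placeholder edge weights (not values from the slides):

```python
import numpy as np

# Placeholder edge weights a_ij: edge (document i, word j) carries weight
# A[i, j]; the graph has no document-document or word-word edges.
A = np.array([[0.7, 0.0, 0.0],
              [0.0, 0.5, 0.3],
              [0.4, 0.2, 0.0]])

m, n = A.shape
W = np.block([[np.zeros((m, m)), A],      # document-document block: empty
              [A.T, np.zeros((n, n))]])   # word-word block: empty
```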
Bipartite Spectral Graph Partitioning: motivation
Divide the graph into document clusters Dm and associated word clusters Wm?
Bipartite Spectral Graph Partitioning: motivation
Wm = { wj : ∑i∈Dm Aij ≥ ∑i∈Dl Aij, ∀ l = 1, …, k }
Bipartite Spectral Graph Partitioning: motivation
Dm = { di : ∑j∈Wm Aij ≥ ∑j∈Wl Aij, ∀ l = 1, …, k }
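These two definitions can be read as an assignment rule: given a document partition, each word joins the cluster where it carries the most total weight, and symmetrically for documents. A minimal sketch of the rule (an illustration of the definitions, not the partitioning algorithm itself):

```python
import numpy as np

def word_clusters(A, doc_labels, k):
    # weight[m, j] = sum of A[i, j] over documents i in cluster m
    # (the sum in the definition of W_m above).
    weight = np.vstack([A[doc_labels == m].sum(axis=0) for m in range(k)])
    return weight.argmax(axis=0)   # each word joins its heaviest cluster

def doc_clusters(A, word_labels, k):
    # weight[i, m] = sum of A[i, j] over words j in cluster m
    # (the sum in the definition of D_m above).
    weight = np.column_stack([A[:, word_labels == m].sum(axis=1)
                              for m in range(k)])
    return weight.argmax(axis=1)
```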
Bipartite Spectral Graph Partitioning: algorithm
1. Given the m × n document-by-word matrix A, calculate diagonal helper matrices D1 and D2 such that:

   D1(i, i) = ∑j Aij   for all 1 ≤ i ≤ m
   D2(j, j) = ∑i Aij   for all 1 ≤ j ≤ n

2. Compute An = D1^(−1/2) · A · D2^(−1/2)
3. Take the SVD of An: SVD(An) = U Λ V*
4. Determine k, the number of clusters, by the eigengap:
   k = arg max(m ≥ i > 1) (λi−1 − λi) / λi−1,
   where λ1 ≥ λ2 ≥ · · · ≥ λm are the singular values of An
Bipartite Spectral Graph Partitioning: algorithm (cont.)
5. From U and V, obtain U[2,…,l+1] and V[2,…,l+1] respectively by taking columns 2 to l + 1, where l = ⌈log2 k⌉
6. Compute
   Z = [ D1^(−1/2) U[2,…,l+1] ]
       [ D2^(−1/2) V[2,…,l+1] ]
   and normalize the rows of Z
7. Apply k-means to cluster the rows of Z into k clusters
8. Check for each cluster the number of documents. If this is higher than a given threshold, construct a new document-by-word matrix formed by the documents and words in the cluster, and return to step 1
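Steps 1–7 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the experiments: the eigengap selection of step 4 and the recursion of step 8 are left out, so k is passed in explicitly, and `_kmeans` is a tiny deterministic stand-in for a real k-means routine.

```python
import numpy as np

def _kmeans(X, k, iters=20):
    # Tiny k-means stand-in for step 7; initialized with evenly spaced rows.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_co_cluster(A, k):
    """Bipartite spectral co-clustering of an m-by-n document-by-word
    matrix A into k clusters; returns (document labels, word labels)."""
    m, n = A.shape
    d1 = A.sum(axis=1)                      # step 1: diagonal of D1
    d2 = A.sum(axis=0)                      #         diagonal of D2
    An = A / np.sqrt(np.outer(d1, d2))      # step 2: D1^-1/2 A D2^-1/2
    U, s, Vt = np.linalg.svd(An)            # step 3: SVD(An) = U Lambda V*
    l = int(np.ceil(np.log2(k)))            # step 5: l = ceil(log2 k)
    Z = np.vstack([U[:, 1:l + 1] / np.sqrt(d1)[:, None],    # step 6
                   Vt.T[:, 1:l + 1] / np.sqrt(d2)[:, None]])
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    labels = _kmeans(Z, k)                  # step 7: cluster rows of Z
    return labels[:m], labels[m:]
```

Because documents and words are embedded in the same space Z, each k-means cluster contains both, which is exactly what makes this a co-clustering.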
Uses of hierarchical co-clustering
• Documents are clustered according to a topic hierarchy
• Words associated with a cluster describe its topic
• Words can be used for offline clustering
Entries of document-by-word matrix
1. TF-IDF
2. WP 2’s Salience
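Option 1 can be sketched with one standard tf·idf variant (length-normalized term frequency, logarithmic idf); the exact weighting used in the experiments may differ, and WP 2's salience scores would simply replace these entries:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, rows), where
    rows[i][j] is the TF-IDF weight of word j in document i."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # tf normalized by document length; idf = log(N / df).
        # Note: a word occurring in every document gets weight 0.
        rows.append([tf[w] / len(doc) * math.log(n_docs / df[w])
                     for w in vocab])
    return vocab, rows
```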
Results
Precision of clustering 367 news stories from ABC and CNN.
k determined by the eigengap.
Salience: 3743 words / TF-IDF: 7242 words

Co-clustering
Test set   Precision   Recall   F1
Salience   74.6 %      41 %     52.9 %
TF-IDF     50.4 %      40.7 %   45.1 %

k-means
Test set   Precision   Recall   F1
Salience   69.5 %      37.1 %   48.4 %
TF-IDF     38.3 %      41.8 %   40 %
Results
Precision of clustering 367 news stories from ABC and CNN.
k determined by the eigengap.

Co-clustering
Test set   Precision   Recall   F1
Salience   64.3 %      48.3 %   55.2 %

k-means
Test set   Precision   Recall   F1
Salience   58.3 %      41.7 %   48.8 %
Goals
1. Find aligning segments in
   1.1 text-text pairs
   1.2 text-video pairs
2. Expand to multiple documents (text and video)
Goals
Using aligned segments:
• Create an elaborated story from several sources
• Create links between video and text
• Summarize video and text
• Select the appropriate medium for the information
Segments
Segments can be defined at different resolutions
• in text:
  • word
  • sentence
  • paragraph
• in video:
  • image
  • shot
Problems
• Degrees of comparability:
  • Parallel pairs
  • Near-parallel pairs
  • Comparable pairs
• Representation of segments in different media: how to compare them
Techniques
• Micro-macro alignment
  • Top-down
  • Bottom-up
• Make use of several assumptions:
  • Linearity
  • Low variance of slope
  • Injectivity
• Annealing and Context
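Under the linearity and injectivity assumptions, finding aligning segments amounts to a monotone one-to-one matching, which a simple dynamic program can illustrate. This is a generic sketch, not the seminar's actual technique; `sim` is any segment-similarity matrix, e.g. cosine similarity between text segments or between text and video-shot representations:

```python
def align(sim):
    """sim[i][j]: similarity between segment i of document A and segment j
    of document B. Returns matched (i, j) pairs in order; each segment is
    matched at most once (injectivity) and matches never cross (linearity)."""
    m, n = len(sim), len(sim[0])
    # best[i][j]: best total similarity aligning A[:i] with B[:j].
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(best[i - 1][j],                          # skip A[i-1]
                             best[i][j - 1],                          # skip B[j-1]
                             best[i - 1][j - 1] + sim[i - 1][j - 1])  # match
    # Trace back the matched pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if sim[i - 1][j - 1] > 0 and \
           best[i][j] == best[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

The top-down and bottom-up micro-macro variants would apply such a matching at coarse segment resolutions first (or last), refining within matched regions.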
Multiple documents
Two possible directions
1. Dimension reduction
2. Expand dimensions of search algorithms