Introduction Clustering Alignment
Doctoral Seminar: Multi-document clustering and alignment
Wim De Smet
March 23, 2007
Current goals
CLASS, WP7
1. Cluster documents according to topics.
2. Align text and video
Goal
Given news stories about different events, from several sources, cluster together the stories that report the same event.
Clustering
Typical clustering algorithms use a bag-of-words approach, operating on a document-by-word matrix:

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
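The block structure of this toy matrix is exactly what bag-of-words clustering exploits: documents with similar word profiles have high cosine similarity between their rows. A small NumPy sketch, using the matrix above (illustration only):

```python
import numpy as np

# The example document-by-word matrix from the slide: rows are documents,
# columns are words. Documents 1-3 use words 1-3, documents 4-5 use
# words 4-6, and the last two documents mix both word groups.
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

# Cosine similarity between document rows exposes the block structure:
# documents in the same block are near 1, documents in different blocks
# with disjoint vocabularies are exactly 0.
norms = np.linalg.norm(A, axis=1, keepdims=True)
sim = (A / norms) @ (A / norms).T
```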
Clustering
Document clustering according to word-similarity:

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
Clustering
Word clustering according to document-similarity:

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
Co-clustering
Purpose: simultaneously clustering words and documents, preserving information found in both clusterings.

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4
Co-clustering
Purpose: simultaneously clustering words and documents, preserving information found in both clusterings.

A =
    0.5  0.5  0.5  0    0    0
    0.4  0.6  0.5  0    0    0
    0.5  0.4  0.6  0    0    0
    0    0    0    0.5  0.5  0.5
    0    0    0    0.5  0.5  0.5
    0.4  0.4  0    0.4  0.4  0.4
    0.4  0.4  0.4  0    0.4  0.4

The co-clustering reduces A to a cluster-level matrix, one row per document cluster and one column per word cluster:

    0.5  0
    0    0.5
    0.4  0.4
Hierarchical Co-clustering
Hierarchical co-clustering:
1. Co-cluster documents and words.
2. For each cluster: if it contains too many documents, compute its sub-matrix.
3. Repeat step 1 on the sub-matrix.
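A minimal recursive driver for these steps (a sketch only; `co_cluster` stands in for any flat co-clustering routine, such as the spectral method on the following slides, and is assumed to return one `(doc_indices, word_indices)` pair per co-cluster):

```python
import numpy as np

def hierarchical_co_cluster(A, co_cluster, max_docs):
    """Steps 1-3 above: co-cluster, then re-run the flat routine on the
    sub-matrix of every cluster that still holds too many documents.
    Returns the leaf co-clusters as (doc indices, word indices) pairs,
    indices referring to the original matrix A."""
    leaves = []
    stack = [(A, np.arange(A.shape[0]), np.arange(A.shape[1]))]
    while stack:
        M, docs, words = stack.pop()
        for d, w in co_cluster(M):
            # Recurse only if the cluster is too large AND strictly smaller
            # than its parent (guards against infinite recursion when the
            # flat routine returns everything as one cluster).
            if len(d) > max_docs and len(d) < len(docs):
                stack.append((M[np.ix_(d, w)], docs[d], words[w]))
            else:
                leaves.append((docs[d], words[w]))
    return leaves
```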
Bipartite Spectral Graph Partitioning: motivation
View document-by-word matrix as bipartite graph
A =
                word1  word2  word3
    document1   a1,1   0      0
    document2   0      a2,2   a2,3
    document3   a3,1   a3,2   0
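In this bipartite view, documents and words form the two vertex sets, and A is the off-diagonal block of the graph's full adjacency matrix. A small sketch with placeholder edge weights (not values from the slides):

```python
import numpy as np

# Placeholder edge weights a_ij: edge (document i, word j) carries weight
# A[i, j]; the graph has no document-document or word-word edges.
A = np.array([[0.7, 0.0, 0.0],
              [0.0, 0.5, 0.3],
              [0.4, 0.2, 0.0]])

m, n = A.shape
W = np.block([[np.zeros((m, m)), A],      # document-document block: empty
              [A.T, np.zeros((n, n))]])   # word-word block: empty
```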
Bipartite Spectral Graph Partitioning: motivation
Divide the graph into document clusters Dm and associated word clusters Wm?
Bipartite Spectral Graph Partitioning: motivation
Wm = { wj : ∑i∈Dm Aij ≥ ∑i∈Dl Aij, ∀ l = 1, …, k }
Bipartite Spectral Graph Partitioning: motivation
Dm = { di : ∑j∈Wm Aij ≥ ∑j∈Wl Aij, ∀ l = 1, …, k }
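These two definitions can be read as an assignment rule: given a document partition, each word joins the cluster where it carries the most total weight, and symmetrically for documents. A minimal sketch of the rule (an illustration of the definitions, not the partitioning algorithm itself):

```python
import numpy as np

def word_clusters(A, doc_labels, k):
    # weight[m, j] = sum of A[i, j] over documents i in cluster m
    # (the sum in the definition of W_m above).
    weight = np.vstack([A[doc_labels == m].sum(axis=0) for m in range(k)])
    return weight.argmax(axis=0)   # each word joins its heaviest cluster

def doc_clusters(A, word_labels, k):
    # weight[i, m] = sum of A[i, j] over words j in cluster m
    # (the sum in the definition of D_m above).
    weight = np.column_stack([A[:, word_labels == m].sum(axis=1)
                              for m in range(k)])
    return weight.argmax(axis=1)
```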
Bipartite Spectral Graph Partitioning: algorithm
1. Given the m × n document-by-word matrix A, calculate diagonal helper matrices D1 and D2 such that:

   D1(i, i) = ∑j Aij   for all 1 ≤ i ≤ m
   D2(j, j) = ∑i Aij   for all 1 ≤ j ≤ n

2. Compute An = D1^(−1/2) · A · D2^(−1/2)
3. Take the SVD of An: SVD(An) = U Λ V*
4. Determine k, the number of clusters, by the eigengap:
   k = arg max(m ≥ i > 1) (λi−1 − λi) / λi−1,
   where λ1 ≥ λ2 ≥ · · · ≥ λm are the singular values of An
Bipartite Spectral Graph Partitioning: algorithm (cont.)
5. From U and V, obtain U[2,…,l+1] and V[2,…,l+1] respectively by taking columns 2 to l + 1, where l = ⌈log2 k⌉
6. Compute
   Z = [ D1^(−1/2) U[2,…,l+1] ]
       [ D2^(−1/2) V[2,…,l+1] ]
   and normalize the rows of Z
7. Apply k-means to cluster the rows of Z into k clusters
8. Check for each cluster the number of documents. If this is higher than a given threshold, construct a new document-by-word matrix formed by the documents and words in the cluster, and return to step 1
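Steps 1–7 can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the experiments: the eigengap selection of step 4 and the recursion of step 8 are left out, so k is passed in explicitly, and `_kmeans` is a tiny deterministic stand-in for a real k-means routine.

```python
import numpy as np

def _kmeans(X, k, iters=20):
    # Tiny k-means stand-in for step 7; initialized with evenly spaced rows.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_co_cluster(A, k):
    """Bipartite spectral co-clustering of an m-by-n document-by-word
    matrix A into k clusters; returns (document labels, word labels)."""
    m, n = A.shape
    d1 = A.sum(axis=1)                      # step 1: diagonal of D1
    d2 = A.sum(axis=0)                      #         diagonal of D2
    An = A / np.sqrt(np.outer(d1, d2))      # step 2: D1^-1/2 A D2^-1/2
    U, s, Vt = np.linalg.svd(An)            # step 3: SVD(An) = U Lambda V*
    l = int(np.ceil(np.log2(k)))            # step 5: l = ceil(log2 k)
    Z = np.vstack([U[:, 1:l + 1] / np.sqrt(d1)[:, None],    # step 6
                   Vt.T[:, 1:l + 1] / np.sqrt(d2)[:, None]])
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    labels = _kmeans(Z, k)                  # step 7: cluster rows of Z
    return labels[:m], labels[m:]
```

Because documents and words are embedded in the same space Z, each k-means cluster contains both, which is exactly what makes this a co-clustering.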
Uses of hierarchical co-clustering
• Documents are clustered according to a topic hierarchy
• Words associated with a cluster describe its topic
• Words can be used for offline clustering
Entries of document-by-word matrix
1. TF-IDF
2. WP 2’s Salience
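Option 1 can be sketched with one standard tf·idf variant (length-normalized term frequency, logarithmic idf); the exact weighting used in the experiments may differ, and WP 2's salience scores would simply replace these entries:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, rows), where
    rows[i][j] is the TF-IDF weight of word j in document i."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # tf normalized by document length; idf = log(N / df).
        # Note: a word occurring in every document gets weight 0.
        rows.append([tf[w] / len(doc) * math.log(n_docs / df[w])
                     for w in vocab])
    return vocab, rows
```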
Results
Precision of clustering 367 news stories from ABC and CNN.
k determined by the eigengap.
Salience: 3743 words / TF-IDF: 7242 words

Co-clustering
Test set   Precision   Recall   F1
Salience   74.6 %      41 %     52.9 %
TF-IDF     50.4 %      40.7 %   45.1 %

k-means
Test set   Precision   Recall   F1
Salience   69.5 %      37.1 %   48.4 %
TF-IDF     38.3 %      41.8 %   40 %
Results
Precision of clustering 367 news stories from ABC and CNN.
k determined by the eigengap.

Co-clustering
Test set   Precision   Recall   F1
Salience   64.3 %      48.3 %   55.2 %

k-means
Test set   Precision   Recall   F1
Salience   58.3 %      41.7 %   48.8 %
Goals
1. Find aligning segments in
   1.1 text-text pairs
   1.2 text-video pairs
2. Expand to multiple documents (text and video)
Goals
Using aligned segments:
• Create an elaborated story from several sources
• Create links between video and text
• Summarize video and text
• Select the appropriate medium for the information
Segments
Segments can be defined at different resolutions
• in text:
  • word
  • sentence
  • paragraph
• in video:
  • image
  • shot
Problems
• Degrees of comparability:
  • Parallel pairs
  • Near-parallel pairs
  • Comparable pairs
• Representation of segments in different media: how to compare them
Techniques
• Micro-macro alignment
  • Top-down
  • Bottom-up
• Make use of several assumptions:
  • Linearity
  • Low variance of slope
  • Injectivity
• Annealing and Context
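Under the linearity and injectivity assumptions, finding aligning segments amounts to a monotone one-to-one matching, which a simple dynamic program can illustrate. This is a generic sketch, not the seminar's actual technique; `sim` is any segment-similarity matrix, e.g. cosine similarity between text segments or between text and video-shot representations:

```python
def align(sim):
    """sim[i][j]: similarity between segment i of document A and segment j
    of document B. Returns matched (i, j) pairs in order; each segment is
    matched at most once (injectivity) and matches never cross (linearity)."""
    m, n = len(sim), len(sim[0])
    # best[i][j]: best total similarity aligning A[:i] with B[:j].
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(best[i - 1][j],                          # skip A[i-1]
                             best[i][j - 1],                          # skip B[j-1]
                             best[i - 1][j - 1] + sim[i - 1][j - 1])  # match
    # Trace back the matched pairs.
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if sim[i - 1][j - 1] > 0 and \
           best[i][j] == best[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

The top-down and bottom-up micro-macro variants would apply such a matching at coarse segment resolutions first (or last), refining within matched regions.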
Multiple documents
Two possible directions
1. Dimension reduction
2. Expand dimensions of search algorithms