

Doctoral Seminar: Multi-document clustering and alignment

Wim De Smet

March 23, 2007


Current goals

CLASS, WP7

1. Cluster documents according to topics.

2. Align text and video.


Goal

Given news stories about different events, from several sources, cluster reports of the same story together.


Clustering

Typical clustering algorithms use a bag-of-words approach, based on a document-by-word matrix:

A = \begin{pmatrix}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{pmatrix}


Clustering

Document clustering according to word-similarity: documents with similar word profiles are grouped (here documents 1-3 and documents 4-5; documents 6-7 overlap both groups):

A = \left(\begin{array}{cccccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)


Clustering

Word clustering according to document-similarity: words with similar document profiles are grouped (here words 1-3 and words 4-6):

A = \left(\begin{array}{ccc|ccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)


Co-clustering

Purpose: simultaneously cluster words and documents, preserving the information found in both clusterings.

A = \begin{pmatrix}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{pmatrix}


Co-clustering

The same matrix again; the document clusters appear as blocks of rows:

A = \left(\begin{array}{cccccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)


Co-clustering

Grouping both the rows and the columns exposes the block structure of A, which can be summarized by one representative value per (document-cluster, word-cluster) block:

A = \left(\begin{array}{ccc|ccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)
\quad\Rightarrow\quad
\begin{pmatrix}
0.5 & 0 \\
0 & 0.5 \\
0.4 & 0.4
\end{pmatrix}
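A minimal numpy sketch of this reduction, assuming the document and word cluster assignments are already known (the label arrays below are hypothetical). Each block is summarized here by its mean, which for the mixed documents 6-7 gives roughly 0.33 rather than the representative 0.4 shown above:

import numpy as np

# The document-by-word matrix from the slides.
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

# Hypothetical cluster assignments: documents {1-3}, {4-5}, {6-7}
# and words {1-3}, {4-6} (0-indexed below).
doc_labels = np.array([0, 0, 0, 1, 1, 2, 2])
word_labels = np.array([0, 0, 0, 1, 1, 1])

# Reduced matrix: one entry per (document-cluster, word-cluster) block.
reduced = np.array([
    [A[np.ix_(doc_labels == d, word_labels == w)].mean()
     for w in range(word_labels.max() + 1)]
    for d in range(doc_labels.max() + 1)
])
print(reduced.round(2))   # [[0.5, 0.], [0., 0.5], [0.33, 0.33]]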


Hierarchical Co-clustering

Hierarchical co-clustering:

1. Co-cluster documents and words.

2. For each cluster: if it contains too many documents, compute the corresponding sub-matrix.

3. Repeat step 1 on the sub-matrix.
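A minimal sketch of this recursion, assuming some one-level routine co_cluster(A) that returns a cluster label for every document and every word (hypothetical name; one possible implementation is sketched in the algorithm section below):

import numpy as np

def hierarchical_co_cluster(A, doc_ids, co_cluster, max_docs=50):
    # Recursively co-cluster the document-by-word matrix A. Returns a
    # nested list whose leaves are arrays of original document ids.
    doc_labels, word_labels = co_cluster(A)
    tree = []
    for c in np.unique(doc_labels):
        docs = np.flatnonzero(doc_labels == c)
        words = np.flatnonzero(word_labels == c)
        # Recurse only on a strictly smaller sub-matrix, to avoid
        # looping forever when everything lands in a single cluster.
        if len(docs) > max_docs and len(words) > 0 and len(docs) < A.shape[0]:
            sub = A[np.ix_(docs, words)]  # step 2: the cluster's sub-matrix
            tree.append(hierarchical_co_cluster(sub, doc_ids[docs],
                                                co_cluster, max_docs))
        else:
            tree.append(doc_ids[docs])
    return tree

# Example call, starting from all documents:
# tree = hierarchical_co_cluster(A, np.arange(A.shape[0]), co_cluster)

The max_docs threshold is the "too many documents" test; its value is a free parameter.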


Bipartite Spectral Graph Partitioning: motivation

View the document-by-word matrix as a bipartite graph between documents and words:

A = \begin{array}{c|ccc}
 & \text{word}_1 & \text{word}_2 & \text{word}_3 \\ \hline
\text{document}_1 & a_{1,1} & 0 & 0 \\
\text{document}_2 & 0 & a_{2,2} & a_{2,3} \\
\text{document}_3 & 0 & a_{3,2} & a_{3,3}
\end{array}

Each non-zero entry a_{i,j} is an edge between document i and word j.


Bipartite Spectral Graph Partitioning: motivation

How to divide the graph into document clusters D_m and associated word clusters W_m?


Bipartite Spectral Graph Partitioning: motivation

W_m = \left\{ w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij}, \ \forall l = 1, \dots, k \right\}


Bipartite Spectral Graph Partitioning: motivation

D_m = \left\{ d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij}, \ \forall l = 1, \dots, k \right\}
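In words: a word belongs to the document cluster in which it carries the most weight, and a document to the word cluster from which it draws the most weight. A minimal numpy sketch of the two assignment rules (function and variable names are illustrative):

import numpy as np

def word_clusters(A, doc_labels, k):
    # W_m: word j joins the document cluster m maximizing
    # sum_{i in D_m} A[i, j].
    weight = np.array([A[doc_labels == m].sum(axis=0) for m in range(k)])
    return weight.argmax(axis=0)      # one cluster label per word

def doc_clusters(A, word_labels, k):
    # D_m: document i joins the word cluster m maximizing
    # sum_{j in W_m} A[i, j].
    weight = np.array([A[:, word_labels == m].sum(axis=1) for m in range(k)])
    return weight.argmax(axis=0)      # one cluster label per document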


Bipartite Spectral Graph Partitioning: algorithm

1. Given the m × n document-by-word matrix A, calculate diagonal helper matrices D_1 and D_2, so that:

\forall\, 1 \le i \le m : D_1(i, i) = \sum_j A_{i,j}

\forall\, 1 \le j \le n : D_2(j, j) = \sum_i A_{i,j}

2. Compute A_n = D_1^{-1/2} \, A \, D_2^{-1/2}

3. Take the SVD of A_n: \mathrm{SVD}(A_n) = U \Lambda V^{*}

4. Determine k, the number of clusters, by the eigengap: k = \arg\max_{m \ge i > 1} (\lambda_{i-1} - \lambda_i)/\lambda_{i-1}, where \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m are the singular values of A_n


Bipartite Spectral Graph Partitioning: algorithm (cont.)

5. From U and V, form U_{[2,\dots,l+1]} and V_{[2,\dots,l+1]} by taking columns 2 to l+1, where l = \lceil \log_2 k \rceil

6. Compute Z = \begin{pmatrix} D_1^{-1/2} U_{[2,\dots,l+1]} \\ D_2^{-1/2} V_{[2,\dots,l+1]} \end{pmatrix} and normalize the rows of Z

7. Apply k-means to cluster the rows of Z into k clusters

8. Check the number of documents in each cluster; if it exceeds a given threshold, construct a new document-by-word matrix from the documents and words in that cluster and repeat from step 1
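A minimal Python sketch of one level of this procedure (steps 1-7), under the assumption that A has no empty rows or columns; sklearn's KMeans stands in for step 7, and the returned label pair fits the co_cluster signature used in the recursion sketch earlier. Step 8 is the threshold test handled by that recursion:

import numpy as np
from sklearn.cluster import KMeans

def bipartite_spectral_partition(A):
    # Steps 1-7 for an m-by-n document-by-word matrix A (no empty
    # rows or columns assumed). Returns (doc_labels, word_labels).
    m, n = A.shape
    # Step 1: row and column sums, i.e. the diagonals of D1 and D2.
    d1 = A.sum(axis=1)
    d2 = A.sum(axis=0)
    # Step 2: An = D1^{-1/2} A D2^{-1/2}, computed elementwise.
    An = A / np.sqrt(np.outer(d1, d2))
    # Step 3: singular value decomposition of An.
    U, S, Vt = np.linalg.svd(An, full_matrices=False)
    # Step 4: eigengap; gaps[0] corresponds to i = 2 in the slide formula.
    gaps = (S[:-1] - S[1:]) / S[:-1]
    k = int(np.argmax(gaps)) + 2
    # Step 5: columns 2 .. l+1 (1-indexed) of U and V, l = ceil(log2 k).
    l = int(np.ceil(np.log2(k)))
    Ul = U[:, 1:l + 1]
    Vl = Vt.T[:, 1:l + 1]
    # Step 6: stack the degree-rescaled embeddings, normalize the rows.
    Z = np.vstack([Ul / np.sqrt(d1)[:, None], Vl / np.sqrt(d2)[:, None]])
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    # Step 7: k-means over documents and words together; the first m
    # rows of Z are documents, the remaining n rows are words.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    return labels[:m], labels[m:]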


Uses of hierarchical co-clustering

• Documents are clustered according to a topic hierarchy

• The words associated with a cluster describe its topic

• Words can be used for offline clustering


Entries of document-by-word matrix

1. TF-IDF

2. WP 2’s Salience
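For reference, the standard TF-IDF weight of word w in document d (the salience measure is WP2's own score and is not reproduced here):

\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log \frac{N}{\mathrm{df}(w)}

where tf(w, d) is the frequency of w in d, N is the number of documents, and df(w) is the number of documents containing w.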


Results

Clustering 367 news stories from ABC and CNN; k determined by the eigengap.
Vocabulary: Salience: 3743 words / TF-IDF: 7242 words.

Co-clustering

Test set   Precision   Recall   F1
Salience   74.6 %      41 %     52.9 %
TF-IDF     50.4 %      40.7 %   45.1 %

k-means

Test set   Precision   Recall   F1
Salience   69.5 %      37.1 %   48.4 %
TF-IDF     38.3 %      41.8 %   40 %
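Here F1 is the usual harmonic mean of precision and recall,

F_1 = \frac{2 \cdot P \cdot R}{P + R}

e.g. for the Salience co-clustering row: 2 \cdot 74.6 \cdot 41 / (74.6 + 41) \approx 52.9.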


Results

Clustering 367 news stories from ABC and CNN; k determined by the eigengap.

Co-clustering

Test set   Precision   Recall   F1
Salience   64.3 %      48.3 %   55.2 %

k-means

Test set   Precision   Recall   F1
Salience   58.3 %      41.7 %   48.8 %


Goals

1. Find aligned segments in:

   1.1 text-text pairs
   1.2 text-video pairs

2. Expand to multiple documents (text and video)


Goals

Using aligned segments:

• Create an elaborated story from several sources

• Create links between video and text

• Summarize video and text

• Select the appropriate medium for each piece of information


Segments

Segments can be defined at different resolutions

• in text:
  • word
  • sentence
  • paragraph

• in video:
  • image
  • shot



Problems

• Degrees of comparability:
  • Parallel pairs
  • Near-parallel pairs
  • Comparable pairs

• Representation of segments in different media: how to compare them?


Techniques

• Micro-macro alignment:
  • Top-down
  • Bottom-up

• Make use of several assumptions:
  • Linearity
  • Low variance of slope
  • Injectivity

• Annealing and Context


Multiple documents

Two possible directions:

1. Dimension reduction

2. Expand the search algorithms to more dimensions
