Introduction Clustering Alignment Doctoral Seminar: Multi-document clustering and alignment Wim De Smet March 23, 2007

Upload: weedejes

Post on 15-Nov-2014


Page 1: Presentatie

Introduction Clustering Alignment

Doctoral Seminar: Multi-document clustering and alignment

Wim De Smet

March 23, 2007

Page 2: Presentatie


Current goals

CLASS, WP7

1. Cluster documents according to topics.

2. Align text and video

Page 3: Presentatie

Goal

Given news stories about different events, from several sources, cluster the stories that cover the same event.

Page 4: Presentatie

Clustering

Typical clustering algorithms: bag of words approach.

Document-by-words matrix:

$$
A = \begin{pmatrix}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{pmatrix}
$$
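A minimal sketch of how such a document-by-words matrix could be built from raw text. The whitespace tokenizer, the relative-frequency weighting, and the three toy documents are assumptions for illustration; the slides do not say how the example matrix was produced.

```python
from collections import Counter

def doc_word_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({w for toks in tokenized for w in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for an empty document
        # Relative term frequency; word order is discarded (bag of words).
        matrix.append([counts[w] / total for w in vocab])
    return vocab, matrix

# Hypothetical toy corpus.
docs = ["storm hits coast", "storm damages coast", "election results announced"]
vocab, A = doc_word_matrix(docs)
```

Rows of `A` for the two storm stories share non-zero columns, while the election story does not, which is the structure a clustering algorithm exploits.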

Page 5: Presentatie

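Clustering

Document clustering according to word-similarity:

$$
A = \left(\begin{array}{cccccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)
$$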

Page 6: Presentatie

Clustering

Word clustering according to document-similarity:

$$
A = \begin{pmatrix}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{pmatrix}
$$

Page 7: Presentatie

Co-clustering

Purpose: simultaneously clustering words and documents, preserving information found in both clusterings.

$$
A = \begin{pmatrix}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{pmatrix}
$$

Page 8: Presentatie

Co-clustering

Purpose: simultaneously clustering words and documents, preserving information found in both clusterings.

$$
A = \left(\begin{array}{cccccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)
$$

Page 9: Presentatie

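Co-clustering

Purpose: simultaneously clustering words and documents, preserving information found in both clusterings.

$$
A = \left(\begin{array}{cccccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)
$$

$$
\begin{pmatrix}
0.5 & 0 \\
0 & 0.5 \\
0.4 & 0.4
\end{pmatrix}
$$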
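The small 3-by-2 matrix on this slide reads as one summary value per (document cluster, word cluster) block of A. The slide does not say how each value is computed; the sketch below assumes the block median and assumes the word clusters are columns 1 to 3 and columns 4 to 6. Under those two assumptions it happens to reproduce the values shown.

```python
import numpy as np

A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])
doc_labels = np.array([0, 0, 0, 1, 1, 2, 2])  # row groups as drawn on the slide
word_labels = np.array([0, 0, 0, 1, 1, 1])    # assumed column groups

def block_summary(A, row_labels, col_labels):
    """Median of every (row cluster, column cluster) block (assumed summary statistic)."""
    n_rows, n_cols = row_labels.max() + 1, col_labels.max() + 1
    S = np.zeros((n_rows, n_cols))
    for r in range(n_rows):
        for c in range(n_cols):
            block = A[np.ix_(row_labels == r, col_labels == c)]
            S[r, c] = np.median(block)
    return S

S = block_summary(A, doc_labels, word_labels)
```

The third row of `S` shows why co-clustering keeps extra information: the last document cluster is tied to both word clusters, which a one-sided clustering would hide.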

Page 10: Presentatie

Hierarchical Co-clustering

Hierarchical co-clustering:

1. Co-cluster documents and words.

2. For each cluster: if it contains too many documents, compute its sub-matrix.

3. Repeat step 1 on the sub-matrix.
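The three steps above amount to a recursion over sub-matrices. The sketch below shows only that control flow: `cocluster` is a hypothetical callable standing in for any co-clustering routine, `max_docs` is an assumed name for the "too many documents" threshold, and `split_in_half` is a toy stand-in used purely to make the example runnable.

```python
def hierarchical_cocluster(doc_ids, word_ids, cocluster, max_docs):
    """Build a tree of co-clusters; split any node holding more than max_docs documents.

    cocluster(doc_ids, word_ids) is a hypothetical callable returning a list of
    (doc_subset, word_subset) pairs for the sub-matrix spanned by those ids.
    """
    node = {"docs": doc_ids, "words": word_ids, "children": []}
    if len(doc_ids) <= max_docs:
        return node  # small enough: leaf cluster
    parts = cocluster(doc_ids, word_ids)
    # Stop if the split made no progress, to avoid infinite recursion.
    if len(parts) < 2 or any(len(d) == len(doc_ids) for d, _ in parts):
        return node
    for sub_docs, sub_words in parts:
        node["children"].append(
            hierarchical_cocluster(sub_docs, sub_words, cocluster, max_docs)
        )
    return node

def split_in_half(doc_ids, word_ids):
    """Toy stand-in for a real co-clustering step (not a clustering method)."""
    d, w = len(doc_ids) // 2, len(word_ids) // 2
    return [(doc_ids[:d], word_ids[:w]), (doc_ids[d:], word_ids[w:])]

tree = hierarchical_cocluster(list(range(8)), list(range(6)), split_in_half, max_docs=3)
```

Each node keeps its own word subset, so the words attached to a node can later serve as a description of that node's topic.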

Page 11: Presentatie

Bipartite Spectral Graph Partitioning: motivation

View document-by-word matrix as bipartite graph

$$
A = \begin{array}{l|ccc}
 & \text{word1} & \text{word2} & \text{word3} \\ \hline
\text{document1} & a_{1,1} & 0 & 0 \\
\text{document2} & 0 & a_{2,2} & a_{2,3} \\
\text{document3} & a_{3,2} & a_{3,3} & 0
\end{array}
$$

Page 12: Presentatie

Bipartite Spectral Graph Partitioning: motivation

Divide the graph into document clusters $D_m$ and associated word clusters $W_m$?

Page 13: Presentatie

Bipartite Spectral Graph Partitioning: motivation

$$
W_m = \left\{ w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij},\ \forall l = 1, \ldots, k \right\}
$$

Page 14: Presentatie

Bipartite Spectral Graph Partitioning: motivation

$$
D_m = \left\{ d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij},\ \forall l = 1, \ldots, k \right\}
$$
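The two set definitions can be read as assignment rules: a word goes to the document cluster where its summed weight is largest, and a document goes to the word cluster where its summed weight is largest. The sketch below applies both rules to the example matrix from the clustering slides. The function names are illustrative, and ties are broken by taking the lowest cluster index, which is an assumption (the definitions use ≥ and so allow ties).

```python
import numpy as np

def assign_words(A, doc_labels, k):
    """Word j joins the cluster m that maximizes the sum of A[i, j] over documents i in D_m."""
    weights = np.vstack([A[doc_labels == m].sum(axis=0) for m in range(k)])
    return weights.argmax(axis=0)

def assign_docs(A, word_labels, k):
    """Document i joins the cluster m that maximizes the sum of A[i, j] over words j in W_m."""
    weights = np.vstack([A[:, word_labels == m].sum(axis=1) for m in range(k)])
    return weights.argmax(axis=0)

A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])
doc_labels = np.array([0, 0, 0, 1, 1, 2, 2])
word_labels = assign_words(A, doc_labels, k=3)   # words follow the document clusters
doc_from_words = assign_docs(A, word_labels, k=2)
```

Here no word is assigned to the third document cluster, because the last two documents spread their weight over both word groups.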

Page 15: Presentatie

Bipartite Spectral Graph Partitioning: algorithm

1. Given the $m \times n$ document-by-word matrix $A$, calculate the diagonal helper matrices $D_1$ and $D_2$, so that:

$$
\forall\, 1 \le i \le m:\quad D_1(i,i) = \sum_j A_{i,j}
$$

$$
\forall\, 1 \le j \le n:\quad D_2(j,j) = \sum_i A_{i,j}
$$

2. Compute $A_n = D_1^{-1/2} A D_2^{-1/2}$

3. Take the SVD of $A_n$: $\mathrm{SVD}(A_n) = U \Lambda V^*$

4. Determine $k$, the number of clusters, by the eigengap: $k = \arg\max_{m \ge i > 1} (\lambda_{i-1} - \lambda_i)/\lambda_{i-1}$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$ are the singular values of $A_n$
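Steps 1 to 4 can be sketched with NumPy as follows. This is an illustration under assumptions, not the author's implementation: it assumes A has no all-zero row or column, the function names are made up, and `eigengap_index` returns the maximizing index i exactly as the formula is written (some formulations use i − 1 as the cluster count; the slide does not settle this).

```python
import numpy as np

def normalize(A):
    """Step 2: A_n = D1^(-1/2) A D2^(-1/2), assuming no all-zero row or column."""
    d1 = A.sum(axis=1)  # diagonal of D1: row sums (step 1)
    d2 = A.sum(axis=0)  # diagonal of D2: column sums (step 1)
    return A / np.sqrt(np.outer(d1, d2))

def eigengap_index(singular_values):
    """Step 4 as written: the 1-based index i > 1 with the largest relative gap."""
    s = np.asarray(singular_values, dtype=float)
    s = s[s > 1e-12]  # drop numerically zero values to avoid 0/0
    if len(s) < 2:
        return 1
    gaps = (s[:-1] - s[1:]) / s[:-1]
    return int(np.argmax(gaps)) + 2  # gaps[0] corresponds to i = 2

A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])
An = normalize(A)
U, s, Vt = np.linalg.svd(An, full_matrices=False)  # step 3; s is sorted descending
k = eigengap_index(s)
```

With this normalization the largest singular value is always 1, which is why the first singular vectors carry no cluster information and are skipped later.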

Page 16: Presentatie

Bipartite Spectral Graph Partitioning: algorithm (cont.)

5. From $U$ and $V$, calculate $U_{[2,\cdots,l+1]}$ and $V_{[2,\cdots,l+1]}$ respectively, by taking columns 2 to $l+1$, where $l = \lceil \log_2 k \rceil$

6. Compute

$$
Z = \begin{bmatrix} D_1^{-1/2}\, U_{[2,\cdots,l+1]} \\ D_2^{-1/2}\, V_{[2,\cdots,l+1]} \end{bmatrix}
$$

and normalize the rows of $Z$

7. Apply k-means to cluster the rows of $Z$ into $k$ clusters

8. Check the number of documents in each cluster. If it is higher than a given threshold, construct a new document-by-word matrix formed by the documents and words in the cluster, and return to step 1
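Steps 5 to 7 can be sketched as below, repeating the normalization and SVD so the block runs on its own. Several choices are assumptions rather than anything stated on the slides: rows are normalized to unit Euclidean length, k-means is a plain Lloyd iteration with a deterministic farthest-point initialization, and the 4-by-4 matrix is a made-up, nearly block-diagonal example. Step 8 (recursing on large clusters) is left out.

```python
import math
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd k-means with farthest-point initialization (assumed variant)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])  # next center: point farthest from chosen ones
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def cocluster(A, k):
    """Return (document labels, word labels); assumes positive row and column sums."""
    d1, d2 = A.sum(axis=1), A.sum(axis=0)
    An = A / np.sqrt(np.outer(d1, d2))
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    l = max(1, math.ceil(math.log2(k)))            # step 5: l = ceil(log2 k)
    Z = np.vstack([                                # step 6: stack scaled vectors
        U[:, 1:l + 1] / np.sqrt(d1)[:, None],      # columns 2..l+1 of U
        Vt.T[:, 1:l + 1] / np.sqrt(d2)[:, None],   # columns 2..l+1 of V
    ])
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Z = Z / np.where(norms > 0, norms, 1.0)        # normalize the rows of Z
    labels = kmeans(Z, k)                          # step 7
    m = A.shape[0]
    return labels[:m], labels[m:]

# Hypothetical example: two topics with weak cross-links.
A = np.array([
    [1.0, 1.0, 0.1, 0.0],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 1.0],
    [0.0, 0.1, 1.0, 1.0],
])
doc_labels, word_labels = cocluster(A, k=2)
```

Because documents and words are embedded in the same space Z, a single k-means run labels both, so each document cluster arrives with its associated words.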

Page 17: Presentatie


Uses of a hierarchical co-clustering

• Documents are clustered according to topic hierarchy

• Words associated with cluster describe topic

• Words can be used for offline clustering

Page 18: Presentatie


Entries of document-by-word matrix

1. TF-IDF

2. WP 2’s Salience
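For the first option, a minimal TF-IDF sketch is shown below. The slides do not specify which TF-IDF variant was used, so the relative term frequency, the unsmoothed `log(N / df)` weighting, the whitespace tokenizer, and the toy documents are all assumptions. The Salience measure is not sketched because the slides give no definition of it.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows) with TF-IDF weights; assumes non-empty documents."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        rows.append([tf[w] / len(toks) * math.log(n_docs / df[w]) for w in vocab])
    return vocab, rows

# Hypothetical toy corpus.
docs = ["storm hits coast", "storm damages coast", "election results announced"]
vocab, W = tfidf_matrix(docs)
```

Words that occur in fewer documents receive a higher weight, so `hits` outweighs `storm` in the first document even though both occur once.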

Page 19: Presentatie

Results

Precision of clustering 367 news stories from ABC and CNN.

k is determined by the eigengap.

Salience: 3743 words / TF-IDF: 7242 words

Co-clustering

Test set    Precision   Recall    F1
Salience    74.6 %      41 %      52.9 %
TF-IDF      50.4 %      40.7 %    45.1 %

k-means

Test set    Precision   Recall    F1
Salience    69.5 %      37.1 %    48.4 %
TF-IDF      38.3 %      41.8 %    40 %
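As a reading aid, F1 is commonly defined as the harmonic mean of precision and recall. The sketch below assumes that definition applies here; the slides do not state how the scores were computed or averaged, and because the listed figures are rounded, not every row need reproduce to the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Co-clustering with Salience: precision 74.6 %, recall 41 %.
score = f1(74.6, 41.0)
print(round(score, 1))  # 52.9
```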

Page 20: Presentatie

Results

Precision of clustering 367 news stories from ABC and CNN.

k is determined by the eigengap.

Co-clustering

Test set    Precision   Recall    F1
Salience    64.3 %      48.3 %    55.2 %

k-means

Test set    Precision   Recall    F1
Salience    58.3 %      41.7 %    48.8 %

Page 21: Presentatie

Goals

1. Find aligning segments in

1.1 text-text pairs

1.2 text-video pairs

2. Expand to multiple documents (text and video)

Page 22: Presentatie


Goals

Using aligned segments:

• Create elaborated story from several sources

• Create links between video and text

• Summarize video and text

• Select the appropriate media form for the information

Page 23: Presentatie

Segments

Segments can be defined at different resolutions

• in text:
  • word
  • sentence
  • paragraph

• in video:
  • image
  • shot

• Expand to multiple documents (text and video)

Page 24: Presentatie

Problems

• Degrees of comparability:
  • Parallel pairs
  • Near-parallel pairs
  • Comparable pairs

• Representation of segments in different media: how to compare

Page 25: Presentatie

Techniques

• Micro-macro alignment
  • Top-down
  • Bottom-up

• Make use of several assumptions:
  • Linearity
  • Low variance of slope
  • Injectivity

• Annealing and Context

Page 26: Presentatie


Multiple documents

Two possible directions

1. Dimension reduction

2. Expand dimensions of search algorithms