

Doctoral Seminar: Multi-document clustering and alignment

Wim De Smet

March 23, 2007


Current goals

CLASS, WP7

1. Cluster documents according to topics.

2. Align text and video.


Goal

Given news stories about different events, from several sources, cluster reports of the same story together.


Clustering

Typical clustering algorithms use a bag-of-words approach, based on a document-by-word matrix:

A = \begin{pmatrix}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{pmatrix}


Clustering

Document clustering according to word-similarity: documents with similar word profiles are grouped (here documents 1-3 and documents 4-5; documents 6-7 overlap both groups):

A = \left(\begin{array}{cccccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)


Clustering

Word clustering according to document-similarity: words with similar document profiles are grouped (here words 1-3 and words 4-6):

A = \left(\begin{array}{ccc|ccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)


Co-clustering

Purpose: simultaneously cluster words and documents, preserving the information found in both clusterings.

A = \begin{pmatrix}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{pmatrix}


Co-clustering

The same matrix again; the document clusters appear as blocks of rows:

A = \left(\begin{array}{cccccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)


Co-clustering

Grouping both the rows and the columns exposes the block structure of A, which can be summarized by one representative value per (document-cluster, word-cluster) block:

A = \left(\begin{array}{ccc|ccc}
0.5 & 0.5 & 0.5 & 0 & 0 & 0 \\
0.4 & 0.6 & 0.5 & 0 & 0 & 0 \\
0.5 & 0.4 & 0.6 & 0 & 0 & 0 \\ \hline
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\
0 & 0 & 0 & 0.5 & 0.5 & 0.5 \\ \hline
0.4 & 0.4 & 0 & 0.4 & 0.4 & 0.4 \\
0.4 & 0.4 & 0.4 & 0 & 0.4 & 0.4
\end{array}\right)
\quad\Rightarrow\quad
\begin{pmatrix}
0.5 & 0 \\
0 & 0.5 \\
0.4 & 0.4
\end{pmatrix}
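A minimal numpy sketch of this reduction, assuming the document and word cluster assignments are already known (the label arrays below are hypothetical). Each block is summarized here by its mean, which for the mixed documents 6-7 gives roughly 0.33 rather than the representative 0.4 shown above:

import numpy as np

# The document-by-word matrix from the slides.
A = np.array([
    [0.5, 0.5, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.4, 0.4, 0.0, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.0, 0.4, 0.4],
])

# Hypothetical cluster assignments: documents {1-3}, {4-5}, {6-7}
# and words {1-3}, {4-6} (0-indexed below).
doc_labels = np.array([0, 0, 0, 1, 1, 2, 2])
word_labels = np.array([0, 0, 0, 1, 1, 1])

# Reduced matrix: one entry per (document-cluster, word-cluster) block.
reduced = np.array([
    [A[np.ix_(doc_labels == d, word_labels == w)].mean()
     for w in range(word_labels.max() + 1)]
    for d in range(doc_labels.max() + 1)
])
print(reduced.round(2))   # [[0.5, 0.], [0., 0.5], [0.33, 0.33]]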


Hierarchical Co-clustering

Hierarchical co-clustering:

1. Co-cluster documents and words.

2. For each cluster: if it contains too many documents, compute the corresponding sub-matrix.

3. Repeat step 1 on the sub-matrix.
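A minimal sketch of this recursion, assuming some one-level routine co_cluster(A) that returns a cluster label for every document and every word (hypothetical name; one possible implementation is sketched in the algorithm section below):

import numpy as np

def hierarchical_co_cluster(A, doc_ids, co_cluster, max_docs=50):
    # Recursively co-cluster the document-by-word matrix A. Returns a
    # nested list whose leaves are arrays of original document ids.
    doc_labels, word_labels = co_cluster(A)
    tree = []
    for c in np.unique(doc_labels):
        docs = np.flatnonzero(doc_labels == c)
        words = np.flatnonzero(word_labels == c)
        # Recurse only on a strictly smaller sub-matrix, to avoid
        # looping forever when everything lands in a single cluster.
        if len(docs) > max_docs and len(words) > 0 and len(docs) < A.shape[0]:
            sub = A[np.ix_(docs, words)]  # step 2: the cluster's sub-matrix
            tree.append(hierarchical_co_cluster(sub, doc_ids[docs],
                                                co_cluster, max_docs))
        else:
            tree.append(doc_ids[docs])
    return tree

# Example call, starting from all documents:
# tree = hierarchical_co_cluster(A, np.arange(A.shape[0]), co_cluster)

The max_docs threshold is the "too many documents" test; its value is a free parameter.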


Bipartite Spectral Graph Partitioning: motivation

View the document-by-word matrix as a bipartite graph between documents and words:

A = \begin{array}{c|ccc}
 & \text{word}_1 & \text{word}_2 & \text{word}_3 \\ \hline
\text{document}_1 & a_{1,1} & 0 & 0 \\
\text{document}_2 & 0 & a_{2,2} & a_{2,3} \\
\text{document}_3 & 0 & a_{3,2} & a_{3,3}
\end{array}

Each non-zero entry a_{i,j} is an edge between document i and word j.


Bipartite Spectral Graph Partitioning: motivation

How to divide the graph into document clusters D_m and associated word clusters W_m?


Bipartite Spectral Graph Partitioning: motivation

W_m = \left\{ w_j : \sum_{i \in D_m} A_{ij} \ge \sum_{i \in D_l} A_{ij}, \ \forall l = 1, \dots, k \right\}


Bipartite Spectral Graph Partitioning: motivation

D_m = \left\{ d_i : \sum_{j \in W_m} A_{ij} \ge \sum_{j \in W_l} A_{ij}, \ \forall l = 1, \dots, k \right\}
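In words: a word belongs to the document cluster in which it carries the most weight, and a document to the word cluster from which it draws the most weight. A minimal numpy sketch of the two assignment rules (function and variable names are illustrative):

import numpy as np

def word_clusters(A, doc_labels, k):
    # W_m: word j joins the document cluster m maximizing
    # sum_{i in D_m} A[i, j].
    weight = np.array([A[doc_labels == m].sum(axis=0) for m in range(k)])
    return weight.argmax(axis=0)      # one cluster label per word

def doc_clusters(A, word_labels, k):
    # D_m: document i joins the word cluster m maximizing
    # sum_{j in W_m} A[i, j].
    weight = np.array([A[:, word_labels == m].sum(axis=1) for m in range(k)])
    return weight.argmax(axis=0)      # one cluster label per document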


Bipartite Spectral Graph Partitioning: algorithm

1. Given the m × n document-by-word matrix A, calculate diagonal helper matrices D_1 and D_2, so that:

\forall\, 1 \le i \le m : D_1(i, i) = \sum_j A_{i,j}

\forall\, 1 \le j \le n : D_2(j, j) = \sum_i A_{i,j}

2. Compute A_n = D_1^{-1/2} \, A \, D_2^{-1/2}

3. Take the SVD of A_n: \mathrm{SVD}(A_n) = U \Lambda V^{*}

4. Determine k, the number of clusters, by the eigengap: k = \arg\max_{m \ge i > 1} (\lambda_{i-1} - \lambda_i)/\lambda_{i-1}, where \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m are the singular values of A_n


Bipartite Spectral Graph Partitioning: algorithm (cont.)

5. From U and V, form U_{[2,\dots,l+1]} and V_{[2,\dots,l+1]} by taking columns 2 to l+1, where l = \lceil \log_2 k \rceil

6. Compute Z = \begin{pmatrix} D_1^{-1/2} U_{[2,\dots,l+1]} \\ D_2^{-1/2} V_{[2,\dots,l+1]} \end{pmatrix} and normalize the rows of Z

7. Apply k-means to cluster the rows of Z into k clusters

8. Check the number of documents in each cluster; if it exceeds a given threshold, construct a new document-by-word matrix from the documents and words in that cluster and repeat from step 1
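A minimal Python sketch of one level of this procedure (steps 1-7), under the assumption that A has no empty rows or columns; sklearn's KMeans stands in for step 7, and the returned label pair fits the co_cluster signature used in the recursion sketch earlier. Step 8 is the threshold test handled by that recursion:

import numpy as np
from sklearn.cluster import KMeans

def bipartite_spectral_partition(A):
    # Steps 1-7 for an m-by-n document-by-word matrix A (no empty
    # rows or columns assumed). Returns (doc_labels, word_labels).
    m, n = A.shape
    # Step 1: row and column sums, i.e. the diagonals of D1 and D2.
    d1 = A.sum(axis=1)
    d2 = A.sum(axis=0)
    # Step 2: An = D1^{-1/2} A D2^{-1/2}, computed elementwise.
    An = A / np.sqrt(np.outer(d1, d2))
    # Step 3: singular value decomposition of An.
    U, S, Vt = np.linalg.svd(An, full_matrices=False)
    # Step 4: eigengap; gaps[0] corresponds to i = 2 in the slide formula.
    gaps = (S[:-1] - S[1:]) / S[:-1]
    k = int(np.argmax(gaps)) + 2
    # Step 5: columns 2 .. l+1 (1-indexed) of U and V, l = ceil(log2 k).
    l = int(np.ceil(np.log2(k)))
    Ul = U[:, 1:l + 1]
    Vl = Vt.T[:, 1:l + 1]
    # Step 6: stack the degree-rescaled embeddings, normalize the rows.
    Z = np.vstack([Ul / np.sqrt(d1)[:, None], Vl / np.sqrt(d2)[:, None]])
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    # Step 7: k-means over documents and words together; the first m
    # rows of Z are documents, the remaining n rows are words.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    return labels[:m], labels[m:]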


Uses of hierarchical co-clustering

• Documents are clustered according to a topic hierarchy

• The words associated with a cluster describe its topic

• Words can be used for offline clustering


Entries of document-by-word matrix

1. TF-IDF

2. WP 2’s Salience
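For reference, the standard TF-IDF weight of word w in document d (the salience measure is WP2's own score and is not reproduced here):

\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log \frac{N}{\mathrm{df}(w)}

where tf(w, d) is the frequency of w in d, N is the number of documents, and df(w) is the number of documents containing w.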


Results

Clustering 367 news stories from ABC and CNN; k determined by the eigengap.
Vocabulary: Salience: 3743 words / TF-IDF: 7242 words.

Co-clustering

Test set   Precision   Recall   F1
Salience   74.6 %      41 %     52.9 %
TF-IDF     50.4 %      40.7 %   45.1 %

k-means

Test set   Precision   Recall   F1
Salience   69.5 %      37.1 %   48.4 %
TF-IDF     38.3 %      41.8 %   40 %
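Here F1 is the usual harmonic mean of precision and recall,

F_1 = \frac{2 \cdot P \cdot R}{P + R}

e.g. for the Salience co-clustering row: 2 \cdot 74.6 \cdot 41 / (74.6 + 41) \approx 52.9.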


Results

Clustering 367 news stories from ABC and CNN; k determined by the eigengap.

Co-clustering

Test set   Precision   Recall   F1
Salience   64.3 %      48.3 %   55.2 %

k-means

Test set   Precision   Recall   F1
Salience   58.3 %      41.7 %   48.8 %


Goals

1. Find aligned segments in:

   1.1 text-text pairs
   1.2 text-video pairs

2. Expand to multiple documents (text and video)


Goals

Using aligned segments:

• Create an elaborated story from several sources

• Create links between video and text

• Summarize video and text

• Select the appropriate medium for each piece of information


Segments

Segments can be defined at different resolutions

• in text:
  • word
  • sentence
  • paragraph

• in video:
  • image
  • shot



Problems

• Degrees of comparability:
  • Parallel pairs
  • Near-parallel pairs
  • Comparable pairs

• Representation of segments in different media: how to compare them?


Techniques

• Micro-macro alignment:
  • Top-down
  • Bottom-up

• Make use of several assumptions:
  • Linearity
  • Low variance of slope
  • Injectivity

• Annealing and Context


Multiple documents

Two possible directions:

1. Dimension reduction

2. Expand the search algorithms to more dimensions
