
Information Retrieval

Lecture 7
Introduction to Information Retrieval (Manning et al. 2007)

Chapter 17

For the MSc Computer Science Programme

Dell Zhang
Birkbeck, University of London

Yahoo! Hierarchy

http://dir.yahoo.com/science

[Tree diagram: a portion of the Yahoo! directory under science, with top-level categories agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, and missions.]

Hierarchical Clustering

Build a tree-like hierarchical taxonomy (dendrogram) from a set of unlabeled documents.

Divisive (top-down): Start with all documents belonging to the same cluster; eventually each document forms a cluster of its own. Obtained by recursive application of a (flat) partitional clustering algorithm, e.g., k-Means with k=2: bisecting k-Means (see the sketch below).

Agglomerative (bottom-up): Start with each document being a single cluster; eventually all documents belong to the same cluster.
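The divisive route can be sketched as repeated two-way k-Means splits. A minimal Python sketch, assuming the documents have already been vectorised as rows of a NumPy array X; the function name bisecting_kmeans and the "split the largest cluster" rule are illustrative choices, not prescribed by the slides:

# Bisecting k-Means: divisive clustering by recursive 2-way k-Means splits.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_clusters):
    clusters = [np.arange(X.shape[0])]        # start: all documents in one cluster
    while len(clusters) < num_clusters:
        # Pick the largest current cluster and split it into two with k-Means (k=2).
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters                           # list of index arrays, one per cluster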

Dendrogram

Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

The number of clusters k is not required in advance.
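As a concrete illustration of such a cut, SciPy first builds the full hierarchy and then cuts it at a chosen distance; a minimal sketch assuming toy random document vectors (the 0.5 threshold is arbitrary):

# Build a single-link dendrogram and cut it at a desired distance level.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)                    # 20 toy "documents", 5 features each
Z = linkage(X, method='single')              # full agglomerative hierarchy
labels = fcluster(Z, t=0.5, criterion='distance')   # cut the tree at distance 0.5
print(labels)                                # cluster id per document; k follows from the cut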

Dendrogram – Example

Clusters of News Stories: Reuters RCV1

Dendrogram – Example

Clusters of Things that People Want: ZEBO

HAC

Hierarchical Agglomerative Clustering (HAC) starts with each document in a separate cluster and repeats the following until there is only one cluster:

Among the current clusters, determine the pair of clusters, ci and cj, that are most similar (Single-Link, Complete-Link, etc.).

Then merge ci and cj into a single cluster.

The history of merging forms a binary tree (hierarchy).
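A naive Python sketch of this loop, assuming a cluster-level similarity function cluster_sim (single-link, complete-link, or similar) is supplied; the quadratic scan per merge is written for clarity rather than efficiency:

# Naive HAC: repeatedly merge the most similar pair of clusters.
def hac(docs, cluster_sim):
    clusters = [[d] for d in docs]            # each document starts as its own cluster
    merges = []                               # the merge history is the hierarchy
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):        # find the most similar pair (ci, cj)
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]    # merge ci and cj into a single cluster
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges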

Single-Link

The similarity between a pair of clusters is defined by the single strongest link (i.e., maximum cosine similarity) between their members:

sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)

After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

sim(c_i \cup c_j, c_k) = \max\big( sim(c_i, c_k),\ sim(c_j, c_k) \big)
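These two definitions translate directly into code. A minimal sketch, assuming each document is a NumPy vector and sim is cosine similarity (the function names are illustrative):

# Single-link similarity between clusters, and its value after a merge.
import numpy as np

def sim(x, y):
    # cosine similarity between two document vectors
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def single_link_sim(ci, cj):
    # sim(ci, cj) = max over x in ci, y in cj of sim(x, y)
    return max(sim(x, y) for x in ci for y in cj)

def merged_sim(ci, cj, ck):
    # sim(ci U cj, ck) = max(sim(ci, ck), sim(cj, ck))
    return max(single_link_sim(ci, ck), single_link_sim(cj, ck))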

HAC – Example

[Step-by-step diagram with five documents d1, d2, d3, d4, d5: d1 and d2 merge into {d1, d2}; d4 and d5 merge into {d4, d5}; d3 then joins {d4, d5} to form {d3, d4, d5}; finally all documents end up in one cluster.]

As clusters agglomerate, the documents fall into a dendrogram.

HAC – Example (Single-Link)
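The single-link example above can be reproduced with SciPy. A toy reconstruction, assuming made-up 1-D coordinates chosen only so that single-link yields the merge order on the slide (d1+d2, d4+d5, then d3 joining {d4, d5}); the numbers themselves are illustrative, not taken from the slides:

# Toy single-link HAC over five "documents" d1..d5.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

docs = np.array([[0.0], [0.05], [0.9], [1.3], [1.4]])   # d1..d5
Z = linkage(docs, method='single')
print(Z)    # each row: the two clusters merged and the distance at which they merged
# dendrogram(Z, labels=['d1', 'd2', 'd3', 'd4', 'd5'])  # plot with matplotlib if desired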

Take Home Message

Single-Link HAC
Dendrogram

sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)

sim(c_i \cup c_j, c_k) = \max\big( sim(c_i, c_k),\ sim(c_j, c_k) \big)