
9/18/2001 Information Organization and Retrieval

Vector Representation, Term Weights and Clustering

(continued)

Ray Larson & Warren Sack

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval

Lecture authors: Marti Hearst & Ray Larson & Warren Sack


Last Time

• Document Vectors

• Inverted Files

• Vector Space Model

• Term Weighting

• Clustering


Document Vectors

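Terms (columns): nova, galaxy, heat, h’wood, film, role, diet, fur

Document ids (rows): A, B, C, D, E, F, G, H, I

[Table: term weights for each document. The nonzero entries of each row are listed below; their column positions are not recoverable.]

A: 10 5 3
B: 5 10
C: 10 8 7
D: 9 10 5
E: 10 10
F: 9 10
G: 5 7 9
H: 6 10 2 8
I: 7 5 1 3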


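We Can Plot the Vectors

[Figure: document vectors plotted on two term axes, Star and Diet; the plotted points are labeled “Doc about astronomy”, “Doc about movie stars”, and “Doc about mammal behavior”]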


Inverted Index

• This is the primary data structure for text indexes

• Main Idea:

– Invert documents into a big index

• Basic steps:

– Make a “dictionary” of all the tokens in the collection

– For each token, list all the docs it occurs in.

– Do a few things to reduce redundancy in the data structure
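The basic steps above can be sketched in Python. This is a minimal illustration rather than the lecture's implementation: the function name `build_inverted_index`, the whitespace tokenizer, and the toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it.

    `docs` is a dict of doc id -> text. Tokenizing by lowercasing and
    splitting on whitespace is an assumption made for this sketch.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # A set stores each doc id once per token, however often
            # the token repeats inside that document.
            postings[token].add(doc_id)
    # Sorted keys form the "dictionary"; sorted ids form each postings list.
    return {token: sorted(ids) for token, ids in sorted(postings.items())}


# Toy collection (hypothetical documents, not from the lecture).
docs = {
    "D1": "nova galaxy heat",
    "D2": "film role film",
    "D3": "galaxy film",
}
index = build_inverted_index(docs)
print(index["film"])    # ids of the documents containing "film"
print(index["galaxy"])  # ids of the documents containing "galaxy"
```

Storing each document id once per token is one simple way to reduce redundancy; real indexes typically go further, for example by compressing the postings lists.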


Inverted Indexes

We have seen “Vector files” conceptually.

An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
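The row/column inversion can be sketched as a matrix transpose. The values below are the first seven documents of the table; the variable names (`vector_file`, `inverted_file`, `postings`) are hypothetical and chosen only for this illustration.

```python
# Document-by-term matrix: one row per document, columns t1, t2, t3
# (documents D1..D7 from the table).
doc_ids = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]
terms = ["t1", "t2", "t3"]
vector_file = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
]

# zip(*rows) transposes: rows become columns and columns become rows.
inverted_file = {term: list(col) for term, col in zip(terms, zip(*vector_file))}

# Keeping only the documents marked 1 gives each term's postings list.
postings = {
    term: [d for d, bit in zip(doc_ids, row) if bit]
    for term, row in inverted_file.items()
}
print(inverted_file["t1"])
print(postings["t2"])
```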


Vector Space Model

• Documents are represented as vectors in term space

– Terms are usually stems

– Documents represented by binary vectors of terms

• Queries represented the same as documents

• Query and Document weights are based on length and direction of their vector

• A vector distance measure between the query and documents is used to rank retrieved documents

• This makes partial matching possible


Documents in 3D Space

Assumption: Documents that are “close together” in space are similar in meaning.


Assigning Weights

• tf x idf measure:

– term frequency (tf)

– inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution

• Goal: assign a tf * idf weight to each term in each document
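A small sketch of the goal stated above. The slide does not fix a formula at this point, so the variant used here, weight = tf × log(N / df), is an assumption (one common choice among several); the function name `tf_idf` and the toy documents are likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df).

    tf is the raw count of the term in the document, N the number of
    documents, and df the number of documents containing the term.
    This particular variant is an assumption made for the sketch.
    """
    n_docs = len(docs)
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())  # each document counts once per term
    return {
        doc_id: {t: tf * math.log(n_docs / df[t]) for t, tf in term_counts.items()}
        for doc_id, term_counts in counts.items()
    }


# Toy collection (hypothetical documents).
docs = {
    "A": "nova galaxy nova",
    "B": "galaxy film",
    "C": "film role",
}
weights = tf_idf(docs)
print(weights["A"])
```

In this variant a term that occurs in every document gets weight 0, which is how idf discounts very common words.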


Similarity Measures

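Simple matching (coordination level match): $|Q \cap D|$

Dice’s Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$

Jaccard’s Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$

Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$

Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$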
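The five measures can be sketched in Python, assuming the query Q and the document D are each treated as a set of terms (the binary case). The function names and the example term sets are hypothetical, and every function except `simple_matching` assumes both sets are non-empty.

```python
import math


def simple_matching(q, d):
    """Coordination level match: number of shared terms, |Q ∩ D|."""
    return len(q & d)


def dice(q, d):
    """Dice: 2|Q ∩ D| / (|Q| + |D|)."""
    return 2 * len(q & d) / (len(q) + len(d))


def jaccard(q, d):
    """Jaccard: |Q ∩ D| / |Q ∪ D|."""
    return len(q & d) / len(q | d)


def cosine(q, d):
    """Cosine (set form): |Q ∩ D| / (sqrt(|Q|) * sqrt(|D|))."""
    return len(q & d) / (math.sqrt(len(q)) * math.sqrt(len(d)))


def overlap(q, d):
    """Overlap: |Q ∩ D| / min(|Q|, |D|)."""
    return len(q & d) / min(len(q), len(d))


# Hypothetical query and document term sets sharing two terms.
q = {"star", "film", "role"}
d = {"star", "film", "galaxy", "nova"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(q, d), 3))
```

All four coefficients are the same overlap count divided by a different normalizer, so they differ only in how strongly they penalize long queries or long documents.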


Computing Similarity Scores

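$D_1 = (0.8, 0.3)$

$D_2 = (0.2, 0.7)$

$Q = (0.4, 0.8)$

$\cos \alpha_1 = 0.74$

$\cos \alpha_2 = 0.98$

[Figure: Q, D1 and D2 drawn as vectors on two axes running from 0 to 1.0; α1 is the angle between Q and D1, α2 the angle between Q and D2]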
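The scores for this example can be checked with a short sketch of the cosine between two weighted vectors. The vectors are the ones on the slide, Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7); the function name `cosine` and the ranking step are illustrative assumptions. Computed directly, the two cosines come out at about 0.73 and 0.98.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


q = (0.4, 0.8)
d1 = (0.8, 0.3)
d2 = (0.2, 0.7)

scores = {"D1": cosine(q, d1), "D2": cosine(q, d2)}

# Rank documents by decreasing similarity to the query.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

D2 ranks first because its direction is closer to the query's, even though D1 has the larger single weight.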


Text Clustering

Clustering is “The art of finding groups in data.” Kaufman and Rousseeuw, Finding Groups in Data, 1990

[Figure: documents plotted on two axes, Term 1 and Term 2]




Types of Clustering

• Hierarchical vs. Flat

• Hard vs. Soft vs. Disjunctive (set vs. uncertain vs. multiple assignment)


Flat Clustering

• K-Means

– Hard

– O(n)

• EM (soft version of K-Means)


K-Means Clustering

• 1 Create a pair-wise similarity measure

• 2 Find K centers

• 3 Assign each document to nearest center, forming new clusters

• 4 Repeat 3 as necessary
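The steps above can be sketched as follows. Several choices are assumptions made for the example and are not from the slides: Euclidean distance stands in for the pair-wise measure, the first K points serve as the initial centers (to keep the run deterministic), and each center is moved to the mean of its cluster between assignment passes, which is the usual K-means update. The function name `k_means` and the toy points are hypothetical, and K is assumed to be no larger than the number of points.

```python
import math


def k_means(points, k, max_iters=100):
    """Cluster `points` (tuples of floats) into k groups.

    Returns (centers, assignments), where assignments[i] is the index
    of the center nearest to points[i].
    """
    # Step 2: pick K centers (here simply the first k points; an assumption).
    centers = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # Step 4: stop once the assignment no longer changes.
        assignments = new_assignments
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignments


# Two well-separated groups of 2-D points (hypothetical data).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, assignments = k_means(points, k=2)
print(assignments)
print(centers)
```

Each pass compares every point with every center, so the cost per pass grows linearly with the number of points for a fixed K.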


Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93

Hearst & Pedersen 95

• Cluster sets of documents into general “themes”, like a table of contents

• Display the contents of the clusters by showing topical terms and typical titles

• User chooses subsets of the clusters and re-clusters the documents within

• Resulting new groups have different “themes”


Scatter/Gather Example: query on “star”

Encyclopedia text

14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous

Clustering and re-clustering is entirely automated


Another use of clustering

• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.

• “Project” these onto a 2D graphical representation:


Clustering Multi-Dimensional Document Space

Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow, “Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995


Clustering Multi-Dimensional Document Space

Wise et al., 1995


Concept “Landscapes”

Browsing without search

[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]

(e.g., Xia Lin, “Visualization for the Document Space,” 1992)

Based on Kohonen feature maps; see http://websom.hut.fi/websom/


More examples of information visualization

• Stuart Card, Jock Mackinlay, Ben Shneiderman (eds.) Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)

• Martin Dodge, www.cybergeography.org


Clustering

• Advantages:

– See some main themes

• Disadvantage:

– Many ways documents could group together are hidden

• Thinking point: what is the relationship to classification systems and faceted queries?

e.g., f1: (osteoporosis OR ‘bone loss’) f2: (drugs OR pharmaceuticals) f3: (prevention OR cure)


More information on content analysis and clustering

• Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)

• Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)


And now on to…

• Vector Space Ranking

• Probabilistic Models and Ranking
