9/18/2001 Information Organization and Retrieval
Vector Representation, Term Weights and Clustering
(continued)
Ray Larson & Warren Sack
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson & Warren Sack
Last Time
• Document Vectors
• Inverted Files
• Vector Space Model
• Term Weighting
• Clustering
Document Vectors
nova galaxy heat h’wood film role diet fur
10 5 3
5 10
10 8 7
9 10 5
10 10
9 10
5 7 9
6 10 2 8
7 5 1 3
Document ids: A B C D E F G H I
We Can Plot the Vectors
[Figure: documents plotted on the axes “Star” and “Diet”: a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]
Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
– Invert documents into a big index
• Basic steps:
– Make a “dictionary” of all the tokens in the collection
– For each token, list all the docs it occurs in.
– Do a few things to reduce redundancy in the data structure
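The basic steps above can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the course: the whitespace tokenizer, the function name build_inverted_index, and the three-document collection are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenizer: lowercase, then split on whitespace.
        for token in text.lower().split():
            # A set records each document once per token, however often it occurs.
            postings[token].add(doc_id)
    # The sorted keys play the role of the "dictionary" of tokens.
    return {token: sorted(ids) for token, ids in sorted(postings.items())}

# Hypothetical three-document collection.
docs = {
    "D1": "nova galaxy heat",
    "D2": "film role film",
    "D3": "galaxy film",
}
index = build_inverted_index(docs)
print(index["galaxy"])  # ['D1', 'D3']
print(index["film"])    # ['D2', 'D3']
```

Storing document ids in a set is one simple way to "reduce redundancy": a token repeated within a document still yields a single posting.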
Inverted Indexes
We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows.

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
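The inversion is just a matrix transpose. The sketch below uses the D1 to D10 values from the slide; the variable names (doc_term, term_doc, postings) are chosen for this example only, and the sparse postings list at the end is one assumed way of storing the result.

```python
# Rows are documents D1..D10, columns are terms t1..t3 (values from the slide).
doc_term = [
    [1, 0, 1],  # D1
    [1, 0, 0],  # D2
    [0, 1, 1],  # D3
    [1, 0, 0],  # D4
    [1, 1, 1],  # D5
    [1, 1, 0],  # D6
    [0, 1, 0],  # D7
    [0, 1, 0],  # D8
    [0, 0, 1],  # D9
    [0, 1, 1],  # D10
]

# Transposing turns each term's column into a row: one row per term.
term_doc = [list(column) for column in zip(*doc_term)]

# A sparse postings list keeps only the documents in which the term occurs.
postings = {
    f"t{t + 1}": [f"D{d + 1}" for d, present in enumerate(row) if present]
    for t, row in enumerate(term_doc)
}

print(term_doc[0][:7])  # [1, 1, 0, 1, 1, 1, 0]
print(postings["t3"])   # ['D1', 'D3', 'D5', 'D9', 'D10']
```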
Vector Space Model
• Documents are represented as vectors in term space
– Terms are usually stems
– Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
• This makes partial matching possible
Documents in 3D Space
Assumption: Documents that are “close together” in space are similar in meaning.
Assigning Weights
• tf x idf measure:
– term frequency (tf)
– inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term in each document
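A minimal sketch of the goal stated above. The slide does not give a formula, so this example assumes one common variant: raw term count for tf and log(N / df) for idf, where N is the number of documents and df is the number of documents containing the term. The function name tf_idf and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df).

    Assumed variant: tf is the raw count of the term in the document,
    df is the number of documents that contain the term.
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())  # each document counts once per term
    return {
        doc_id: {term: tf * math.log(n_docs / df[term]) for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }

# Hypothetical pre-tokenized collection.
docs = {
    "A": ["nova", "nova", "galaxy"],
    "B": ["galaxy", "film"],
    "C": ["film", "role", "film"],
}
weights = tf_idf(docs)
for doc_id, terms in weights.items():
    print(doc_id, {term: round(w, 3) for term, w in terms.items()})
```

Under this variant a term that appears in every document gets weight 0, and a term that is frequent in one document but rare in the collection ("nova" in A) gets the highest weight.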
Similarity Measures
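Simple matching (coordination level match): $|Q \cap D|$
Dice’s Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
Jaccard’s Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$
Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$
Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$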
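For binary (set-valued) queries and documents, the five coefficients named above can be sketched directly with Python sets. The function names and the example term sets are assumptions for illustration, and the sketch assumes both sets are non-empty so that no denominator is zero.

```python
import math

def simple_matching(q, d):
    return len(q & d)                            # |Q ∩ D|

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))    # 2|Q ∩ D| / (|Q| + |D|)

def jaccard(q, d):
    return len(q & d) / len(q | d)               # |Q ∩ D| / |Q ∪ D|

def cosine(q, d):
    return len(q & d) / math.sqrt(len(q) * len(d))  # |Q ∩ D| / (|Q|^1/2 |D|^1/2)

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))      # |Q ∩ D| / min(|Q|, |D|)

# Hypothetical query and document term sets.
q = {"star", "galaxy"}
d = {"star", "galaxy", "nova", "heat"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(q, d), 3))
```

All but simple matching normalize the size of the intersection, so longer documents are not favored merely for containing more terms.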
Computing Similarity Scores
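$Q = (0.4, 0.8)$
$D_1 = (0.8, 0.3)$
$D_2 = (0.2, 0.7)$
$\cos \alpha_1 = 0.74$
$\cos \alpha_2 = 0.98$
[Figure: Q, D1 and D2 drawn as vectors on two axes running from 0 to 1.0; α1 is the angle between Q and D1, α2 the angle between Q and D2]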
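The scores for this example can be checked with a short cosine computation. This is a sketch using the three vectors from the slide; the helper name cosine is chosen for the example. Computed directly, the D2 score rounds to the slide's 0.98, while the D1 score comes out near 0.73, so the slide's 0.74 may reflect rounding in the original.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors taken from the slide.
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

# Rank the documents by decreasing similarity to the query.
ranked = sorted(
    {"D1": D1, "D2": D2}.items(),
    key=lambda item: cosine(Q, item[1]),
    reverse=True,
)
for name, vector in ranked:
    print(name, round(cosine(Q, vector), 2))
```

D2 ranks first because it points in nearly the same direction as Q, even though D1 has a larger first component.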
Text Clustering
Clustering is “the art of finding groups in data.” (Kaufman and Rousseeuw, Finding Groups in Data, 1990)
[Figure: documents plotted on the axes “Term 1” and “Term 2”]
Types of Clustering
• Hierarchical vs. Flat
• Hard vs. Soft vs. Disjunctive
(set vs. uncertain vs. multiple assignment)
Flat Clustering
• K-Means
– Hard
– O(n)
• EM (soft version of K-Means)
K-Means Clustering
• 1 Create a pair-wise similarity measure
• 2 Find K centers
• 3 Assign each document to nearest center, forming new clusters
• 4 Repeat 3 as necessary
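The four steps above can be sketched as follows. Several choices here are assumptions rather than anything stated on the slide: Euclidean distance stands in for the pair-wise measure, the first K points seed the centers, and each center is recomputed as the mean of its members between passes, as in the standard k-means formulation. The function name k_means and the six 2-D points are hypothetical.

```python
import math

def k_means(points, k, max_iter=100):
    """Hard flat clustering; returns (centers, assignments)."""
    # Step 2 (assumed seeding): use the first k points as initial centers.
    centers = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest center (step 1: Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 4: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each center as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignments

# Hypothetical 2-D "documents": two visibly separate groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
centers, assignments = k_means(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

The result depends on the seeding: with a poor choice of initial centers the same data can settle into a worse grouping, which is why practical implementations usually try several starts.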
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93
Hearst & Pedersen 95
• Cluster sets of documents into general “themes”, like a table of contents
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”
Scatter/Gather Example: query on “star”
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated
Another use of clustering
• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
• “Project” these onto a 2D graphical representation:
Clustering Multi-Dimensional Document Space
Wise, Thomas, Pennock, Lantrip, Pottier, Schur, Crow
“Visualizing the Non-Visual: Spatial analysis and interaction with Information from text documents,” 1995
Clustering Multi-Dimensional Document Space
Wise et al., 1995
Concept “Landscapes”
Browsing without search
[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]
(e.g., Xia Lin, “Visualization for the Document Space,” 1992)
Based on Kohonen feature maps; see http://websom.hut.fi/websom/
More examples of information visualization
• Stuart Card, Jock Mackinlay, Ben Shneiderman (eds.) Readings in Information Visualization (San Francisco: Morgan Kaufmann, 1999)
• Martin Dodge, www.cybergeography.org
Clustering
• Advantages:
– See some main themes
• Disadvantage:
– Many ways documents could group together are hidden
• Thinking point: what is the relationship to classification systems and faceted queries?
e.g., f1: (osteoporosis OR ‘bone loss’) f2: (drugs OR pharmaceuticals) f3: (prevention OR cure)
More information on content analysis and clustering
• Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999)
• Daniel Jurafsky and James Martin, Speech and Language Processing (Upper Saddle River, NJ: Prentice Hall, 2000)
And now on to…
• Vector Space Ranking
• Probabilistic Models and Ranking