Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map
Daniel X. Pape, Community Architectures for Network Information Systems, www.canis.uiuc.edu. CSNA’98, 6/18/98.
![Page 1: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/1.jpg)
Visualization and Navigation of Document Information Spaces Using a
Self-Organizing Map
Daniel X. Pape
Community Architectures for Network Information Systems
CSNA’98 6/18/98
![Page 2: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/2.jpg)
Overview
• Self-Organizing Map (SOM) Algorithm
• U-Matrix Algorithm for SOM Visualization
• SOM Navigation Application
• Document Representation and Collection Examples
• Problems and Optimizations
• Future Work
![Page 3: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/3.jpg)
Basic SOM Algorithm
• Input
  – Number (n) of Feature Vectors (x)
  – format:
      vector name: a, b, c, d
  – examples:
      1: 0.1, 0.2, 0.3, 0.4
      2: 0.2, 0.3, 0.3, 0.2
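The input format above can be read with a few lines of Python. This is a minimal sketch, assuming one vector per line and comma-separated values as shown on the slide; the function name `parse_feature_vector` is hypothetical and is not part of the original software.

```python
def parse_feature_vector(line):
    """Split 'name: a, b, c, d' into the vector name and a list of floats."""
    name, _, values = line.partition(":")
    return name.strip(), [float(v) for v in values.split(",")]


# The two example vectors from the slide.
name1, x1 = parse_feature_vector("1: 0.1, 0.2, 0.3, 0.4")
name2, x2 = parse_feature_vector("2: 0.2, 0.3, 0.3, 0.2")
print(name1, x1)  # 1 [0.1, 0.2, 0.3, 0.4]
```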
![Page 4: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/4.jpg)
Basic SOM Algorithm
• Output
  – Neural network Map of (M) Nodes
  – Each node has an associated Weight Vector (m) of the same dimensionality as the input feature vectors
  – Examples:
      m1: 0.1, 0.2, 0.3, 0.4
      m2: 0.2, 0.3, 0.3, 0.2
![Page 5: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/5.jpg)
Basic SOM Algorithm
• Output (cont.)
  – Nodes laid out in a grid:
![Page 6: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/6.jpg)
Basic SOM Algorithm
• Other Parameters
  – Number of timesteps (T)
  – Learning Rate (eta)
![Page 7: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/7.jpg)
Basic SOM Algorithm
SOM() {
    foreach timestep t {
        foreach feature vector fv {
            wnode = find_winning_node(fv)
            update_local_neighborhood(wnode, fv)
        }
    }
}
find_winning_node(fv) {
    foreach node n {
        compute distance of n's weight vector m to fv
    }
    return node with the smallest distance
}
update_local_neighborhood(wnode, fv) {
    foreach node n in the neighborhood of wnode {
        m = m + eta [x - m]    (m: weight vector of n, x: feature vector fv)
    }
}
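The pseudocode can be turned into a small runnable program. The following is a sketch under stated assumptions, not the presenter's implementation: the slides do not specify the distance measure, the neighborhood shape, or the initialization, so squared Euclidean distance, a square neighborhood of a given `radius`, and random initial weights are all assumed here, and the `train_som` name and `radius` and `seed` parameters are hypothetical.

```python
import math
import random


def find_winning_node(weights, fv):
    """Return the (row, col) of the node whose weight vector is closest to fv."""
    best, best_dist = None, math.inf
    for pos, m in weights.items():
        dist = sum((x - w) ** 2 for x, w in zip(fv, m))  # squared Euclidean (assumed)
        if dist < best_dist:
            best, best_dist = pos, dist
    return best


def update_local_neighborhood(weights, wnode, fv, eta, radius):
    """Apply m = m + eta * (x - m) to every node within `radius` of the winner."""
    wr, wc = wnode
    for (r, c), m in weights.items():
        if max(abs(r - wr), abs(c - wc)) <= radius:  # square neighborhood (assumed)
            for i, x in enumerate(fv):
                m[i] += eta * (x - m[i])


def train_som(feature_vectors, rows, cols, timesteps, eta, radius=1, seed=0):
    """Train a rows x cols map; returns a dict mapping (row, col) to a weight vector."""
    rng = random.Random(seed)
    dim = len(feature_vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]  # random init (assumed)
               for r in range(rows) for c in range(cols)}
    for _ in range(timesteps):
        for fv in feature_vectors:
            wnode = find_winning_node(weights, fv)
            update_local_neighborhood(weights, wnode, fv, eta, radius)
    return weights


vectors = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.3, 0.2]]
som = train_som(vectors, rows=3, cols=3, timesteps=20, eta=0.5)
```

Practical SOM implementations usually shrink both `eta` and the neighborhood radius over time; the slides list only a single learning rate, so the sketch keeps both constant.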
![Page 8: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/8.jpg)
U-Matrix Visualization
• Provides a simple way to visualize cluster boundaries on the map
• Simple algorithm:
  – for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors
• The average distance measures how similar a node is to its neighbors: a small value means the node closely resembles them
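The averaging step can be sketched as follows. This is an illustration under assumptions: the slide does not say which nodes count as "immediate neighbors" or which distance is used, so the four grid neighbors (up, down, left, right) and Euclidean distance are assumed, and the `u_matrix` name is hypothetical.

```python
import math


def u_matrix(weights, rows, cols):
    """Average distance from each node's weight vector to its immediate neighbors.

    `weights[r][c]` is the weight vector of the node at grid position (r, c).
    """
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-neighborhood (assumed)
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result


# A 1 x 3 map: the first two nodes are identical, the third is far away.
grid = [[[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]]
print(u_matrix(grid, rows=1, cols=3))  # [[0.0, 2.5, 5.0]]
```

The large values at the second and third nodes mark the boundary between the two groups of weight vectors.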
![Page 9: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/9.jpg)
U-Matrix Visualization
• Interpretation
  – one can encode the U-Matrix measurements as greyscale values in an image, or as altitudes on a terrain
  – the result is a landscape that represents the document space: the valleys, or dark areas, are the clusters of data, and the mountains, or light areas, are the boundaries between the clusters
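The greyscale encoding can be sketched as a linear rescaling of the U-Matrix values. The mapping below is an assumption (the slides do not give the exact scaling): the smallest value becomes 0 (dark, inside a cluster) and the largest becomes 255 (light, a cluster boundary). The `to_greyscale` name is hypothetical.

```python
def to_greyscale(u, levels=255):
    """Linearly rescale U-Matrix values to integer grey levels in [0, levels]."""
    lo = min(min(row) for row in u)
    hi = max(max(row) for row in u)
    span = (hi - lo) or 1.0  # avoid division by zero on a completely flat map
    return [[round((v - lo) / span * levels) for v in row] for row in u]


print(to_greyscale([[0.0, 1.0, 4.0]]))  # [[0, 64, 255]]
```

The same rescaled values could be used as altitudes for the terrain view instead of grey levels.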
![Page 10: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/10.jpg)
U-Matrix Visualization
• Example:
  – dataset of random three-dimensional points, arranged in four obvious clusters
![Page 11: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/11.jpg)
U-Matrix Visualization
Four (color-coded) clusters of three-dimensional points
![Page 12: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/12.jpg)
U-Matrix Visualization
Oblique projection of a terrain derived from the U-Matrix
![Page 13: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/13.jpg)
U-Matrix Visualization
Terrain for a real document collection
![Page 14: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/14.jpg)
Current Labeling Procedure
• Feature vectors are encoded as 0s and 1s
• Weight vectors have real values from 0 to 1
• Sort weight vector dimensions by element value
  – dimension with greatest value is “best” noun phrase for that node
• Aggregate nodes with the same “best” noun phrase into groups
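The labeling procedure can be sketched as below. This is an illustration under assumptions: it groups nodes purely by their "best" noun phrase and ignores whether the grouped nodes are adjacent on the map, which the slide does not specify. The `label_nodes` name and the sample weights are hypothetical.

```python
from collections import defaultdict


def label_nodes(weights, noun_phrases):
    """Map each "best" noun phrase to the list of node positions that chose it."""
    groups = defaultdict(list)
    for pos, m in weights.items():
        # Taking the maximum is equivalent to sorting and keeping the top dimension.
        best_dim = max(range(len(m)), key=lambda i: m[i])
        groups[noun_phrases[best_dim]].append(pos)
    return dict(groups)


phrases = ["alexander", "king", "battle"]
node_weights = {
    (0, 0): [0.9, 0.4, 0.1],
    (0, 1): [0.7, 0.2, 0.3],
    (1, 0): [0.1, 0.2, 0.8],
}
print(label_nodes(node_weights, phrases))
# {'alexander': [(0, 0), (0, 1)], 'battle': [(1, 0)]}
```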
![Page 15: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/15.jpg)
U-Matrix Navigation
• 3D Space-Flight
• Hierarchical Navigation
![Page 16: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/16.jpg)
Document Data
• Noun phrases extracted
• Set of unique noun phrases computed
  – each noun phrase becomes a dimension of the data set
• Each document is represented by a binary vector, with a 1 or a 0 denoting the presence or absence of each noun phrase
![Page 17: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/17.jpg)
Document Data
• Example:
  – 10 total noun phrases: alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death
  – each element of the feature vector will be a 1 or a 0:
    • 1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0
    • 2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
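The encoding in this example can be reproduced with a short sketch. It assumes noun-phrase extraction has already been done, so each document arrives as a list of phrases; the `to_feature_vector` name and the two phrase lists are hypothetical, chosen so that they yield the two vectors shown on the slide.

```python
def to_feature_vector(doc_phrases, vocabulary):
    """Binary vector: 1 where the document contains the noun phrase, else 0."""
    present = set(doc_phrases)
    return [1 if phrase in present else 0 for phrase in vocabulary]


vocabulary = ["alexander", "king", "macedonians", "darius", "philip",
              "horse", "soldiers", "battle", "army", "death"]

# Hypothetical documents, already reduced to their noun phrases.
doc1 = ["alexander", "king", "philip", "horse"]
doc2 = ["king", "darius", "soldiers", "battle", "army", "death"]

print(to_feature_vector(doc1, vocabulary))  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
print(to_feature_vector(doc2, vocabulary))  # [0, 1, 0, 1, 0, 0, 1, 1, 1, 1]
```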
![Page 18: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/18.jpg)
Document Collection Examples

| Collection | Number of Documents | Number of Noun Phrases | Execution Time |
|---|---|---|---|
| Biosis | 1,194 | 2,032 | 17 days |
| Ancien-l | 6,703 | 34,486 | 66 days |
| Compendex | 162,338 | 22,324 | ~3.4 years |
| Cancerlit | 624,674 | 16,882 | ~12.1 years |
![Page 19: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/19.jpg)
Problems
• As document sets get larger, the feature vectors get longer, use more memory, etc.
• Execution time grows to unrealistic lengths
![Page 20: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/20.jpg)
Solutions?
• Need algorithm refinements for sparse feature vectors
• Need a faster way to do the find_winning_node() computation
• Need a better way to do the update_local_neighborhood() computation
![Page 21: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/21.jpg)
Sparse Vector Optimization
• Intelligent support for sparse feature vectors
  – saves on memory usage
  – greatly improves speed of the weight vector update computation
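The slides do not say how the sparse support was implemented. One common way to get both benefits for binary feature vectors is sketched below, purely as an assumption: the feature vector is stored as the list of indices that are 1, and each weight vector is stored as `scale * values`, so the update m = m + eta [x - m] only touches the nonzero dimensions. The `SparseNode` class and its methods are hypothetical names.

```python
class SparseNode:
    """Weight vector stored as scale * values, so updates touch only nonzero inputs."""

    def __init__(self, values):
        self.scale = 1.0
        self.values = list(values)
        self.norm_sq = sum(v * v for v in values)  # cached ||m||^2

    def weight(self, i):
        return self.scale * self.values[i]

    def distance_sq(self, ones):
        # For binary x: ||x - m||^2 = ||m||^2 - 2 * sum(m_i where x_i = 1) + count of ones
        dot = sum(self.weight(i) for i in ones)
        return self.norm_sq - 2.0 * dot + len(ones)

    def update(self, ones, eta):
        # m = m + eta * (x - m) = (1 - eta) * m + eta * x, requires 0 <= eta < 1
        dot = sum(self.weight(i) for i in ones)
        self.norm_sq = ((1 - eta) ** 2 * self.norm_sq
                        + 2 * eta * (1 - eta) * dot
                        + eta ** 2 * len(ones))
        self.scale *= (1 - eta)          # shrinks every dimension in O(1)
        for i in ones:                   # adds eta only where x_i = 1
            self.values[i] += eta / self.scale


node = SparseNode([0.5, 0.5, 0.5, 0.5])
ones = [0, 2]                            # sparse form of x = [1, 0, 1, 0]
print(node.distance_sq(ones))            # 1.0
node.update(ones, eta=0.5)
print([node.weight(i) for i in range(4)])  # [0.75, 0.25, 0.75, 0.25]
```

With this layout both the distance and the update cost time proportional to the number of noun phrases in the document rather than the full vocabulary size. A long-running version would need to fold `scale` back into `values` occasionally to avoid floating-point underflow.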
![Page 22: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/22.jpg)
Faster find_winning_node()
• SOM weight vectors become partially ordered very quickly
![Page 23: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/23.jpg)
Faster find_winning_node()
U-Matrix Visualization of an Initial, Unordered SOM
![Page 24: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/24.jpg)
Faster find_winning_node()
Partially Ordered SOM after 5 timesteps
![Page 25: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/25.jpg)
Faster find_winning_node()
• Don’t do a global search for the winner
• Start search from last known winner position
• Pro:
  – usually finds a new winner very quickly
• Con:
  – this new search for a winner can sometimes get stuck in a local minimum
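A local search of this kind can be sketched as a greedy descent over the grid. This is an assumption about the general idea, not the presenter's code: starting from the last known winner, it repeatedly moves to the closest of the eight surrounding nodes until no neighbor is closer. The `local_winner_search` name and the sample maps are hypothetical.

```python
def local_winner_search(weights, rows, cols, fv, start):
    """Greedy search for the winner, starting from a previously winning position."""
    def dist_sq(pos):
        return sum((x - w) ** 2 for x, w in zip(fv, weights[pos]))

    current, current_dist = start, dist_sq(start)
    while True:
        r, c = current
        neighbors = [(r + dr, c + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols]
        best = min(neighbors, key=dist_sq, default=current)
        if dist_sq(best) >= current_dist:
            return current  # no neighbor is closer: a (possibly local) minimum
        current, current_dist = best, dist_sq(best)


# Ordered 1 x 5 map: the search walks straight to the true winner.
ordered = {(0, c): [w] for c, w in enumerate([0.0, 0.2, 0.4, 0.6, 0.8])}
print(local_winner_search(ordered, 1, 5, [0.8], start=(0, 0)))  # (0, 4)

# Map with a ridge in the middle: the search stops at (0, 4) even though
# node (0, 0) matches the input exactly.
ridge = {(0, c): [w] for c, w in enumerate([0.0, 0.5, 0.9, 0.5, 0.05])}
print(local_winner_search(ridge, 1, 5, [0.0], start=(0, 3)))    # (0, 4)
```

The second call shows the drawback named on the slide: once the map is only partially ordered, the greedy walk can settle on a node that is not the global winner.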
![Page 26: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/26.jpg)
Better Neighborhood Update
• Nodes get told to “update” quite often
• Weight vector is made public only during a find_winning_node() search
• With local find_winning_node() search, a lazy neighborhood weight vector update can be performed
![Page 27: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/27.jpg)
Better Neighborhood Update
• Cache update requests
  – each node will store the winning node and feature vector for each update request
• The node performs the update computations called for by the stored update requests only when asked for its weight vector
• Possible reduction of number of requests by averaging the feature vectors in the cache
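The caching scheme can be sketched as a node that queues requests and replays them on demand. This is a minimal illustration under assumptions: a constant learning rate is used, so the stored winning node is kept as described on the slide but does not affect the arithmetic here (a distance-dependent learning rate would use it). The `LazyNode` class and its method names are hypothetical.

```python
class LazyNode:
    """Node that defers weight updates until its weight vector is requested."""

    def __init__(self, weights, eta):
        self._weights = list(weights)
        self._eta = eta
        self._pending = []  # cached update requests: (winning node, feature vector)

    def request_update(self, wnode, fv):
        self._pending.append((wnode, fv))  # constant time, no arithmetic yet

    def pending_count(self):
        return len(self._pending)

    def weight_vector(self):
        # Replay the cached requests in order, then clear the cache.
        for _wnode, fv in self._pending:
            for i, x in enumerate(fv):
                self._weights[i] += self._eta * (x - self._weights[i])
        self._pending.clear()
        return self._weights


node = LazyNode([0.0, 0.0], eta=0.5)
node.request_update((0, 0), [1.0, 1.0])
node.request_update((0, 1), [1.0, 0.0])
print(node.pending_count())   # 2
print(node.weight_vector())   # [0.75, 0.25]
print(node.pending_count())   # 0
```

Nodes that are far from the region currently being searched never have their weight vector read, so their queued updates cost nothing until the search reaches them.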
![Page 28: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/28.jpg)
New Execution Times

| Collection | Execution Time | Speedup |
|---|---|---|
| Biosis | 2.3 hours | 180x |
| Ancien-l | 10.2 hours | 160x |
| Compendex | ~8.4 days | 150x |
| Cancerlit | ~1 month | 150x |
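The speedup column follows from dividing the original execution times by the new ones. The check below is a sketch that assumes 365-day years and a 30-day month, neither of which is stated on the slides; the variable names are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365   # assumed
DAYS_PER_MONTH = 30   # assumed

before_hours = {
    "Biosis": 17 * HOURS_PER_DAY,
    "Ancien-l": 66 * HOURS_PER_DAY,
    "Compendex": 3.4 * DAYS_PER_YEAR * HOURS_PER_DAY,
    "Cancerlit": 12.1 * DAYS_PER_YEAR * HOURS_PER_DAY,
}
after_hours = {
    "Biosis": 2.3,
    "Ancien-l": 10.2,
    "Compendex": 8.4 * HOURS_PER_DAY,
    "Cancerlit": 1 * DAYS_PER_MONTH * HOURS_PER_DAY,
}

speedup = {name: before_hours[name] / after_hours[name] for name in before_hours}
for name, factor in speedup.items():
    print(f"{name}: about {round(factor / 10) * 10}x")
```

Rounded to the nearest ten, the ratios agree with the 180x, 160x, 150x and 150x figures reported on the slide.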
![Page 29: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/29.jpg)
Future Work
• Parallelization
• Label Problem
![Page 30: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/30.jpg)
Label Problem
• Current Procedure not very good
• Cluster boundaries
• Term selection
![Page 31: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/31.jpg)
Cluster Boundaries
• Image processing
• Geometric
![Page 32: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/32.jpg)
Cluster Boundaries
• Image processing example:
![Page 33: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map](https://reader035.vdocuments.site/reader035/viewer/2022070415/56814eee550346895dbc7d77/html5/thumbnails/33.jpg)
Term Selection
• Too many unique noun phrases
  – Too many dimensions in the feature vector data
• “Knee” of frequency curve