data mining and visualization (2) alfredo vellido
TRANSCRIPT
Data Miningand visualization
(2)
Alfredo Vellido
Plan
A brief introduction to data visualization
Visualization & history
Perception
Visual exploratory DM
The good, the bad & the ugly …
Visualization Visualization recaprecap
Recap … PRINCIPLES:the data mining visual cycle, or
Visual Exploratory Data Mining
Preprocessing&
transformation
Graphicengine
Data exploratio
n
visuo-spatial model
cognitive-logicmodel
Hipothesis of reality
Control &navegatio
n
DATA
Data Manipulation
Data gathering
Model manipulation
MODEL
Recap … CRISP: Methodology phases
Recap 6 .. Data visualization vs model visualization
Recap 7 … Data visualization vs model visualization
Plan
A brief introduction to data visualization
Visualization & history
Perception
Visual exploratory DM
The good, the bad & the ugly …
What type of visualization are we looking for?
Descriptive?
Exploratory?
What type of visualization are we looking for?DESCRIPTIVE
PRINCIPLES:a good visualization should...a good visualization should...
...show data and/or results...
...at different levels of detail, from the overall landscape to the fine detail
... in a coherent manner, even if we are dealing with large collections.
... avoiding, as much as possible, distortion in their representation
...focus attention in the most relevantes features...
...minimizing the impact of uninformative and misleading data
...integrating statistical results and linguistic descriptions (if possible and relevant).
DATA EXPLORATION: The CURSE of dimensionality
Most data available to us are stored in different kinds of databases and in numeric format, mostly organized in table structures (remember survey!) An extension of these are the data cubes generated by OLAP processes.
How to display multiple dimensions? Cases:Low dimensionality (1-3D)Moderate dimensionality (4-10D)High dimensionality (>10D)
What are your preferred methods for storing data for data mining? [403 votes total]
Text, CSV (comma-separated) (72) 18%
Text, space or tab separated (55) 14%
Excel (38) 9%
SAS (57) 14%
SPSS (31) 8%
S-Plus/R (15) 4%
Weka ARFF (23) 6%
Other data mining tool format (11) 3%
In a database system (93) 23%
Other - please comment (8) 2%
DATA EXPLORATION:low-moderate dimensionality <10D-moderate dimensionality <10D
Spatial coordinates 3D requires interactivity
Further pre-cognitive visual elements allow us to “add” extra dimensions:
color, movement, shape, …
Exotic solutionsGlyph*: Chernoff faces, stick-figures, whiskers...
* Un glifo es una representación gráfica de uno o varios caracteres, o de parte de un carácter. Un carácter es una entidad textual mientras que un glifo es una entidad gráfica.
… … some of those alternativessome of those alternativesGantt diagrams…
… … some of those alternativessome of those alternatives
Chernoff faces
Herman Chernoff (1973). "Using faces to represent points in k-dimensional space graphically". Journal of the American Statistical Association 68 (342): 361–368.
DATA EXPLORATION: highhigh dimensionality data dimensionality data
How do we visualize data of high (or even very high) dimensionality?
Some of the alternatives are rather straightforward… some others are not…
Eliminate dimensions (data variables): those which are redundant and / or uninformative (at least you manage to alleviate part of the problem…)
Divide & conquer: a classic: create multiple visualizations of low dimensionality.
Latent and projection models
DATA EXPLORATION: The The Grand TourGrand Tour: multiple visualization of: multiple visualization of Iris dataIris data
www.ics.uci.edu/~mlearn/MLRepository.html
TECHNIQUES: TECHNIQUES: Latency and Latency and projectionprojection
Projection Dimensionality compression
Similitude information coding
Clustering Finding grouping structure in data
Similitude information coding
Self-Organizing Map (SOM) & Generative Topographic Mapping (GTM)
They combine latent representation and clustering
TECHNIQUES: TECHNIQUES: projectionprojection
Representation in <4-D, so that the distance-neighborhood relations between multi-dimensional points are faithfully preserved
It is impossible to preserve information integrally
Some scale normalization is required
Linear vs. non-linear projections
TECHNIQUES: TECHNIQUES: projection: methodsprojection: methods
Methods based on inter-point distances, where:dx = distance in the original spacedy = distancie in the projection space h = neighborhood function
E = (dx – dy)2 MDS, PCA
E = (dx – dy)2 / dx Sammon’s projectionE = (dx – dy)2 e-dy CCAE = dx2 h(dy) SOM
... and in which we aim to minimize an inherent projection distorsion (E)
TECHNIQUES: TECHNIQUES: projection: methods in a nutshellprojection: methods in a nutshell
MDS: technique used in data visualisation for exploring similarities or dissimilarities in data. An MDS algorithm starts with a matrix of item-item similarities, then assigns a location of each item in a low-dimensional space, suitable for graphing or 3D visualisation.Taxonomy:
Metric multidimensional scaling -- assumes the input matrix is just an item-item distance matrix. Analogous to PCA, an eigenvector problem is solved to find the locations that minimize distortions to the distance matrix. Its goal is to find a Euclidean distance approximating a given distance.Generalized multidimensional scaling (GMDS) -- A superset of metric MDS that allows for the target distances to be non-Euclidean.Non-metric multidimensional scaling -- It finds a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distance between items, and the location of each item in the low-dimensional space
Biblio:Abdi, H. (2007). Metric multidimensional scaling. In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage.Kruskal, J. B., and Wish, M. (1978), Multidimensional Scaling, Sage University Paper series on Quantitative Application in the Social Sciences, 07-011. Beverly Hills and London: Sage Publications.
TECHNIQUES: TECHNIQUES: projection: methods in a nutshellprojection: methods in a nutshell
PCA: It is a linear transformation that represents the data in a new coordinate system such that the greatest variance explained by the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset by retaining only those characteristics of the dataset that contribute most to its variance.
Taxonomy:Kernel PCA
PPCA, CCA (when unfolding a nonlinear structure, Sammon's mapping cannot reproduce all distances. One way to face this problem consists in favouring local topology: CCA tries to reproduce short distances first, while long distances remain secondary.
FA
Some source code:Open Computer Vision Library @ sourceforge.net/projects/opencvlibrary/
Murtagh’s page @ http://astro.u-strasbg.fr/~fmurtagh/mda-sw/
TECHNIQUES:TECHNIQUES:projection: example projection: example
PCA Sammon’s projection
CCA
TECHNIQUES: TECHNIQUES: projection: discussion; pros & consprojection: discussion; pros & cons
Projection techniques code proximity / similarity information in spacial coordinates (plus, sometimes, extra precognitive elements such as colour ...)
They allow…
… Finding “natural” data groupings (clusters) on the basis of some sort of similarity
… Finding the “shapes” of these groupings
But ...
Projection is always limited by error and information loss
New projection coordinates are not always readily interpretable (latency by definition)
The original relations between data dimensions are lost
Quite often, the computacional effort is to be taken into account, as most of these methods are based on distances between multivariate points.
TECHNIQUES: TECHNIQUES: multiple visualizationsmultiple visualizations
How to get some of the info conveyed by observable variables back into the projections? One possibility: Using multiple visualizations.
Parallel coordinates and pre-cognitive stimuli (colour, position...)
TECHNIQUES: TECHNIQUES: SOM & GTMSOM & GTM
Self-Organizing Feature Map (or Kohonen Maps)
k-means is an special case of SOM
Discretization (in the form of network grids) and projection are simultaneously performed
Set of prototypes» modelCooperative learning (through neighbourhood function) Competitive learning (winner takes most –if not all-)
GTM is a probabilistic alternative to SOM (i.e., a form of statistical learning)
GTM is a generative model and, therefore, aims to reproduce data density distributions
It defines a proper error functionIt is a non-linear latent model that can be interpreted as a mixture model, as well.All the learning parameters can be adaptively optimized.
TECHNIQUES: TECHNIQUES: SOM & GTM: training / fittingSOM & GTM: training / fitting
The learning process for both models can be illustrated by the fisherman network simile.
TECHNIQUES: TECHNIQUES: SOM & GTM: SOM & GTM: clusteringclustering
The SOM and GTM “units” can be interpreted as micro-clusters
U-matrix (distance in local neighbourhood) or Magnification Factor (distorsion levels)
Discrete or fuzzy clusters discretos o borrosos, from local density or probability maxima
Hierarchical clustering and dendrograms
TECHNIQUES: TECHNIQUES: SOM & GTM: multiple visualizationSOM & GTM: multiple visualization
TECHNIQUES: TECHNIQUES: SOM & GTM: visualization of class SOM & GTM: visualization of class
membershipmembership
Visualization: Visualization: further exotismsfurther exotisms
Exotisms:Exotisms:Conic treesConic trees
Exotisms:Exotisms:MapscapesMapscapes
Visualization: Visualization: softwaresoftware
Visualizing data:Visualizing data:Simple and off the shelf:Simple and off the shelf:
SS&C: HeatmapsSS&C: Heatmaps®®
……Complex and off the shelf: Complex and off the shelf: TheBrain Tech. Corp.TheBrain Tech. Corp.
“This is the knowledge crisis – An ever-increasing demand for organizational knowledge coupled with an unforgiving environment in which to produce it. Currently, we have no systems to automate and capture the knowledge processes that are critical to our success.”
Woven and off the shelf: Woven and off the shelf: Ixacta Web AnalyzerIxacta Web Analyzer
Neighborhood sitemap diagram: Ixsite creates this diagram to help you visualize the relationship between the files on your site.
Woven and free: Woven and free: http://graphics.stanford.edu/http://graphics.stanford.edu/
SOM off the selve: SOM off the selve: Visumap (Visumap (www.visumap.net))Ellipse eSOM (Ellipse eSOM (www.ellipse.fi))
SOM fishing: SOM fishing:
REEFSOMREEFSOMApplied Neuroinformatics Group, Applied Neuroinformatics Group, Bielefeld University, GermanyBielefeld University, Germany
Visualization: Visualization: in summary …in summary …
In summary ...In summary ...
Which are the features of a good, successful visualization?
Show the data (exploratory element)
Focus the attention (… in the most relevant aspects)
Never forget the “human factor” in visual perception
The science of vision is the necessary framework for the visualization techniques
You have to be careful with pre-cognitive elements (position, movement, colour, shape) in visual coding of dimensions.
How to use visualization in exploratory data mining?
Visualization allows especulation and model validation.
Visualization of high-dimensional data sets can be accomplished through:
projections and clustering methods
multiple simultaneous visualizations.
Plan
A brief introduction to data visualization
Visualization & history
Perception
Visual exploratory DM
The good, the bad & the ugly …
The good ...The good ... According to According to Michael Friendly’s Michael Friendly’s Gallery of Gallery of
Data Visualization Data Visualization (Psych./York Univ.)(Psych./York Univ.)
NY weather in 1980. NYT, Jan.19812200 data pieces!!!
The good ...The good ... According to According to Michael Friendly’s Michael Friendly’s Gallery of Gallery of
Data Visualization Data Visualization (Psych./York Univ.)(Psych./York Univ.)
... And the bad and ugly... And the bad and ugly According to According to Michael Friendly’s Michael Friendly’s Gallery of Gallery of
Data Visualization Data Visualization (Psych./York Univ.)(Psych./York Univ.)
Off-campusOff-campus
Off-campusOff-campus
FRIDAY 27 - AFTERNOON
LIGHT AND DATA: A JOURNEY THROUGH THE NEW AESTHETICS OF INFORMATION ArtFutura is dedicating its first afternoon to the work of artists, scientists and designers who are developing new and innovative ways of visualizing information and giving it meaning.
18:00 hours / Room MAC -Mercat de las Flors-Andrew Vande Moere, Information Aesthetics (AUS) “Forms follow Data: An introduction to the art of data visualization” http://www.infosthetics.com/ Andrew Vande Moere is the editor of Information Aesthetics, the outstanding weblog dedicated to exploring the art and science of the dynamic representation of information. In his blog, Andrew shows and analyses artistic projects of design and investigation based on the exploration in real time of large databases and the communication, by means of innovative interfaces, of the meaningful patterns hidden within their interiors.Information Aesthetics offers an in-depth look into the exciting world of data landscapes, a discipline that having seduced artists and scientists promises too radically change our user experience in the area of information.
Off-campusOff-campus
Museo de la ciencia y la técnica de Catalunya (Terrassa)
http://www.mnactec.com/eng/index.htm
Until December 17th