Data Mining and visualization (2) Alfredo Vellido

Upload: brian-watkins

Post on 29-Dec-2015

TRANSCRIPT

Page 1: Data Mining and visualization (2) Alfredo Vellido

Data Mining and visualization

(2)

Alfredo Vellido

Page 2: Data Mining and visualization (2) Alfredo Vellido

Plan

A brief introduction to data visualization

Visualization & history

Perception

Visual exploratory DM

The good, the bad & the ugly …

Page 3: Data Mining and visualization (2) Alfredo Vellido

Visualization recap

Page 4: Data Mining and visualization (2) Alfredo Vellido

Recap … PRINCIPLES: the data mining visual cycle, or Visual Exploratory Data Mining

[Diagram labels: DATA; data gathering; data manipulation; preprocessing & transformation; graphic engine; data exploration; control & navigation; visuo-spatial model; cognitive-logic model; hypothesis of reality; model manipulation; MODEL]

Page 5: Data Mining and visualization (2) Alfredo Vellido

Recap … CRISP: Methodology phases

Page 6: Data Mining and visualization (2) Alfredo Vellido

Recap … Data visualization vs model visualization

Page 7: Data Mining and visualization (2) Alfredo Vellido

Recap … Data visualization vs model visualization

Page 8: Data Mining and visualization (2) Alfredo Vellido

Plan

A brief introduction to data visualization

Visualization & history

Perception

Visual exploratory DM

The good, the bad & the ugly …

Page 9: Data Mining and visualization (2) Alfredo Vellido

What type of visualization are we looking for?

Descriptive?

Exploratory?

Page 10: Data Mining and visualization (2) Alfredo Vellido

What type of visualization are we looking for? DESCRIPTIVE

Page 11: Data Mining and visualization (2) Alfredo Vellido

PRINCIPLES: a good visualization should...

...show data and/or results...

...at different levels of detail, from the overall landscape to the fine detail

... in a coherent manner, even if we are dealing with large collections.

... avoiding, as much as possible, distortion in their representation

...focus attention on the most relevant features...

...minimizing the impact of uninformative and misleading data

...integrating statistical results and linguistic descriptions (if possible and relevant).

Page 12: Data Mining and visualization (2) Alfredo Vellido

DATA EXPLORATION: The CURSE of dimensionality

Most data available to us are stored in different kinds of databases and in numeric format, mostly organized in table structures (remember survey!) An extension of these are the data cubes generated by OLAP processes.

How to display multiple dimensions? Cases:

Low dimensionality (1-3D)

Moderate dimensionality (4-10D)

High dimensionality (>10D)

What are your preferred methods for storing data for data mining? [403 votes total]

Text, CSV (comma-separated) (72) 18%

Text, space or tab separated (55) 14%

Excel (38) 9%

SAS (57) 14%

SPSS (31) 8%

S-Plus/R (15) 4%

Weka ARFF (23) 6%

Other data mining tool format (11) 3%

In a database system (93) 23%

Other - please comment (8) 2%

Page 13: Data Mining and visualization (2) Alfredo Vellido

DATA EXPLORATION: low-moderate dimensionality (<10D)

Spatial coordinates (3D requires interactivity)

Further pre-cognitive visual elements allow us to “add” extra dimensions:

color, movement, shape, …

Exotic solutions: glyphs*: Chernoff faces, stick figures, whiskers...

* A glyph is a graphical representation of one or more characters, or of part of a character. A character is a textual entity, whereas a glyph is a graphical entity.

Page 14: Data Mining and visualization (2) Alfredo Vellido

… some of those alternatives: Gantt diagrams…

Page 15: Data Mining and visualization (2) Alfredo Vellido

… some of those alternatives

Chernoff faces

Herman Chernoff (1973). "Using faces to represent points in k-dimensional space graphically". Journal of the American Statistical Association 68 (342): 361–368.

Page 16: Data Mining and visualization (2) Alfredo Vellido

DATA EXPLORATION: high dimensionality data

How do we visualize data of high (or even very high) dimensionality?

Some of the alternatives are rather straightforward… some others are not…

Eliminate dimensions (data variables): those which are redundant and / or uninformative (at least you manage to alleviate part of the problem…)

Divide & conquer, a classic: create multiple visualizations of low dimensionality.

Latent and projection models
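The first alternative above (eliminating redundant or uninformative variables) can be sketched as follows. This is a minimal illustration, not a method prescribed by the slides: the function name `select_variables`, the variance and correlation thresholds, and the toy data are all assumptions. A variable is treated as uninformative if it is almost constant, and as redundant if it is almost perfectly correlated with a variable that has already been kept.

```python
import math


def column(rows, j):
    """Values of variable j across all rows."""
    return [r[j] for r in rows]


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation; assumes neither variable is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_variables(rows, min_var=1e-8, max_abs_corr=0.95):
    """Indices of the variables kept (thresholds are illustrative assumptions)."""
    kept = []
    for j in range(len(rows[0])):
        xj = column(rows, j)
        if variance(xj) <= min_var:
            continue  # uninformative: (almost) constant
        if any(abs(pearson(xj, column(rows, k))) >= max_abs_corr for k in kept):
            continue  # redundant: duplicates a variable already kept
        kept.append(j)
    return kept


# Toy data: variable 1 is twice variable 0, variable 2 is constant.
data = [
    [1.0, 2.0, 5.0, 0.3],
    [2.0, 4.0, 5.0, 0.9],
    [3.0, 6.0, 5.0, 0.1],
    [4.0, 8.0, 5.0, 0.7],
]
kept = select_variables(data)
print(kept)
```

The order in which variables are scanned decides which member of a correlated pair survives, so in practice the choice would be guided by domain knowledge.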

Page 17: Data Mining and visualization (2) Alfredo Vellido

DATA EXPLORATION: The Grand Tour: multiple visualization of Iris data

www.ics.uci.edu/~mlearn/MLRepository.html
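As a rough sketch of the idea behind a tour of multiple views, the code below draws several random 2-D orthonormal frames and projects four-variable rows onto each of them. It is an assumption-laden simplification: a real Grand Tour interpolates smoothly between frames instead of jumping between independent ones, and the names `random_frame` and `project`, as well as the three sample rows, are illustrative only.

```python
import math
import random


def random_frame(dim, rng):
    """Two random orthonormal vectors in `dim` dimensions (Gram-Schmidt)."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    b = unit([y - dot * x for x, y in zip(a, b)])  # remove the component along a
    return a, b


def project(rows, frame):
    """2-D coordinates of each row in the plane spanned by the frame."""
    a, b = frame
    return [
        (sum(x * ai for x, ai in zip(r, a)), sum(x * bi for x, bi in zip(r, b)))
        for r in rows
    ]


rng = random.Random(0)
# Three illustrative rows with four measurements each (Iris-like layout).
rows = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]]
views = [project(rows, random_frame(4, rng)) for _ in range(3)]
```

Each element of `views` is one low-dimensional scatter plot of the same data, which is the "multiple visualization" the slide refers to.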

Page 18: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: Latency and projection

Projection: dimensionality compression

Similarity information coding

Clustering: finding grouping structure in data

Similarity information coding

Self-Organizing Map (SOM) & Generative Topographic Mapping (GTM)

They combine latent representation and clustering

Page 19: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: projection

Representation in fewer than 4 dimensions, so that the distance and neighbourhood relations between multi-dimensional points are preserved as faithfully as possible

It is impossible to preserve all of the information

Some scale normalization is required

Linear vs. non-linear projections
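The need for scale normalization can be illustrated with a small sketch. It assumes z-score standardization (one common choice, not necessarily the one the author had in mind) and uses made-up rows in which the first variable is on a much larger scale than the second, so that raw Euclidean distances are dominated by it.

```python
import math


def standardize(rows):
    """Z-score each variable: subtract its mean, divide by its standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)
    ]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(d)]
        for r in rows
    ]


# Hypothetical rows: (large-scale variable, small-scale variable).
rows = [[20000.0, 25.0], [21000.0, 60.0], [40000.0, 26.0]]
z = standardize(rows)

# Before scaling, the distance is driven almost entirely by variable 0.
raw_01 = math.dist(rows[0], rows[1])
raw_02 = math.dist(rows[0], rows[2])
# After scaling, both variables contribute on comparable terms.
z_01 = math.dist(z[0], z[1])
z_02 = math.dist(z[0], z[2])
```

In the raw data the third row is roughly twenty times farther from the first than the second row is; after standardization the two distances are of similar size.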

Page 20: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: projection: methods

Methods based on inter-point distances, where:

$d_x$ = distance in the original space

$d_y$ = distance in the projection space

$h$ = neighborhood function

$E = (d_x - d_y)^2$ (MDS, PCA)

$E = (d_x - d_y)^2 / d_x$ (Sammon’s projection)

$E = (d_x - d_y)^2 \, e^{-d_y}$ (CCA)

$E = d_x^2 \, h(d_y)$ (SOM)

... and in which we aim to minimize an inherent projection distortion ($E$)
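To make the distortion measures above concrete, the following sketch evaluates three of them for a given projection. It assumes that the per-pair terms are summed over all pairs of points (the slide shows only the per-pair term), and the function and weight names are illustrative. The toy projection simply drops the third coordinate of four 3-D points.

```python
import math
from itertools import combinations


def distortion(X, Y, weight):
    """Sum over all point pairs of weight(dx, dy) * (dx - dy)**2."""
    total = 0.0
    for i, j in combinations(range(len(X)), 2):
        dx = math.dist(X[i], X[j])  # distance in the original space
        dy = math.dist(Y[i], Y[j])  # distance in the projection space
        total += weight(dx, dy) * (dx - dy) ** 2
    return total


def mds_weight(dx, dy):
    return 1.0  # every pair counts equally


def sammon_weight(dx, dy):
    return 1.0 / dx  # favours pairs that are close in the original space


def cca_weight(dx, dy):
    return math.exp(-dy)  # favours pairs that are close in the projection


X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
Y = [(x, y) for x, y, _ in X]  # naive projection: discard the third coordinate

e_mds = distortion(X, Y, mds_weight)
e_sammon = distortion(X, Y, sammon_weight)
e_cca = distortion(X, Y, cca_weight)
```

The Sammon weight requires all original points to be distinct, since it divides by $d_x$. A projection method would search for the `Y` that minimizes one of these quantities; here `Y` is fixed and only evaluated.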

Page 21: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: projection: methods in a nutshell

MDS: a technique used in data visualisation for exploring similarities or dissimilarities in data. An MDS algorithm starts with a matrix of item-item similarities, then assigns a location to each item in a low-dimensional space, suitable for graphing or 3D visualisation.

Taxonomy:

Metric multidimensional scaling: assumes the input matrix is just an item-item distance matrix. Analogous to PCA, an eigenvector problem is solved to find the locations that minimize distortions to the distance matrix. Its goal is to find a Euclidean distance approximating a given distance.

Generalized multidimensional scaling (GMDS): a superset of metric MDS that allows for the target distances to be non-Euclidean.

Non-metric multidimensional scaling: it finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space.

Biblio:

Abdi, H. (2007). Metric multidimensional scaling. In N.J. Salkind (Ed.): Encyclopedia of Measurement and Statistics. Thousand Oaks (CA): Sage.

Kruskal, J. B., and Wish, M. (1978). Multidimensional Scaling. Sage University Paper series on Quantitative Applications in the Social Sciences, 07-011. Beverly Hills and London: Sage Publications.
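The eigenvector formulation of metric MDS mentioned above can be sketched with the classical (Torgerson) variant: square the distances, double-centre them, and take the leading eigenvectors. This is a minimal standard-library illustration under two assumptions: the input distances are Euclidean, and the leading eigenvalues are distinct, so that plain power iteration with deflation is adequate. The helper names are made up for the example.

```python
import math
import random


def power_iteration(A, iters=200, seed=0):
    """Leading eigenvalue and eigenvector of a symmetric matrix A."""
    n = len(A)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(Av, v)), v


def classical_mds(D, k=2):
    """Coordinates in k dimensions from a symmetric distance matrix D."""
    n = len(D)
    D2 = [[d * d for d in row] for row in D]
    row_mean = [sum(row) / n for row in D2]
    grand_mean = sum(row_mean) / n
    # Double centring turns squared distances into inner products.
    B = [
        [-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]
    coords = [[0.0] * k for _ in range(n)]
    for c in range(k):
        lam, v = power_iteration(B, seed=c)
        lam = max(lam, 0.0)
        for i in range(n):
            coords[i][c] = math.sqrt(lam) * v[i]
        # Deflation: remove the component just extracted.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Distances between the corners of a 3 x 4 rectangle.
D = [
    [0.0, 3.0, 5.0, 4.0],
    [3.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 3.0],
    [4.0, 5.0, 3.0, 0.0],
]
Y = classical_mds(D, k=2)
```

The recovered layout matches the original rectangle only up to rotation, reflection and translation, since distances carry no information about those.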

Page 22: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: projection: methods in a nutshell

PCA: a linear transformation that represents the data in a new coordinate system such that the direction of greatest variance in the data lies along the first coordinate (called the first principal component), the direction of second greatest variance along the second coordinate, and so on. PCA can be used for dimensionality reduction by retaining only the leading components, which account for most of the variance in the dataset.
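As a small illustration of the description above, the sketch below finds the first principal component of a toy dataset by power iteration on the covariance matrix, and reports the fraction of total variance it accounts for. The function name, the iteration count and the data are assumptions made for the example; a production implementation would normally rely on a library eigensolver or SVD.

```python
import math
import random


def first_principal_component(rows, iters=200, seed=0):
    """Return (direction, explained variance ratio, scores) of the first PC."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centred) / n for b in range(d)] for a in range(d)
    ]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iters):  # power iteration converges to the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total = sum(cov[a][a] for a in range(d))
    scores = [sum(c[a] * v[a] for a in range(d)) for c in centred]
    return v, variance / total, scores


# Toy data lying close to the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
pc, ratio, scores = first_principal_component(rows)
```

Here `scores` is the one-dimensional representation of the data; because nearly all the variance lies along one direction, little is lost by discarding the second coordinate.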

Taxonomy:

Kernel PCA

PPCA

CCA (when unfolding a nonlinear structure, Sammon's mapping cannot reproduce all distances. One way to address this problem is to favour local topology: CCA tries to reproduce short distances first, while long distances remain secondary.)

FA

Some source code:

Open Computer Vision Library @ sourceforge.net/projects/opencvlibrary/

Murtagh’s page @ http://astro.u-strasbg.fr/~fmurtagh/mda-sw/

Page 23: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: projection: example

[Figure panels: PCA; Sammon’s projection; CCA]

Page 24: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: projection: discussion; pros & cons

Projection techniques encode proximity / similarity information in spatial coordinates (plus, sometimes, extra precognitive elements such as colour ...)

They allow…

… Finding “natural” data groupings (clusters) on the basis of some sort of similarity

… Finding the “shapes” of these groupings

But ...

Projection is always limited by error and information loss

New projection coordinates are not always readily interpretable (latency by definition)

The original relations between data dimensions are lost

Quite often, the computational effort has to be taken into account, as most of these methods are based on distances between multivariate points.

Page 25: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: multiple visualizations

How to get some of the info conveyed by observable variables back into the projections? One possibility: Using multiple visualizations.

Parallel coordinates and pre-cognitive stimuli (colour, position...)
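A parallel coordinates plot can be prepared with very little code. The sketch below, which assumes min-max scaling of each variable (a common but not mandatory choice), returns for every data row the vertices of the polyline to be drawn across the parallel axes. The function name and the sample rows are illustrative; actual drawing is left to whatever plotting tool is at hand.

```python
def parallel_coordinates(rows):
    """Per row, the polyline vertices (axis index, value scaled to [0, 1])."""
    d = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(d)]
    hi = [max(r[j] for r in rows) for j in range(d)]

    def scale(value, j):
        if hi[j] == lo[j]:
            return 0.5  # constant variable: place it mid-axis
        return (value - lo[j]) / (hi[j] - lo[j])

    return [[(j, scale(r[j], j)) for j in range(d)] for r in rows]


# Hypothetical rows with three variables on very different scales.
rows = [[5.0, 100.0, 1.0], [7.0, 300.0, 3.0], [6.0, 200.0, 2.0]]
lines = parallel_coordinates(rows)
```

A class label or cluster index can then be mapped to the colour of each polyline, which is how the pre-cognitive stimuli mentioned above add a further dimension to the display.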

Page 26: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: SOM & GTM

Self-Organizing Feature Map (or Kohonen Maps)

k-means is a special case of SOM (one in which units do not interact through a neighbourhood)

Discretization (in the form of network grids) and projection are simultaneously performed

Set of prototypes » model

Cooperative learning (through the neighbourhood function)

Competitive learning (winner takes most, if not all)

GTM is a probabilistic alternative to SOM (i.e., a form of statistical learning)

GTM is a generative model and, therefore, aims to reproduce data density distributions

It defines a proper error function.

It is a non-linear latent model that can be interpreted as a mixture model, as well.

All the learning parameters can be adaptively optimized.
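The competitive and cooperative steps of SOM learning listed above can be sketched as follows for a one-dimensional chain of units. This is a minimal illustration, not the author's implementation: the Gaussian neighbourhood, the linear decay schedules, the initialization at random data points and all parameter values are assumptions. GTM is not covered here, since it is fitted by a different (EM-based) procedure.

```python
import math
import random


def train_som(data, n_units=5, epochs=200, seed=0):
    """Train a 1-D SOM (a chain of units) and return its prototype vectors."""
    rng = random.Random(seed)
    # Prototypes start at randomly chosen data points (an assumption).
    W = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac) + 0.01                  # decaying learning rate
        sigma = max(n_units / 2.0 * (1.0 - frac), 0.3)  # shrinking neighbourhood
        for x in rng.sample(data, len(data)):
            # Competitive step: find the best-matching unit (BMU).
            bmu = min(range(n_units), key=lambda u: math.dist(W[u], x))
            # Cooperative step: the BMU and its grid neighbours move towards x.
            for u in range(n_units):
                h = math.exp(-((u - bmu) ** 2) / (2.0 * sigma ** 2))
                W[u] = [w + lr * h * (a - w) for w, a in zip(W[u], x)]
    return W


# Toy data: 20 points along the diagonal of the unit square.
data = [(i / 19.0, i / 19.0) for i in range(20)]
prototypes = train_som(data)
```

If the neighbourhood width were pushed towards zero, only the winning unit would move at each step, which is the sense in which an online form of k-means appears as a special case.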

Page 27: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: SOM & GTM: training / fitting

The learning process for both models can be illustrated by the simile of a fisherman's net.

Page 28: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: SOM & GTM: clustering

The SOM and GTM “units” can be interpreted as micro-clusters

U-matrix (distance in local neighbourhood) or Magnification Factor (distortion levels)

Discrete or fuzzy clusters, from local density or probability maxima

Hierarchical clustering and dendrograms
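The U-matrix idea mentioned above (distance in the local neighbourhood of each unit) can be sketched for a rectangular grid of prototypes. The version below assumes a 4-connected neighbourhood and uses the mean distance to those neighbours, which is one of several conventions in use; the function name and the toy grid are illustrative.

```python
import math


def u_matrix(W):
    """W[r][c] is the prototype of grid unit (r, c).

    Returns, for each unit, the mean distance between its prototype and the
    prototypes of its 4-connected grid neighbours.
    """
    n_rows, n_cols = len(W), len(W[0])
    U = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            dists = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    dists.append(math.dist(W[r][c], W[rr][cc]))
            U[r][c] = sum(dists) / len(dists) if dists else 0.0
    return U


# Toy 2 x 4 grid: the two left columns and the two right columns of units
# sit in well-separated regions of the data space.
W = [
    [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0), (10.0, 1.0), (11.0, 1.0)],
]
U = u_matrix(W)
```

High values of `U` form a ridge between the second and third columns, marking the boundary between two groups of micro-clusters; low values mark their interiors.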

Page 29: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: SOM & GTM: multiple visualization

Page 30: Data Mining and visualization (2) Alfredo Vellido

TECHNIQUES: SOM & GTM: visualization of class membership

Page 31: Data Mining and visualization (2) Alfredo Vellido

Visualization: further exotica

Page 32: Data Mining and visualization (2) Alfredo Vellido

Exotica: Cone trees

Page 33: Data Mining and visualization (2) Alfredo Vellido

Exotica: Mapscapes

Page 34: Data Mining and visualization (2) Alfredo Vellido

Visualization: software

Page 35: Data Mining and visualization (2) Alfredo Vellido

Visualizing data: Simple and off the shelf:

SS&C: Heatmaps®

Page 36: Data Mining and visualization (2) Alfredo Vellido

… Complex and off the shelf: TheBrain Tech. Corp.

“This is the knowledge crisis – An ever-increasing demand for organizational knowledge coupled with an unforgiving environment in which to produce it. Currently, we have no systems to automate and capture the knowledge processes that are critical to our success.”

Page 37: Data Mining and visualization (2) Alfredo Vellido

Woven and off the shelf: Ixacta Web Analyzer

Neighborhood sitemap diagram: Ixsite creates this diagram to help you visualize the relationship between the files on your site.

Page 38: Data Mining and visualization (2) Alfredo Vellido

Woven and free: http://graphics.stanford.edu/

Page 39: Data Mining and visualization (2) Alfredo Vellido

SOM off the shelf: Visumap (www.visumap.net), Ellipse eSOM (www.ellipse.fi)

Page 40: Data Mining and visualization (2) Alfredo Vellido

SOM fishing:

REEFSOM, Applied Neuroinformatics Group, Bielefeld University, Germany

Page 41: Data Mining and visualization (2) Alfredo Vellido

Visualization: in summary …

Page 42: Data Mining and visualization (2) Alfredo Vellido

In summary ...

Which are the features of a good, successful visualization?

Show the data (exploratory element)

Focus the attention (… on the most relevant aspects)

Never forget the “human factor” in visual perception

The science of vision is the necessary framework for the visualization techniques

You have to be careful with pre-cognitive elements (position, movement, colour, shape) in visual coding of dimensions.

How to use visualization in exploratory data mining?

Visualization allows speculation and model validation.

Visualization of high-dimensional data sets can be accomplished through:

projections and clustering methods

multiple simultaneous visualizations.

Page 43: Data Mining and visualization (2) Alfredo Vellido

Plan

A brief introduction to data visualization

Visualization & history

Perception

Visual exploratory DM

The good, the bad & the ugly …

Page 44: Data Mining and visualization (2) Alfredo Vellido

The good ... According to Michael Friendly’s Gallery of Data Visualization (Psych./York Univ.)

NY weather in 1980. NYT, Jan. 1981

2200 data pieces!!!

Page 45: Data Mining and visualization (2) Alfredo Vellido

The good ... According to Michael Friendly’s Gallery of Data Visualization (Psych./York Univ.)

Page 46: Data Mining and visualization (2) Alfredo Vellido

... And the bad and ugly. According to Michael Friendly’s Gallery of Data Visualization (Psych./York Univ.)

Page 47: Data Mining and visualization (2) Alfredo Vellido

Off-campus

Page 48: Data Mining and visualization (2) Alfredo Vellido

Off-campus

FRIDAY 27 - AFTERNOON

LIGHT AND DATA: A JOURNEY THROUGH THE NEW AESTHETICS OF INFORMATION ArtFutura is dedicating its first afternoon to the work of artists, scientists and designers who are developing new and innovative ways of visualizing information and giving it meaning.

18:00 hours / Room MAC, Mercat de les Flors

Andrew Vande Moere, Information Aesthetics (AUS): “Forms follow Data: An introduction to the art of data visualization” http://www.infosthetics.com/

Andrew Vande Moere is the editor of Information Aesthetics, the outstanding weblog dedicated to exploring the art and science of the dynamic representation of information. In his blog, Andrew shows and analyses artistic design and research projects based on the real-time exploration of large databases and on the communication, by means of innovative interfaces, of the meaningful patterns hidden within them.

Information Aesthetics offers an in-depth look into the exciting world of data landscapes, a discipline that, having seduced artists and scientists, promises to radically change our user experience in the area of information.