Improving the correlation hunting in a large quantity of SOM component planes
Classification of agro-ecological variables related to productivity in the sugar cane culture.
Miguel Barreto, Andrés Pérez-Uribe
MINISTERIO DE AGRICULTURA Y DESARROLLO RURAL
asocaña
Self Organizing Maps
Self-organizing maps (SOMs) can be seen as a data visualization technique that reduces the dimensionality of data through the use of a self-organizing clustering algorithm.
The problem that data visualization attempts to solve is that humans cannot visualize high-dimensional data.
These techniques improve the understanding of high-dimensional data by visualizing the information in a low-dimensional space.
A SOM presents high-dimensional data in a low-dimensional space by placing points that are close in the high-dimensional space close to each other in the low-dimensional space.
From a computational point of view, the self-organizing model is both a projection method, which maps a high-dimensional data space into a low-dimensional space (reduction of dimensionality), and a clustering method, so that similar data samples tend to be mapped to nearby neurons.
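As an illustration of the projection and clustering behavior described above, here is a minimal sketch of the classic online SOM update in Python with NumPy. It is not the implementation used in this work: the function names (train_som, best_matching_units), the rectangular grid, the Gaussian neighborhood, the linear decay schedules and the toy data are all assumptions made for the example.

```python
import numpy as np

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular SOM; returns a (rows * cols, n_features) codebook."""
    rng = np.random.default_rng(seed)
    n_samples = data.shape[0]
    # Grid coordinates of every neuron, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize the codebook with randomly chosen samples.
    codebook = data[rng.integers(0, n_samples, rows * cols)].astype(float)
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                      # learning rate decays to 0
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # radius shrinks toward 0.5
        x = data[rng.integers(0, n_samples)]
        # Best matching unit: the neuron whose codebook vector is closest to x.
        bmu = np.argmin(((codebook - x) ** 2).sum(axis=1))
        # Gaussian neighborhood measured on the map grid, not in data space.
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
        codebook += lr * h[:, None] * (x - codebook)
    return codebook

def best_matching_units(data, codebook):
    """Index of the closest neuron for every sample."""
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy data (made up): two well-separated groups in a 5-dimensional space.
rng = np.random.default_rng(1)
blob_a = rng.normal(0.0, 0.5, size=(50, 5))
blob_b = rng.normal(10.0, 0.5, size=(50, 5))
data = np.vstack([blob_a, blob_b])

codebook = train_som(data, rows=4, cols=4)
bmus = best_matching_units(data, codebook)
print(sorted(set(bmus[:50].tolist())), sorted(set(bmus[50:].tolist())))
```

The two printed lists are the neurons used by each group. With groups this far apart, they should not share any neuron.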
Component planes
[Figure: the SOM codebook vectors (Vector 1, Vector 2, ..., Vector n) sliced into component planes such as V1, TMAS, P1AS and Ra1AS]
To improve the analysis of the relationships between variables and/or their influence on the outputs of the system, it is possible to slice the self-organizing map in order to visualize its so-called component planes.
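Slicing a map into component planes can be sketched as follows, assuming the codebook is stored as one row per neuron in row-major grid order (an assumption of this example, as are the function name and the toy values). Each component plane is then one column of the codebook reshaped to the map grid.

```python
import numpy as np

def component_planes(codebook, rows, cols, names):
    """Slice a SOM codebook into one (rows, cols) plane per variable."""
    codebook = np.asarray(codebook, dtype=float)
    if codebook.shape != (rows * cols, len(names)):
        raise ValueError("codebook shape does not match grid size and names")
    # Column j holds the value of variable j at every neuron of the map.
    return {name: codebook[:, j].reshape(rows, cols) for j, name in enumerate(names)}

# Toy codebook for a 2 x 3 map with three variables (values are made up).
codebook = np.arange(18, dtype=float).reshape(6, 3)
planes = component_planes(codebook, 2, 3, ["V1", "TMAS", "Ra1AS"])
print(planes["TMAS"])
```

Every plane has the same shape as the map, so the same grid position in two planes refers to the same neuron.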
Example: Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of a lymphoma study (Junbai Wang et al., 2002)
a) 42 DLBCL samples. In the SOM color scale, red indicates high expression and blue indicates low expression. b) The cluster numbers represent the groups of genes contained.
Correlation hunting
The task of organizing similar component planes in order to find correlating components is called correlation hunting.
Here, the expression "correlation" does not include just linear correlations, but also nonlinear and local or partial correlations between variables.
However, when the number of components is large it is difficult to determine which planes are similar to each other.
A new SOM can be used to reorganize the component planes in order to perform the correlation hunting. The main idea is to place correlated components close to each other.
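The reorganization of the planes by a second SOM can be sketched as below. This is a simplified illustration, not the procedure used in this work. It assumes each plane is flattened and scaled to zero mean and unit norm, which makes the squared distance between two planes equal to 2 - 2r, where r is their Pearson correlation. It therefore captures only linear similarity, not the nonlinear or local correlations mentioned earlier. The function names and the toy planes are hypothetical.

```python
import numpy as np

def normalize_planes(planes):
    """Flatten each plane and scale it to zero mean and unit norm."""
    flat = np.array([np.ravel(p) for p in planes], dtype=float)
    flat -= flat.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return flat / np.where(norms == 0, 1.0, norms)

def place_planes(planes, rows, cols, n_iter=3000, seed=0):
    """Train a second SOM whose inputs are whole component planes.

    Returns the (row, col) cell assigned to each plane.
    """
    rng = np.random.default_rng(seed)
    x = normalize_planes(planes)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    codebook = x[rng.integers(0, len(x), rows * cols)].copy()
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5 * frac
        sample = x[rng.integers(0, len(x))]
        bmu = np.argmin(((codebook - sample) ** 2).sum(axis=1))
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
        codebook += lr * h[:, None] * (sample - codebook)
    # Each plane is displayed at the cell of its best matching unit.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return [(int(b // cols), int(b % cols)) for b in d2.argmin(axis=1)]

# Toy planes (made up) on a 5 x 5 map.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:5, 0:5].astype(float)
planes = [
    xx,                                   # increases left to right
    2.0 * xx + 3.0,                       # same pattern, different scale
    xx + rng.normal(0, 0.2, xx.shape),    # noisy copy of the first pattern
    yy,                                   # increases top to bottom
    yy + rng.normal(0, 0.2, yy.shape),    # noisy copy of the fourth pattern
    -yy,                                  # inverse of the fourth pattern
]
cells = place_planes(planes, rows=3, cols=3)
for i, cell in enumerate(cells):
    print(f"plane {i} -> cell {cell}")
```

Planes that differ only by a positive scale and offset become identical after normalization, so they land in the same cell.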
An advantage of using a SOM for component plane projection is that the placements of the component planes can be shown on a regular grid. In addition, an ordered presentation of similar components is automatically generated. A disadvantage is that the choice of grouping variables is left to the user.
More component planes …
Heart disease
279 component planes
This database contains 13 attributes (which have been extracted from a larger set of 75).
Clustering of SOM component planes based on the SOM distance matrix
The U-matrix has been used as an effective cluster distance function. The U-matrix visualizes the distances between each map unit and its neighbors, which makes it possible to visualize the cluster structure of the SOM.
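A simplified U-matrix can be sketched as follows. This version assumes a rectangular grid with 4-connected neighbors and stores one value per unit: the mean distance to its neighbors. Fuller variants of the U-matrix also keep the individual unit-to-unit distances. The function name and the toy codebook are made up for the example.

```python
import numpy as np

def u_matrix(codebook, rows, cols):
    """Mean distance from each unit's codebook vector to its 4-connected neighbors."""
    w = np.asarray(codebook, dtype=float).reshape(rows, cols, -1)
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(w[r, c] - w[rr, cc]))
            # A 1 x 1 map has no neighbors; report 0 in that case.
            u[r, c] = np.mean(dists) if dists else 0.0
    return u

# Toy 1 x 4 map with one variable: two pairs of similar units.
codebook = np.array([[0.0], [0.1], [5.0], [5.1]])
u = u_matrix(codebook, rows=1, cols=4)
print(u)
```

High values mark units that sit on a border between clusters, and low values mark units inside a cluster.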
Using Vellido's algorithm to partition the map
Vellido's algorithm is used to obtain different partitioning levels of the clustering of the SOM. It provides a partitioning of the map into a set of base clusters, whose number is equal to the number of local minima of the U-matrix. This allows different levels of clustering.
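The link between local minima of the U-matrix and base clusters can be illustrated with the sketch below. It is not claimed to be Vellido's procedure: it only assumes that every unit is assigned to the local minimum reached by steepest descent over its 4-connected neighbors, so that the number of base clusters equals the number of local minima. The function names and the toy U-matrix are hypothetical, and only the base level of the partition is shown.

```python
import numpy as np

def _neighbors(r, c, rows, cols):
    """Yield the 4-connected neighbors of cell (r, c) that lie inside the grid."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc

def base_clusters(u):
    """Label every unit with the local minimum reached by steepest descent on u."""
    u = np.asarray(u, dtype=float)
    rows, cols = u.shape
    labels = -np.ones((rows, cols), dtype=int)
    minima = []
    for r0 in range(rows):
        for c0 in range(cols):
            r, c, path = r0, c0, []
            while labels[r, c] < 0:
                path.append((r, c))
                best = min(_neighbors(r, c, rows, cols),
                           key=lambda p: u[p], default=(r, c))
                if u[best] >= u[r, c]:
                    # No strictly lower neighbor: this unit is a local minimum.
                    # Note: tied units on a flat plateau each count as a minimum.
                    minima.append((r, c))
                    labels[r, c] = len(minima) - 1
                    break
                r, c = best
            for p in path:
                labels[p] = labels[r, c]
    return labels, minima

# Toy U-matrix (made up): a ridge of high values separates two valleys.
u = np.array([[0.2, 0.9, 0.8, 0.3],
              [0.1, 1.0, 0.7, 0.2],
              [0.3, 0.9, 0.6, 0.4]])
labels, minima = base_clusters(u)
print(labels)
print(minima)
```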
Tree-structured component planes clusters representation
Case study: sugar cane culture
The agricultural productivity of a geographic area depends on many agro-ecological variables, such as soil and terrain characteristics, climatic constraints, human behavior and management.
[Diagram: Soil, Management, Climate, Genotype and Productivity]
A new approach
1358 experiments
Sowing, Growing, Harvest
Soil, Management, Climate, Genotype
Each agroecological event is unique in time and space, but different events share characteristics. These similarities make it possible to find similar behaviors and thus to discover why and how the agroecological variables affect crop development and, therefore, agricultural productivity.
The variables
Climate variables. Continuous data.
  Average Temperature (TempAvg) / After seed (AS) / Before Harvest (BH)
  Average Relative Humidity (RHAvg) / After seed (AS) / Before Harvest (BH)
  Radiation (Rad) / After seed (AS) / Before Harvest (BH)
  Precipitation (Prec) / After seed (AS) / Before Harvest (BH)
Soil variables.
  Order (Ord) / 3 orders (Ord1, Ord2, Ord3). Nominal data.
  Texture (Tex). Ordinal data.
  Depth (Dee). Ordinal data.
Topographic variables.
  Landscape (Ls) / 3 landscapes (Ls1, Ls2, Ls3). Nominal data.
  Slope (Sl). Ordinal data.
Other variables.
  Water Balance (WB). Ordinal data.
  Variety (Var) / 3 varieties (V1, V2, V3). Nominal data.
Production
Total: 54
Months After Seed (AS): 1, 2, 3, 4
Months Before Harvest (BH): 1, 2, 3, 4
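The nominal variables above appear as several components each (for example Ord1, Ord2, Ord3). One plausible reading, assumed here and not confirmed by the slides, is that each category becomes its own 0/1 component. The sketch below shows that encoding. The category labels "orderA", "orderB" and "orderC" are placeholders, since the actual soil orders are not named.

```python
import numpy as np

def one_hot(values, categories, prefix):
    """Return (column names, 0/1 matrix) with one column per category."""
    index = {cat: j for j, cat in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for i, v in enumerate(values):
        out[i, index[v]] = 1.0
    names = [f"{prefix}{j + 1}" for j in range(len(categories))]
    return names, out

# Placeholder soil orders for four hypothetical experiments.
soil_order = ["orderB", "orderA", "orderC", "orderB"]
names, encoded = one_hot(soil_order, ["orderA", "orderB", "orderC"], prefix="Ord")
print(names)
print(encoded)
```

With this encoding, each category gets its own component plane on the trained map.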
Component planes / sugar cane culture
Tree-structured component planes clusters representation / sugar cane culture
Classification of agro-ecological variables related to productivity (initial analysis)
[Figure: Ra1BH, Ra1AS and Productivity BMUs. X axis: Best Matching Units (1 to 116). Y axis: magnitude of the variables (-0.4 to 1). Series: Ra1AS, Ra1BH, Productivity.]
BMUs of the component planes: productivity, radiation 1 month before harvest (Ra1BH) and radiation 1 month after seed (Ra1AS).
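A comparison of this kind can be sketched as follows, assuming that the plotted profile is the codebook value of each selected component at every unit that is the best matching unit of at least one experiment. That reading, the function name and all numbers below are assumptions made for illustration. None of the values come from the study.

```python
import numpy as np

def bmu_profiles(codebook, data, names, selected):
    """Values of the selected components at every unit that is a BMU of some sample."""
    codebook = np.asarray(codebook, dtype=float)
    data = np.asarray(data, dtype=float)
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    hit_units = np.unique(d2.argmin(axis=1))
    cols = [names.index(n) for n in selected]
    return hit_units, codebook[np.ix_(hit_units, cols)]

names = ["Ra1AS", "Ra1BH", "Productivity"]
# Made-up codebook with 4 units and made-up samples.
codebook = np.array([[0.0, 0.1, 0.2],
                     [0.3, 0.2, 0.4],
                     [0.6, 0.7, 0.5],
                     [0.9, 0.8, 1.0]])
data = np.array([[0.05, 0.10, 0.20],
                 [0.60, 0.65, 0.50],
                 [0.85, 0.80, 0.95],
                 [0.90, 0.80, 1.00]])

hit_units, profiles = bmu_profiles(codebook, data, names, ["Ra1AS", "Productivity"])
print(hit_units)
print(profiles)
# Linear correlation between the two profiles across the hit units.
print(np.corrcoef(profiles.T))
```

Plotting each column of profiles against hit_units gives a chart with one line per component.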
Conclusions
Visualization of agroecological variables is very important but difficult due to the high dimensionality of the data. The SOM algorithm is a powerful technique able to deal with this problem, but it is used for exploratory analysis.
This study presents a methodology to enhance the component planes analysis process. This methodology improves the correlation hunting in the component planes with a tree-structured cluster representation based on the SOM distance matrix.
By analyzing the obtained groups of agro-ecological variables and cultivated zones, it was possible, as an example of the application of the methodology, to find a relationship between the radiation after seed, the radiation before harvest, and a high-medium productivity.
We are looking forward to developing data mining and visualization techniques, based on the aforementioned methodology, in order to improve decision support in the sugar cane culture.
The end
Thanks for new ideas and directions to explore!