A Comparative Study between
ICA and PCA
Md. Sahidul Islam
Roll No. 08054718
Department of Statistics, University of Rajshahi
Department of Statistics, University of Rajshahi-6205
Overview
Motivation of the study
Objective
Definition of ICA
FastICA algorithm
Results of the study
Latent structure
Cluster analysis
Outlier detection
Conclusions
Motivation of the study
o In multivariate statistics, latent structure detection, cluster analysis, and outlier detection using PCA are long-established techniques.
o In many cases ICA performs better than PCA.
o Our motivation in this thesis is to perform latent structure detection, cluster analysis, and outlier detection using ICA and to compare the results with those of PCA.
Objectives
o Studying the algorithms of ICA
o Applying ICA to latent structure detection, cluster analysis, and outlier detection
o Comparing its performance with that of PCA
Independent Component Analysis
The simple “Cocktail Party” Problem
Two independent sources s1, s2 are combined by an unknown mixing matrix A into the observed signals x1, x2:

x1 = a11 s1 + a12 s2
x2 = a21 s1 + a22 s2

or, in matrix form,

[x1]   [a11  a12] [s1]
[x2] = [a21  a22] [s2]

i.e. x = As.

Both PCA and ICA recover components from the observations through a linear transform y = Wᵀx; they differ in how the unmixing matrix W is chosen.
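The mixing model x = As can be sketched numerically; the sources, sample size, and mixing matrix below are illustrative assumptions, not the thesis data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical independent non-Gaussian sources
s1 = rng.uniform(-1.0, 1.0, 1000)
s2 = rng.laplace(0.0, 1.0, 1000)
S = np.vstack([s1, s2])            # source matrix s, shape (2, 1000)

A = np.array([[1.0, 0.5],          # assumed mixing matrix A
              [0.5, 1.0]])
X = A @ S                          # observations: x = As

# Each observed signal is a linear combination of both sources
print(X.shape)
```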
Nongaussianity indicates independence

Central limit theorem: the distribution of a sum of independent random variables tends toward a Gaussian distribution.

Observed signal = a1 s1 + a2 s2 + … + an sn

So each observed mixture tends toward Gaussian, while the individual sources si stay non-Gaussian.
Nongaussianity estimates the independent components

Consider y = wᵀx = wᵀAs = zᵀs, where z = Aᵀw.
Then y = zᵀs is a linear combination of the si, so by the central limit theorem it is typically more Gaussian than any single si.
zᵀs becomes least Gaussian when it equals one of the si; in that case wᵀx is exactly one independent component.
Therefore, maximizing the nongaussianity of wᵀx gives us one of the independent components.
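A small numerical check of this argument, using excess kurtosis as a simple nongaussianity measure (a common choice, not necessarily the thesis's); the sources and mixing matrix are assumptions:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)

# Two independent uniform (sub-Gaussian) sources
s = rng.uniform(-1.0, 1.0, (2, 50000))
A = np.array([[0.7, 0.7],
              [0.6, -0.6]])
x = A @ s                              # mixtures

# Excess kurtosis is 0 for a Gaussian and about -1.2 for a uniform variable.
k_src = kurtosis(s, axis=1)            # sources: strongly non-Gaussian
k_mix = kurtosis(x, axis=1)            # mixtures: closer to 0, i.e. more Gaussian
print(k_src, k_mix)
```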
FastICA algorithm
An iteration procedure for maximizing nongaussianity:

Step 1: Choose an initial weight vector w.
Step 2: Let w⁺ = E[x g(wᵀx)] − E[g′(wᵀx)] w, where g is a non-quadratic function.
Step 3: Let w = w⁺ / ||w⁺||.
Step 4: If not converged, go back to Step 2.
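The four steps above translate directly into code. A minimal sketch, assuming centered and whitened input and g = tanh (a standard choice; the thesis may use a different g):

```python
import numpy as np

def fastica_one_unit(X, max_iter=200, tol=1e-6, seed=0):
    """One-unit FastICA on whitened data X (rows = signals, cols = samples)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[0])            # Step 1: initial weight vector
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ X
        g, gp = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        # Step 2: w+ = E[x g(w^T x)] - E[g'(w^T x)] w
        w_new = (X * g).mean(axis=1) - gp.mean() * w
        w_new /= np.linalg.norm(w_new)         # Step 3: normalize
        if abs(abs(w_new @ w) - 1.0) < tol:    # Step 4: check convergence
            return w_new
        w = w_new
    return w
```

Applied to whitened mixtures, the recovered component y = wᵀX matches one source up to sign and scale; further components are found by rerunning with decorrelation against the vectors already found.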
Results and Discussions
Latent structure detection
Simulated dataset -1
Figure: Matrix plot of the 10 original source variables, each drawn from a uniform distribution.
Simulated dataset -1
Figure: (a) Matrix plot of 10 principal components. (b) Matrix plot of source variables.
Simulated dataset -1
Figure: (a) Matrix plot of 10 independent components. (b) Matrix plot of source variables
Simulated dataset-2
Simulated dataset-2 consists of 5 variables drawn from the Laplace (super-Gaussian), uniform (sub-Gaussian), binomial, multinomial, and normal distributions, each with 10,000 observations.

Figure: Matrix plot of the 5 original source variables, each from a different distribution.
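As a sketch, a dataset of the kind described above can be regenerated; the distribution parameters below are assumptions, since the thesis does not state them:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10000

# Five source variables from the five distributions named in the slide
sources = np.column_stack([
    rng.laplace(0.0, 1.0, n),                   # Laplace (super-Gaussian)
    rng.uniform(-1.0, 1.0, n),                  # uniform (sub-Gaussian)
    rng.binomial(10, 0.5, n).astype(float),     # binomial
    rng.multinomial(1, [0.2, 0.3, 0.5], n).argmax(axis=1).astype(float),  # multinomial category
    rng.normal(0.0, 1.0, n),                    # normal
])
print(sources.shape)
```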
Simulated dataset-2
Figure: (Left) Matrix plot of principal components. (Right) Original source of 5 variables, each from a different distribution.
Simulated dataset-2
Figure: (Left) Matrix plot of independent components. (Right) Original source of 5 variables, each from a different distribution.
Cluster Analysis
Australian Crabs dataset

The first real data set used for clustering is the Australian crabs data set, which has 200 rows and 8 columns, including 5 morphological measurements (frontal lobe size, rear width, carapace length, carapace width, body depth). The data set contains two species of the genus Leptograpsus, each with both sexes (male, female): 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia (N. A. Campbell et al., 1974).
Fisher Iris dataset

The second real data set is the world-famous Fisher's Iris data set, which reports four characteristics (sepal width, sepal length, petal width, and petal length) of three species (setosa, versicolor, virginica) of Iris flower.
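An illustrative sketch (not the thesis code) of comparing PCA and ICA projections of the Iris data for clustering, using scikit-learn; the choice of 2 components, 3 clusters, and the random seeds are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

# Project the 4 measurements onto 2 principal / independent components
Z_pca = PCA(n_components=2).fit_transform(X)
Z_ica = FastICA(n_components=2, random_state=0, max_iter=1000).fit_transform(X)

# Cluster each projection into 3 groups and score against the true species
for name, Z in [("PCA", Z_pca), ("ICA", Z_ica)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    print(name, adjusted_rand_score(y, labels))
```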
Outlier detection
Scottish hill racing dataset
The data give the record winning times for 35 hill races in Scotland (Atkinson, 1986). The purpose of the study was to investigate the relationship between the record times and the characteristics of the 35 hill races.
Epilepsy dataset
Thall and Vail (1990) reported data from a clinical trial of 59 patients with epilepsy, 31 of whom were randomized to receive the anti-epilepsy drug progabide and 28 to receive a placebo.
Stackloss data

These data cover 21 days of operation of a plant for the oxidation of ammonia, a stage in the production of nitric acid. The response, called stack loss, is the percentage of unabsorbed ammonia that escapes from the plant. There are three explanatory variables and one response variable in the data set.
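A minimal sketch of score-based outlier flagging of the kind used in this section, on synthetic data rather than the Stackloss data itself; the planted outlier, the PCA-style scores, and the threshold of 4 standard deviations are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Hypothetical 3-variable data with one planted outlier at row 0
X = rng.normal(size=(n, 3)) @ np.array([[1.0, 0.5, 0.0],
                                        [0.0, 1.0, 0.5],
                                        [0.0, 0.0, 1.0]])
X[0] = [8.0, -8.0, 8.0]

Xc = X - X.mean(axis=0)
# Component scores via eigendecomposition of the covariance (PCA-style);
# the same thresholding can be applied to independent-component scores.
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
scores = Xc @ vecs

# Flag rows whose standardized score exceeds 4 on any component
z = np.abs(scores - scores.mean(axis=0)) / scores.std(axis=0)
outliers = np.where((z > 4.0).any(axis=1))[0]
print(outliers)
```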
Education expenditure dataset
These data were used by Chatterjee, Hadi, and Price as an example of heteroscedasticity. The data give the education expenditures of U.S. states as projected in 1975.
Conclusions
If the subject domain supports the assumption of independent non-Gaussian source variables, we recommend using ICA in place of PCA for latent structure detection, clustering, and outlier detection.
Future Research
The following are the areas in which we want to study further:

o Using the kernel technique of ICA for shape study, clustering, and outlier detection.
o Separation of nonlinear mixtures.
o Data mining (sometimes called data or knowledge discovery) is the most recent technique in multivariate analysis for extracting information from a data set and transforming it into an understandable structure for further use. Text data mining or medical data mining using ICA would be future research.
Thank you