A Comparative Study between
ICA and PCA
Md. Sahidul Islam
Roll No. 08054718
Department of Statistics, University of Rajshahi
Department of Statistics, University of Rajshahi-6205
Overview
Motivation of the study
Objective
Definition of ICA
FastICA algorithm
Results of the study
Latent structure
Cluster analysis
Outlier detection
Conclusions
Motivation of the study
o In multivariate statistics, latent structure detection, cluster analysis, and outlier detection using PCA are long-established techniques.
o In many cases ICA performs better than PCA.
o Our motivation in this thesis is to perform latent structure detection, cluster analysis, and outlier detection using ICA and to compare the results with those of PCA.
Objectives
o Studying the algorithms of ICA
o Applying ICA to latent structure detection, cluster analysis, and outlier detection
o Comparing its performance with that of PCA
Independent Component Analysis
The simple “Cocktail Party” Problem
Two independent sources s1, s2 are combined by an unknown mixing matrix A into the observed signals x1, x2:

x1 = a11 s1 + a12 s2
x2 = a21 s1 + a22 s2

or, in matrix form,

[x1]   [a11  a12] [s1]
[x2] = [a21  a22] [s2]

i.e. x = As.

Both PCA and ICA recover components from the observations through a linear transform y = Wᵀx; they differ in how the unmixing matrix W is chosen.
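The mixing model x = As can be sketched numerically; the sources, sample size, and mixing matrix below are illustrative assumptions, not the thesis data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical independent non-Gaussian sources
s1 = rng.uniform(-1.0, 1.0, 1000)
s2 = rng.laplace(0.0, 1.0, 1000)
S = np.vstack([s1, s2])            # source matrix s, shape (2, 1000)

A = np.array([[1.0, 0.5],          # assumed mixing matrix A
              [0.5, 1.0]])
X = A @ S                          # observations: x = As

# Each observed signal is a linear combination of both sources
print(X.shape)
```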
Nongaussianity indicates independence

Central limit theorem: the distribution of a sum of independent random variables tends toward a Gaussian distribution.

Observed signal = a1 s1 + a2 s2 + … + an sn

So each observed mixture tends toward Gaussian, while the individual sources si stay non-Gaussian.
Nongaussianity estimates the independent components

Consider y = wᵀx = wᵀAs = zᵀs, where z = Aᵀw.
Then y = zᵀs is a linear combination of the si, so by the central limit theorem it is typically more Gaussian than any single si.
zᵀs becomes least Gaussian when it equals one of the si; in that case wᵀx is exactly one independent component.
Therefore, maximizing the nongaussianity of wᵀx gives us one of the independent components.
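A small numerical check of this argument, using excess kurtosis as a simple nongaussianity measure (a common choice, not necessarily the thesis's); the sources and mixing matrix are assumptions:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)

# Two independent uniform (sub-Gaussian) sources
s = rng.uniform(-1.0, 1.0, (2, 50000))
A = np.array([[0.7, 0.7],
              [0.6, -0.6]])
x = A @ s                              # mixtures

# Excess kurtosis is 0 for a Gaussian and about -1.2 for a uniform variable.
k_src = kurtosis(s, axis=1)            # sources: strongly non-Gaussian
k_mix = kurtosis(x, axis=1)            # mixtures: closer to 0, i.e. more Gaussian
print(k_src, k_mix)
```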
FastICA algorithm
An iteration procedure for maximizing nongaussianity:

Step 1: Choose an initial weight vector w.
Step 2: Let w⁺ = E[x g(wᵀx)] − E[g′(wᵀx)] w, where g is a non-quadratic function.
Step 3: Let w = w⁺ / ||w⁺||.
Step 4: If not converged, go back to Step 2.
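The four steps above translate directly into code. A minimal sketch, assuming centered and whitened input and g = tanh (a standard choice; the thesis may use a different g):

```python
import numpy as np

def fastica_one_unit(X, max_iter=200, tol=1e-6, seed=0):
    """One-unit FastICA on whitened data X (rows = signals, cols = samples)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[0])            # Step 1: initial weight vector
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ X
        g, gp = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        # Step 2: w+ = E[x g(w^T x)] - E[g'(w^T x)] w
        w_new = (X * g).mean(axis=1) - gp.mean() * w
        w_new /= np.linalg.norm(w_new)         # Step 3: normalize
        if abs(abs(w_new @ w) - 1.0) < tol:    # Step 4: check convergence
            return w_new
        w = w_new
    return w
```

Applied to whitened mixtures, the recovered component y = wᵀX matches one source up to sign and scale; further components are found by rerunning with decorrelation against the vectors already found.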
Results and Discussions
Latent structure detection
Simulated dataset -1
Figure: Matrix plot of the 10 original source variables, each drawn from a uniform distribution.
Simulated dataset -1
Figure: (a) Matrix plot of 10 principal components. (b) Matrix plot of source variables.
Simulated dataset -1
Figure: (a) Matrix plot of 10 independent components. (b) Matrix plot of source variables
Simulated dataset-2
Simulated dataset-2 consists of 5 variables drawn from the Laplace (super-Gaussian), uniform (sub-Gaussian), binomial, multinomial, and normal distributions, each with 10,000 observations.

Figure: Matrix plot of the 5 original source variables, each from a different distribution.
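As a sketch, a dataset of the kind described above can be regenerated; the distribution parameters below are assumptions, since the thesis does not state them:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10000

# Five source variables from the five distributions named in the slide
sources = np.column_stack([
    rng.laplace(0.0, 1.0, n),                   # Laplace (super-Gaussian)
    rng.uniform(-1.0, 1.0, n),                  # uniform (sub-Gaussian)
    rng.binomial(10, 0.5, n).astype(float),     # binomial
    rng.multinomial(1, [0.2, 0.3, 0.5], n).argmax(axis=1).astype(float),  # multinomial category
    rng.normal(0.0, 1.0, n),                    # normal
])
print(sources.shape)
```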
Simulated dataset-2
Figure: (Left) Matrix plot of principal components. (Right) Original source of 5 variables, each from a different distribution.
Simulated dataset-2
Figure: (Left) Matrix plot of independent components. (Right) Original source of 5 variables, each from a different distribution.
Cluster Analysis
Australian Crabs dataset

The first real data set used for clustering is the Australian crabs data set, which has 200 rows and 8 columns, including 5 morphological measurements (frontal lobe size, rear width, carapace length, carapace width, body depth). The data set contains two species of the genus Leptograpsus, each with both sexes (male, female): 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia (N. A. Campbell et al., 1974).
Fisher Iris dataset

The second real data set is the world-famous Fisher's Iris data set, which reports four characteristics (sepal width, sepal length, petal width, and petal length) of three species (setosa, versicolor, virginica) of Iris flower.
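An illustrative sketch (not the thesis code) of comparing PCA and ICA projections of the Iris data for clustering, using scikit-learn; the choice of 2 components, 3 clusters, and the random seeds are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, FastICA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

# Project the 4 measurements onto 2 principal / independent components
Z_pca = PCA(n_components=2).fit_transform(X)
Z_ica = FastICA(n_components=2, random_state=0, max_iter=1000).fit_transform(X)

# Cluster each projection into 3 groups and score against the true species
for name, Z in [("PCA", Z_pca), ("ICA", Z_ica)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    print(name, adjusted_rand_score(y, labels))
```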
Outlier detection
Scottish hill racing dataset
The data give the record winning times for 35 hill races in Scotland (Atkinson, 1986). The purpose of the study was to investigate the relationship between the record times and the characteristics of the 35 hill races.
Epilepsy dataset
Thall and Vail (1990) reported data from a clinical trial of 59 patients with epilepsy, 31 of whom were randomized to receive the anti-epilepsy drug progabide and 28 to receive a placebo.
Stackloss data

These data cover 21 days of operation of a plant for the oxidation of ammonia, a stage in the production of nitric acid. The response, called stack loss, is the percentage of unabsorbed ammonia that escapes from the plant. There are three explanatory variables and one response variable in the data set.
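A minimal sketch of score-based outlier flagging of the kind used in this section, on synthetic data rather than the Stackloss data itself; the planted outlier, the PCA-style scores, and the threshold of 4 standard deviations are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Hypothetical 3-variable data with one planted outlier at row 0
X = rng.normal(size=(n, 3)) @ np.array([[1.0, 0.5, 0.0],
                                        [0.0, 1.0, 0.5],
                                        [0.0, 0.0, 1.0]])
X[0] = [8.0, -8.0, 8.0]

Xc = X - X.mean(axis=0)
# Component scores via eigendecomposition of the covariance (PCA-style);
# the same thresholding can be applied to independent-component scores.
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
scores = Xc @ vecs

# Flag rows whose standardized score exceeds 4 on any component
z = np.abs(scores - scores.mean(axis=0)) / scores.std(axis=0)
outliers = np.where((z > 4.0).any(axis=1))[0]
print(outliers)
```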
Education expenditure dataset
These data were used by Chatterjee, Hadi, and Price as an example of heteroscedasticity. The data give the education expenditures of U.S. states as projected in 1975.
Conclusions
If the subject domain supports the assumption of independent non-Gaussian source variables, we recommend using ICA in place of PCA for latent structure detection, clustering, and outlier detection.
Future Research
The following are the areas in which we want to study further:

o Using the kernel technique of ICA for shape study, clustering, and outlier detection.
o Separation of nonlinear mixtures.
o Data mining (sometimes called data or knowledge discovery) is the most recent technique in multivariate analysis for extracting information from a data set and transforming it into an understandable structure for further use. Text data mining or medical data mining using ICA would be future research.
Thank you