Independent Component Analysis
Arthur Gretton
Carnegie Mellon University
2009
ICA: setting
Independent component analysis:
• S is a vector of l unknown, independent sources: P_S = ∏_{i=1}^l P_{S_i}
• X = AS is the vector of observed mixtures
• A is the l × l mixing matrix (full rank)
ICA: setting
Independent component analysis:
• B is an estimate of A^{-1}; we solve for B
• Y = BX is the vector of estimated sources
• Neglect time dependence: m i.i.d. observations of the mixture
ICA: another example
• Mixtures X are the original EEG recordings [Jung et al., 2000]
• Estimated sources Y are the ICA components
• Scalp map obtained from B
ICA examples
• We’ve seen:
– Sounds mixed together (“cocktail party” problem) [Hyvarinen et al., 2001]
– EEG recordings (brain, fetal heartbeat) [Jung et al., 2000, Stogbauer et al., 2004]
• Some further examples:
– Extracting independent activity from fMRI [Calhoun et al., 2003]
– Financial data [Kiviluoto and Oja, 1998]
– Linear edge filters for image patch coding? (Possibly not: [Bethge, 2006])
A toy example
• Two distributions: PS1 is uniform, PS2 is bimodal
[Figure: histogram of Source 1 (uniform) and histogram of Source 2 (bimodal)]
First indeterminacy: ordering
• Initial unmixed RVs in red
[Figure: scatter plots of the two signals: input sources, then mixtures at rotations π/6, π/4, π/3, and π/2]
• Independent at rotation π/2
Ignore source order
Second indeterminacy: sign
• Initial unmixed RVs in red
• Source 2 sign reversed in blue
[Figure: scatter plots of the input sources (source 2 sign reversed shown in blue) and of the resulting mixtures]
Ignore source sign
• More generally: S1 and S2 are independent iff aS1 and S2 are independent, for any a ≠ 0
  – Convention: assume the sources have unit variance
Third indeterminacy: Gaussians
Both sources Gaussian
[Figure: joint scatter of sources S1 and S2, and the Gaussian marginal densities P(S1) and P(S2)]
Meaningless to “unmix” Gaussians: any rotation of two independent unit-variance Gaussians is again a pair of independent Gaussians, so independence gives no information about the rotation
Things that are impossible for ICA
Using independence alone, we cannot . . .
• recover signal order,
• recover signal sign (or amplitude),
• separate multiple Gaussians.
We can recover
B* = P D A^{-1}
• P is a permutation matrix
• D is diagonal, d_ii ∈ {−1, 1}
(as long as there is at most one Gaussian source)
First step in ICA: decorrelate
• Idea: remove all dependencies of order 2 between mixtures X
[Figure: scatter of uncorrelated signals (X, Y) and of correlated mixtures (AX, AY)]
• New signals X̃ = B_w X have unit covariance: C_X̃ = I
• We thus break up B as follows:
  B = B_r B_w
  – B_w is a whitening matrix
  – B_r is the remaining demixing operation
• Use the SVD of the mixture covariance C_x = U Λ U^⊤:
  B_w = Λ^{−1/2} U^⊤
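As a concrete illustration, here is a minimal NumPy sketch of the whitening step (function and variable names are mine, not from the slides):

import numpy as np

def whiten(X):
    """Whiten mixtures X (l x m, one column per observation).
    Returns B_w = Lambda^{-1/2} U^T and the whitened signals,
    whose empirical covariance is the identity."""
    X = X - X.mean(axis=1, keepdims=True)      # centre each mixture
    C_x = X @ X.T / X.shape[1]                 # mixture covariance
    U, lam, _ = np.linalg.svd(C_x)             # C_x = U diag(lam) U^T
    B_w = np.diag(lam ** -0.5) @ U.T           # Lambda^{-1/2} U^T
    return B_w, B_w @ X

# Example: two independent sources mixed by a random full-rank matrix
rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(2, 10000))
A = rng.normal(size=(2, 2))
B_w, X_tilde = whiten(A @ S)
print(np.round(X_tilde @ X_tilde.T / X_tilde.shape[1], 3))  # ~ identity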
What does decorrelation achieve?
• Two distributions: PS1 is uniform, PS2 is bimodal
[Figure: scatter plots of the input sources, the observed mixtures, and the mixtures after decorrelation]
Problem remaining: rotation
• Assume correlation has already been removed
• To recover original signal, need to rotate
[Figure: scatter plots of the input sources and of the mixtures after decorrelation: a rotation remains]
• In the remainder: the unmixing matrix B is a rotation, B^⊤B = I
ICA: maximum likelihood
• Model for the mixtures parametrised by (B, P_S)
[Figure: model source distribution P_S1 × P_S2, and the log likelihood as a function of the unmixing angle (× π/2), shown for unmixing angles 0, π/12, and π/4 for B]
ICA: maximum likelihood
• We have a model for the observations, parametrised by (B, P_S)
  – Model must have P_S = ∏_{i=1}^l P_{S_i}
• With this model, our estimated density of observations is
  P_X = |det(B)| P_S(BX)
• Since B is a rotation, |det(B)| = 1, so this simplifies to
  P_X = P_S(BX)
• Maximise the expected log likelihood,
  L := E_X [log P_X]
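To make the objective concrete, here is a small sketch (my own, not from the slides) that scans the expected log likelihood over the unmixing rotation angle, assuming a unit-variance Laplace model density for both sources:

import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def log_lik(X, theta):
    """Average log likelihood of whitened mixtures X (2 x m) under a
    product-Laplace source model, for unmixing angle theta."""
    Y = rot(theta) @ X
    # Unit-variance Laplace log density: -sqrt(2)|y| - log(sqrt(2))
    return np.mean(np.sum(-np.sqrt(2) * np.abs(Y) - np.log(np.sqrt(2)), axis=0))

rng = np.random.default_rng(1)
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 5000))  # unit-variance sources
X = rot(np.pi / 6) @ S                                 # mix by a pure rotation
thetas = np.linspace(0, np.pi / 2, 181)
best = thetas[np.argmax([log_lik(X, t) for t in thetas])]
print(best)  # ~ pi/3, i.e. -pi/6 up to the permutation/sign indeterminacy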
Maximum likelihood: where it fails
• Model as before, but true source densities are Laplace.
• Why is this so wrong?
[Figure: scatter of the Laplace input sources and of the max. likelihood solution under the mismatched model]
[Figure: the assumed source distribution, and the log likelihood as a function of the unmixing angle (× π/2)]
Back to original setting: independence
• Ideally: contrast φ(Y) = 0 if and only if all components of Y are mutually independent:
  P_Y = ∏_{i=1}^l P_{Y_i}
  – Under our mixing assumptions: Y are then the original sources S, up to permutations and sign swaps
• How it’s really used: the contrast should be “smallest” when the random variables are “most independent”
Mutual information
• The mutual information:
  I(Y) = D_KL( P_Y ‖ ∏_{i=1}^l P_{Y_i} )
• D_KL ≥ 0, with equality iff P_Y = ∏_{i=1}^l P_{Y_i}
• Simplification: when B is a rotation,
  D_KL( P_Y ‖ ∏_{i=1}^l P_{Y_i} ) = ∑_{i=1}^l h(Y_i) − h(X) − log|det B|,
  where the last two terms are constant, and h(Y) = −E_Y log(P_Y(y)) is the differential entropy
• Contrast: φ_KL(Y) := ∑_{i=1}^l h(Y_i)
Maximum likelihood revisited
• Mutual information contrast: minimize
  φ_KL(Y) := ∑_{i=1}^l −E_Y log(P_{Y_i}(y_i))
• Maximum likelihood: maximize
  L := E_X [log P_S(BX)] = ∑_{i=1}^l E_Y log(P_{Y_i}(y_i))
• Same thing!
• The difference is in approach:
  – For max. likelihood we assumed a model P_S
  – Now we assume no model for P_Y (though we still make assumptions)
Contrast functions with fixed nonlinearities
• Entropies are hard to compute/optimize: replace with
  φ_f(Y) = ∑_{j=1}^l E(f(Y_j))
  for some other nonlinear f(y)
[Figure: the nonlinearities — Jade: f(y) = y⁴; Infomax: f(y) = a − exp(−y²/2) sech²(y); Fast ICA: f(y) = (1/a) log cosh(ay)]
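A sketch (assumptions and names mine) of scanning such a contrast over candidate unmixing angles, using the Fast ICA nonlinearity with super-Gaussian sources:

import numpy as np

def contrast_fica(Y, a=1.0):
    """Fast ICA contrast: sum over components of E[(1/a) log cosh(a y)]."""
    return np.sum(np.mean(np.log(np.cosh(a * Y)) / a, axis=1))

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(2)
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 5000))  # super-Gaussian sources
X = rot(np.pi / 4) @ S
thetas = np.linspace(0, np.pi / 2, 181)
vals = [contrast_fica(rot(t) @ X) for t in thetas]
# Whether the demixing angle is a minimum or a maximum of the contrast
# depends on the source kurtosis -- see the demo later in the slides.
print(thetas[int(np.argmin(vals))], thetas[int(np.argmax(vals))])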
Our example again
[Figure: input sources, mixtures after decorrelation, and the Jade, Infomax, and Fast ICA contrasts as functions of the unmixing angle (× π/2)]
Kurtosis: an important concept
• Kurtosis definition: when the mean is zero,
  κ₄ = E(x⁴) − 3 (E(x²))².
• Source densities can be super-Gaussian (positive kurtosis) or sub-Gaussian (negative kurtosis)
• Zero kurtosis does not mean Gaussian!
[Figure: a sub-Gaussian density (uniform) and a super-Gaussian density (Laplace)]
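A quick empirical check of this definition (a sketch; names mine):

import numpy as np

def kurtosis(x):
    """kappa_4 = E(x^4) - 3 (E(x^2))^2 for a zero-mean sample."""
    x = x - x.mean()
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

rng = np.random.default_rng(3)
print(kurtosis(rng.uniform(-1, 1, 100000)))  # sub-Gaussian: negative
print(kurtosis(rng.laplace(size=100000)))    # super-Gaussian: positive
print(kurtosis(rng.normal(size=100000)))     # Gaussian: ~ 0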
Demo: contrasts with fixed nonlinearities
• Super-Gaussian (Laplace) and sub-Gaussian (Uniform) sources
• Unmixed sources in red
• Mixture (angle π/6) in black
[Figure: scatter plots of the super-Gaussian and sub-Gaussian source pairs]
Demo: contrasts with fixed nonlinearities
• Results for Jade, Infomax, and Fast ICA contrasts
[Figure: Jade, Infomax, and Fast ICA contrasts vs. unmixing angle (× π/2), for super-Gaussian (top row) and sub-Gaussian (bottom row) sources]
Care needed when using fixed contrasts!
Contrast functions using entropy estimates
• Simplest option: convolve with spline kernel, then compute discrete
entropy via space partition [Pham, 2004]
[Figure: MICA contrast vs. angle (× π/2), input sources, and mixtures after decorrelation]
Contrast functions using spacings entropy estimate
• More sophisticated option: spacings estimate of entropy
[Learned-Miller and Fisher III, 2003]
[Figure: RADICAL contrast vs. angle (× π/2), input sources, and mixtures after decorrelation]
• Sort the sample Y_1, . . . , Y_m in increasing order: Y_(i) ≤ Y_(i+1)
• Probability density estimate based on spacings:
  P̂(y; Y_1, . . . , Y_m) = 1 / ((m + 1)(Y_(i+1) − Y_(i))),   Y_(i) ≤ y < Y_(i+1)
• Entropy estimate based on spacings:
  ĥ(Y) = (1/(m − 1)) ∑_{i=1}^{m−1} log((m + 1)(Y_(i+1) − Y_(i)))
• Smoothing: add “extra” mixture points (noisy copies of the original mixtures)
• Hard to optimize
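The entropy estimate above is a few lines of NumPy (a sketch; the smoothing with noisy copies is omitted, and note that this simple one-spacing version is biased low by roughly Euler's constant, which m-spacing variants such as the one used in RADICAL correct):

import numpy as np

def spacings_entropy(y):
    """h(Y) ~ (1/(m-1)) sum_i log((m+1)(Y_(i+1) - Y_(i)))."""
    y = np.sort(y)
    m = len(y)
    gaps = np.maximum(np.diff(y), 1e-12)   # guard against tied samples
    return np.mean(np.log((m + 1) * gaps))

rng = np.random.default_rng(4)
# For N(0,1), the true differential entropy is 0.5*log(2*pi*e) ~ 1.419;
# the one-spacing estimate comes out lower by ~ 0.577 (the known bias).
print(spacings_entropy(rng.normal(size=10000)))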
Other independence measures as contrasts
• Why mutual information?
  – Same as maximum likelihood (good if the model is correct)
  – Contrast function is a sum of entropies: fast
• Other independence measures?
• Most common: kernel/characteristic function-based
  – Characteristic function-based ICA [Eriksson and Koivunen, 2003, Chen and Bickel, 2005]
  – Kernel ICA (covariance): COCO, KMI, HSIC [Gretton et al., 2005, Shen et al., 2007, 2009]
  – Kernel ICA (correlation): KCCA, KGV [Bach and Jordan, 2002]
• HSIC is the same as the characteristic function-based measure (for the purposes of ICA) [Shen et al., 2009]
Kernel contrast function: HSIC
• Dependence measure:
  HSIC(P_UV, F) := ( sup_{f∈F} [E_UV f − E_U E_V f] )²
[Figure: dependence witness function and sample]
HSIC: empirical expression
• Empirical HSIC:
  HSIC := (1/m²) tr(KHLH)
  – K: Gram matrix for (u_1, . . . , u_m)
  – L: Gram matrix for (v_1, . . . , v_m)
  – Centering: H = I − (1/m) 1_m 1_m^⊤
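This translates directly into NumPy (a sketch, assuming Gaussian kernels; the bandwidth choice is mine):

import numpy as np

def gram_gauss(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for a 1-D sample x."""
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def hsic(u, v, sigma=1.0):
    """Empirical HSIC = tr(KHLH) / m^2, with H = I - (1/m) 1 1^T."""
    m = len(u)
    K, L = gram_gauss(u, sigma), gram_gauss(v, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m ** 2

rng = np.random.default_rng(5)
u = rng.normal(size=500)
print(hsic(u, rng.normal(size=500)))  # independent: close to zero
print(hsic(u, u ** 2))                # dependent: clearly larger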
Contrast functions: a small selection
Contrast functions
• Sum of expectations of a fixed nonlinearity
– Fast ICA, Infomax, Jade
• Sum of entropies/mutual information. . .
– . . . using fast, smoothed entropy estimates
– . . . using spacings/k-nn entropy estimates
• Kernel/characteristic function dependence measures
How do we optimize?
Optimization (Jacobi)
• For two signals, the rotation is expressed
  B = [ cos(θ)  −sin(θ)
        sin(θ)   cos(θ) ]
• Higher dimensions, e.g. for l = 3:
  B := [ cos(θ_z)  −sin(θ_z)  0       [ cos(θ_y)  0  −sin(θ_y)       [ 1  0         0
         sin(θ_z)   cos(θ_z)  0   ×     0         1   0         ×     0  cos(θ_x)  −sin(θ_x)
         0          0         1 ]       sin(θ_y)  0   cos(θ_y) ]      0  sin(θ_x)   cos(θ_x) ]
• Coordinate descent, exhaustive search, etc.
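For two whitened signals, the Jacobi approach reduces to a one-dimensional search over the rotation angle; a sketch (contrast choice and names are mine):

import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def jacobi_2d(X, contrast, n_grid=180):
    """Exhaustive search over rotation angles in [0, pi/2): the angle is
    only identified modulo pi/2 (permutation/sign indeterminacy)."""
    thetas = np.linspace(0, np.pi / 2, n_grid, endpoint=False)
    vals = [contrast(rot(t) @ X) for t in thetas]
    best = thetas[int(np.argmin(vals))]
    return rot(best), best

# Sub-Gaussian (uniform) sources: minimising sum_j E[y_j^4] works here,
# since mixing pushes each component's kurtosis toward zero from below
rng = np.random.default_rng(6)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))  # unit variance
X = rot(0.4) @ S
B, theta = jacobi_2d(X, lambda Y: np.sum(np.mean(Y ** 4, axis=1)))
print(theta)  # ~ pi/2 - 0.4, i.e. undoes the mixing up to permutation/sign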
Optimization (Newton)
• Unmixing matrix B satisfies B^⊤B = I
• Local parameterisation Ω about B: at iteration k,
  B_{k+1} = B_k exp(Ω),   Ω = −Ω^⊤
• How to choose the direction and size of Ω?
• Newton-like method: solve the linear system for Ω ∈ R^{m(m−1)/2}
  H_{B_k}(φ) Ω = −∇_{B_k}(φ)
• Approximate the Hessian as diagonal: Fast ICA [Shen and Huper, 2006]
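The multiplicative update itself is straightforward; here is a sketch of one step on the orthogonal group (the gradient and Hessian of the contrast are problem-specific and omitted; names mine):

import numpy as np
from scipy.linalg import expm

def skew_from_params(omega, l):
    """Build a skew-symmetric matrix from its l(l-1)/2 free parameters."""
    Om = np.zeros((l, l))
    Om[np.triu_indices(l, k=1)] = omega
    return Om - Om.T

def orthogonal_step(B, omega):
    """One multiplicative update B_{k+1} = B_k exp(Omega), Omega = -Omega^T."""
    return B @ expm(skew_from_params(omega, B.shape[0]))

B = np.eye(3)
B = orthogonal_step(B, np.array([0.1, -0.2, 0.05]))  # omega from the Newton solve
print(np.round(B.T @ B, 6))  # the iterate stays orthogonal: identity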
Gradient descent vs Newton
[Figure: 100 × Amari error vs. iteration: Newton converges faster than gradient descent]
What if we have time dependence?
• We can get extra information when the sources are not i.i.d.
• The mixture x(t) is now a stationary random process, depending on x(t − τ)
• Define the mixture covariances
  C_0 = E(x(t) x(t)^⊤),   C_τ = E(x(t) x(t − τ)^⊤)
  – C_τ independent of t (stationarity)
• Decorrelate at both lags:
  B C_0 B^⊤ = Λ,   B C_τ B^⊤ = Λ̃
  – Λ and Λ̃ diagonal
• Combining both requirements:
  B C_0 C_τ^{−1} = (Λ Λ̃^{−1}) B
• Greater number of delays: joint diagonalisation
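For a single delay, the two conditions can be solved with one generalized eigendecomposition, in the spirit of [Molgedey and Schuster, 1994]; a sketch (assumptions and names mine):

import numpy as np
from scipy.linalg import eigh

def second_order_unmix(X, tau=1):
    """Estimate B from C_0 and C_tau via the generalized symmetric
    eigenproblem C_tau v = lambda C_0 v (assumes distinct eigenvalues).
    X is l x T, one column per time step."""
    X = X - X.mean(axis=1, keepdims=True)
    T = X.shape[1]
    C0 = X @ X.T / T
    Ctau = X[:, :-tau] @ X[:, tau:].T / (T - tau)
    Ctau = (Ctau + Ctau.T) / 2                # symmetrise for stability
    _, V = eigh(Ctau, C0)                     # normalised so V^T C0 V = I
    return V.T                                # rows are unmixing directions

# Two AR(1) sources with different time constants, linearly mixed
rng = np.random.default_rng(7)
T = 20000
S = np.zeros((2, T))
for t in range(1, T):
    S[:, t] = np.array([0.9, 0.3]) * S[:, t - 1] + rng.normal(size=2)
A = rng.normal(size=(2, 2))
B = second_order_unmix(A @ S)
print(np.round(B @ A, 2))  # ~ scaled permutation if separation worked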
What’s the best method?
A basic benchmark
• l = 8 sources
• m = 40,000 samples
• Benchmark data from [Bach and Jordan, 2002]
• Average over 24 repetitions
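The results below are reported as 100 × Amari error. The slides do not define this index; for reference, here is a sketch of the standard Amari performance index of P = BA (my implementation, normalised to [0, 1]):

import numpy as np

def amari_error(P):
    """Standard Amari index: zero iff P is a scaled permutation matrix."""
    P = np.abs(np.asarray(P))
    l = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (rows.sum() + cols.sum()) / (2 * l * (l - 1))

print(amari_error(np.eye(8)))        # perfect demixing: 0
print(amari_error(np.ones((8, 8))))  # complete mixing: 1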
A basic benchmark: results
Adaptive contrasts outperform fixed nonlinearities
[Figure: demixing quality (100 × Amari error) for Jade, KDICA, MICA, FKICA, QGD, RAD, MILCA; the adaptive contrasts achieve lower error]
A basic benchmark: computational cost
[Figure: demixing quality (100 × Amari error) and run times (seconds, log scale) for Jade, KDICA, MICA, FKICA, QGD, RAD, MILCA]
• Best runtime among adaptive contrasts: the fast entropy estimates
• Kernel methods: Newton outperforms gradient descent
• Spacings/k-nn entropy contrasts are slowest
High frequency perturbations
• Two sources, sinusoidal perturbations to Gaussian
• Random mixing angle.
• Results averaged over 25 datasets, m = 1000
[Figure: the perturbed source density p(s) and its spectrum F[p](f)]
High frequency perturbations
[Figure: 100 × Amari error vs. perturbation frequency for FKICA, MICA, KDICA, MILCA, RAD]
• Spacings/k-nn methods perform best (but slow)
• Fast entropy estimates: narrowest effective frequency range
• Fast kernel ICA performs in between (good performance/runtime tradeoff)
Outlier resistance
• Two sources, outliers added to both mixtures
[Figure: 100 × Amari error vs. number of outliers for FKICA, MICA, KDICA, MILCA, RAD]
• Kernel ICA performs best
• Fast entropy estimates: less good (KDICA initialized with the kernel ICA solution!)
ICA algorithm choice
• Choosing an ICA approach:
  – Fastest (by far): Fast ICA [Hyvarinen et al., 2001], Jade [Cardoso, 1998]
  – Good tradeoff between speed and performance: MICA [Pham, 2004]
  – Tricky cases (outliers, non-smooth sources): Fast KICA [Shen et al., 2007, 2009]
  – Small sample size: KGV very good [Bach and Jordan, 2002]
• Some further hints:
  – Use multiple restarts (the problem is non-convex)
  – Use an independence test to check the answer
• Comparing (usually fixed contrast) algorithms:
  – Is one approach “better” than another?
  – Example: number of sources l very large, samples m small (w.r.t. l), e.g. microarray data [Lee and Batzoglou, 2003]
Selected ICA references
• Start with Cardoso’s excellent introduction [Cardoso, 1998], and the book by
Hyvarinen et al. [Hyvarinen et al., 2001]
• Fast kernel ICA is described in [Shen et al., 2007, 2009]. Characteristic
function-based ICA is described in [Eriksson and Koivunen, 2003, Chen and Bickel, 2005].
For earlier kernel ICA methods, see [Bach and Jordan, 2002, Gretton et al., 2005]
• Mutual information/entropy based: [Pham, 2004, Learned-Miller and Fisher III, 2003,
Stogbauer et al., 2004, Chen, 2006]
• Classic algorithms for time series separation with second order methods
(not covered much in this talk): [Molgedey and Schuster, 1994, Belouchrani et al., 1997]
• An important paper for optimising over orthogonal matrices: [Edelman et al.,
1998]. The Newton-like method: [Huper and Trumpf, 2004].
References

F.R. Bach and M.I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines. A blind source separation technique using second order statistics. IEEE Transactions on Signal Processing, 45(2):434–444, 1997.

M. Bethge. Factorial coding of natural images: how effective are linear models in removing higher-order dependencies? Journal of the Optical Society of America A, 23(6):1253–1268, 2006.

V. Calhoun, T. Adali, L. Hansen, J. Larsen, and J. Pekar. ICA of functional MRI data: An overview. In Proc. 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA 2003), pages 281–288, 2003.

J.-F. Cardoso. Blind signal separation: Statistical principles. Proceedings of the IEEE, Special Issue on Blind Identification and Estimation, 86(10):2009–2025, 1998.

A. Chen. Fast kernel density independent component analysis. In Sixth International Conference on ICA and BSS, volume 3889, pages 24–31, Berlin/Heidelberg, 2006. Springer-Verlag.

A. Chen and P.J. Bickel. Consistent independent component analysis and prewhitening. IEEE Transactions on Signal Processing, 53(10):3625–3632, 2005.

A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

J. Eriksson and K. Koivunen. Characteristic-function based independent component analysis. Signal Processing, 83(10):2195–2208, 2003.

A. Gretton, R. Herbrich, A.J. Smola, O. Bousquet, and B. Scholkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.

K. Huper and J. Trumpf. Newton-like methods for numerical optimisation on manifolds. In Proceedings of the Thirty-eighth Asilomar Conference on Signals, Systems and Computers, pages 136–139, 2004.

A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.

T.-P. Jung, S. Makeig, C. Humphries, T.-W. Lee, M. McKeown, V. Iragui, and T. Sejnowski. Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37:163–178, 2000.

K. Kiviluoto and E. Oja. Independent component analysis for parallel financial time series. In Proc. Int. Conf. on Neural Information Processing (ICONIP), volume 2, pages 895–898, 1998.

E.G. Learned-Miller and J.W. Fisher III. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271–1295, 2003.

S.I. Lee and S. Batzoglou. Application of independent component analysis to microarrays. Genome Biology, 4(11):R76, 2003.

L. Molgedey and H. Schuster. Separation of a mixture of independent signals using time delayed correlation. Physical Review Letters, 72(23):3634–3637, 1994.

D.-T. Pham. Fast algorithms for mutual information based independent component analysis. IEEE Transactions on Signal Processing, 52(10):2690–2700, 2004.

H. Shen and K. Huper. Newton-like methods for parallel independent component analysis. In MLSP 16, pages 283–288, Maynooth, Ireland, 2006.

H. Shen, S. Jegelka, and A. Gretton. Fast kernel ICA using an approximate Newton method. In AISTATS 11, pages 476–483. Microtome, 2007.

H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 2009. In press.

H. Stogbauer, A. Kraskov, S. Astakhov, and P. Grassberger. Least dependent component analysis based on mutual information. Phys. Rev. E, 70(6):066123, 2004.