Independent Component Analysis
Arthur Gretton
Carnegie Mellon University
2009
ICA: setting
Independent component analysis:
• S is a vector of l unknown, independent sources: P_S = ∏_{i=1}^l P_{S_i}
• X = AS is the vector of observed mixtures
• A is the l × l mixing matrix (full rank)
ICA: setting
Independent component analysis:
• B is an estimate of A^{-1}; we solve for B
• Y = BX is the vector of estimated sources
• Neglect time dependence: m i.i.d. observations of the mixture
ICA: another example
• Mixtures X are the original EEG recordings [Jung et al., 2000]
• Estimated sources Y are the ICA components
• Scalp map obtained from B
ICA examples
• We’ve seen:
– Sounds mixed together (“cocktail party” problem) [Hyvarinen et al., 2001]
– EEG recordings (brain, fetal heartbeat) [Jung et al., 2000, Stogbauer et al., 2004]
• Some further examples:
– Extracting independent activity from fMRI [Calhoun et al., 2003]
– Financial data [Kiviluoto and Oja, 1998]
– Linear edge filters for image patch coding? (Possibly not: [Bethge, 2006])
A toy example
• Two distributions: PS1 is uniform, PS2 is bimodal
[Figure: histogram of Source 1 (uniform) and histogram of Source 2 (bimodal)]
First indeterminacy: ordering
• Initial unmixed RVs in red
[Figure: scatter plots of the two signals: input sources, then mixtures at rotations π/6, π/4, π/3, and π/2]
• Independent at rotation π/2
Ignore source order
Second indeterminacy: sign
• Initial unmixed RVs in red
• Source 2 sign reversed in blue
[Figure: scatter plots of the input sources (source 2 sign reversed shown in blue) and of the resulting mixtures]
Ignore source sign
• More generally: S1 and S2 are independent iff aS1 and S2 are independent, for any a ≠ 0
  – Convention: assume the sources have unit variance
Third indeterminacy: Gaussians
Both sources Gaussian
[Figure: joint scatter of sources S1 and S2, and the Gaussian marginal densities P(S1) and P(S2)]
Meaningless to “unmix” Gaussians: any rotation of two independent unit-variance Gaussians is again a pair of independent Gaussians, so independence gives no information about the rotation
Things that are impossible for ICA
Using independence alone, we cannot . . .
• recover signal order,
• recover signal sign (or amplitude),
• separate multiple Gaussians.
We can recover
B* = P D A^{-1}
• P is a permutation matrix
• D is diagonal, d_ii ∈ {−1, 1}
(as long as there is at most one Gaussian source)
First step in ICA: decorrelate
• Idea: remove all dependencies of order 2 between mixtures X
[Figure: scatter of uncorrelated signals (X, Y) and of correlated mixtures (AX, AY)]
• New signals X̃ = B_w X have unit covariance: C_X̃ = I
• We thus break up B as follows:
  B = B_r B_w
  – B_w is a whitening matrix
  – B_r is the remaining demixing operation
• Use the SVD of the mixture covariance C_x = U Λ U^⊤:
  B_w = Λ^{−1/2} U^⊤
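As a concrete illustration, here is a minimal NumPy sketch of the whitening step (function and variable names are mine, not from the slides):

import numpy as np

def whiten(X):
    """Whiten mixtures X (l x m, one column per observation).
    Returns B_w = Lambda^{-1/2} U^T and the whitened signals,
    whose empirical covariance is the identity."""
    X = X - X.mean(axis=1, keepdims=True)      # centre each mixture
    C_x = X @ X.T / X.shape[1]                 # mixture covariance
    U, lam, _ = np.linalg.svd(C_x)             # C_x = U diag(lam) U^T
    B_w = np.diag(lam ** -0.5) @ U.T           # Lambda^{-1/2} U^T
    return B_w, B_w @ X

# Example: two independent sources mixed by a random full-rank matrix
rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(2, 10000))
A = rng.normal(size=(2, 2))
B_w, X_tilde = whiten(A @ S)
print(np.round(X_tilde @ X_tilde.T / X_tilde.shape[1], 3))  # ~ identity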
What does decorrelation achieve?
• Two distributions: PS1 is uniform, PS2 is bimodal
[Figure: scatter plots of the input sources, the observed mixtures, and the mixtures after decorrelation]
Problem remaining: rotation
• Assume correlation has already been removed
• To recover original signal, need to rotate
[Figure: scatter plots of the input sources and of the mixtures after decorrelation: a rotation remains]
• In the remainder: the unmixing matrix B is a rotation, B^⊤B = I
ICA: maximum likelihood
• Model for the mixtures parametrised by (B, P_S)
[Figure: model source distribution P_S1 × P_S2, and the log likelihood as a function of the unmixing angle (× π/2), shown for unmixing angles 0, π/12, and π/4 for B]
ICA: maximum likelihood
• We have a model for the observations, parametrised by (B, P_S)
  – Model must have P_S = ∏_{i=1}^l P_{S_i}
• With this model, our estimated density of observations is
  P_X = |det(B)| P_S(BX)
• Since B is a rotation, |det(B)| = 1, so this simplifies to
  P_X = P_S(BX)
• Maximise the expected log likelihood,
  L := E_X [log P_X]
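To make the objective concrete, here is a small sketch (my own, not from the slides) that scans the expected log likelihood over the unmixing rotation angle, assuming a unit-variance Laplace model density for both sources:

import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def log_lik(X, theta):
    """Average log likelihood of whitened mixtures X (2 x m) under a
    product-Laplace source model, for unmixing angle theta."""
    Y = rot(theta) @ X
    # Unit-variance Laplace log density: -sqrt(2)|y| - log(sqrt(2))
    return np.mean(np.sum(-np.sqrt(2) * np.abs(Y) - np.log(np.sqrt(2)), axis=0))

rng = np.random.default_rng(1)
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 5000))  # unit-variance sources
X = rot(np.pi / 6) @ S                                 # mix by a pure rotation
thetas = np.linspace(0, np.pi / 2, 181)
best = thetas[np.argmax([log_lik(X, t) for t in thetas])]
print(best)  # ~ pi/3, i.e. -pi/6 up to the permutation/sign indeterminacy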
Maximum likelihood: where it fails
• Model as before, but true source densities are Laplace.
• Why is this so wrong?
[Figure: scatter of the Laplace input sources and of the max. likelihood solution under the mismatched model]
[Figure: the assumed source distribution, and the log likelihood as a function of the unmixing angle (× π/2)]
Back to original setting: independence
• Ideally: contrast φ(Y) = 0 if and only if all components of Y are mutually independent:
  P_Y = ∏_{i=1}^l P_{Y_i}
  – Under our mixing assumptions: Y are then the original sources S, up to permutations and sign swaps
• How it’s really used: the contrast should be “smallest” when the random variables are “most independent”
Mutual information
• The mutual information:
  I(Y) = D_KL( P_Y ‖ ∏_{i=1}^l P_{Y_i} )
• D_KL ≥ 0, with equality iff P_Y = ∏_{i=1}^l P_{Y_i}
• Simplification: when B is a rotation,
  D_KL( P_Y ‖ ∏_{i=1}^l P_{Y_i} ) = ∑_{i=1}^l h(Y_i) − h(X) − log|det B|,
  where the last two terms are constant, and h(Y) = −E_Y log(P_Y(y)) is the differential entropy
• Contrast: φ_KL(Y) := ∑_{i=1}^l h(Y_i)
Maximum likelihood revisited
• Mutual information contrast: minimize
  φ_KL(Y) := ∑_{i=1}^l −E_Y log(P_{Y_i}(y_i))
• Maximum likelihood: maximize
  L := E_X [log P_S(BX)] = ∑_{i=1}^l E_Y log(P_{Y_i}(y_i))
• Same thing!
• The difference is in approach:
  – For max. likelihood we assumed a model P_S
  – Now we assume no model for P_Y (though we still make assumptions)
Contrast functions with fixed nonlinearities
• Entropies are hard to compute/optimize: replace with
  φ_f(Y) = ∑_{j=1}^l E(f(Y_j))
  for some other nonlinear f(y)
[Figure: the nonlinearities — Jade: f(y) = y⁴; Infomax: f(y) = a − exp(−y²/2) sech²(y); Fast ICA: f(y) = (1/a) log cosh(ay)]
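A sketch (assumptions and names mine) of scanning such a contrast over candidate unmixing angles, using the Fast ICA nonlinearity with super-Gaussian sources:

import numpy as np

def contrast_fica(Y, a=1.0):
    """Fast ICA contrast: sum over components of E[(1/a) log cosh(a y)]."""
    return np.sum(np.mean(np.log(np.cosh(a * Y)) / a, axis=1))

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(2)
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 5000))  # super-Gaussian sources
X = rot(np.pi / 4) @ S
thetas = np.linspace(0, np.pi / 2, 181)
vals = [contrast_fica(rot(t) @ X) for t in thetas]
# Whether the demixing angle is a minimum or a maximum of the contrast
# depends on the source kurtosis -- see the demo later in the slides.
print(thetas[int(np.argmin(vals))], thetas[int(np.argmax(vals))])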
Our example again
[Figure: input sources, mixtures after decorrelation, and the Jade, Infomax, and Fast ICA contrasts as functions of the unmixing angle (× π/2)]
Kurtosis: an important concept
• Kurtosis definition: when the mean is zero,
  κ₄ = E(x⁴) − 3 (E(x²))².
• Source densities can be super-Gaussian (positive kurtosis) or sub-Gaussian (negative kurtosis)
• Zero kurtosis does not mean Gaussian!
[Figure: a sub-Gaussian density (uniform) and a super-Gaussian density (Laplace)]
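A quick empirical check of this definition (a sketch; names mine):

import numpy as np

def kurtosis(x):
    """kappa_4 = E(x^4) - 3 (E(x^2))^2 for a zero-mean sample."""
    x = x - x.mean()
    return np.mean(x ** 4) - 3 * np.mean(x ** 2) ** 2

rng = np.random.default_rng(3)
print(kurtosis(rng.uniform(-1, 1, 100000)))  # sub-Gaussian: negative
print(kurtosis(rng.laplace(size=100000)))    # super-Gaussian: positive
print(kurtosis(rng.normal(size=100000)))     # Gaussian: ~ 0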
Demo: contrasts with fixed nonlinearities
• Super-Gaussian (Laplace) and sub-Gaussian (Uniform) sources
• Unmixed sources in red
• Mixture (angle π/6) in black
[Figure: scatter plots of the super-Gaussian and sub-Gaussian source pairs]
Demo: contrasts with fixed nonlinearities
• Results for Jade, Infomax, and Fast ICA contrasts
[Figure: Jade, Infomax, and Fast ICA contrasts vs. unmixing angle (× π/2), for super-Gaussian (top row) and sub-Gaussian (bottom row) sources]
Care needed when using fixed contrasts!
Contrast functions using entropy estimates
• Simplest option: convolve with spline kernel, then compute discrete
entropy via space partition [Pham, 2004]
[Figure: MICA contrast vs. angle (× π/2), input sources, and mixtures after decorrelation]
Contrast functions using spacings entropy estimate
• More sophisticated option: spacings estimate of entropy
[Learned-Miller and Fisher III, 2003]
[Figure: RADICAL contrast vs. angle (× π/2), input sources, and mixtures after decorrelation]
• Sort the sample Y_1, . . . , Y_m in increasing order: Y_(i) ≤ Y_(i+1)
• Probability density estimate based on spacings:
  P̂(y; Y_1, . . . , Y_m) = 1 / ((m + 1)(Y_(i+1) − Y_(i))),   Y_(i) ≤ y < Y_(i+1)
• Entropy estimate based on spacings:
  ĥ(Y) = (1/(m − 1)) ∑_{i=1}^{m−1} log((m + 1)(Y_(i+1) − Y_(i)))
• Smoothing: add “extra” mixture points (noisy copies of the original mixtures)
• Hard to optimize
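The entropy estimate above is a few lines of NumPy (a sketch; the smoothing with noisy copies is omitted, and note that this simple one-spacing version is biased low by roughly Euler's constant, which m-spacing variants such as the one used in RADICAL correct):

import numpy as np

def spacings_entropy(y):
    """h(Y) ~ (1/(m-1)) sum_i log((m+1)(Y_(i+1) - Y_(i)))."""
    y = np.sort(y)
    m = len(y)
    gaps = np.maximum(np.diff(y), 1e-12)   # guard against tied samples
    return np.mean(np.log((m + 1) * gaps))

rng = np.random.default_rng(4)
# For N(0,1), the true differential entropy is 0.5*log(2*pi*e) ~ 1.419;
# the one-spacing estimate comes out lower by ~ 0.577 (the known bias).
print(spacings_entropy(rng.normal(size=10000)))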
Other independence measures as contrasts
• Why mutual information?
  – Same as maximum likelihood (good if the model is correct)
  – Contrast function is a sum of entropies: fast
• Other independence measures?
• Most common: kernel/characteristic function-based
  – Characteristic function-based ICA [Eriksson and Koivunen, 2003, Chen and Bickel, 2005]
  – Kernel ICA (covariance): COCO, KMI, HSIC [Gretton et al., 2005, Shen et al., 2007, 2009]
  – Kernel ICA (correlation): KCCA, KGV [Bach and Jordan, 2002]
• HSIC is the same as the characteristic function-based measure (for the purposes of ICA) [Shen et al., 2009]
Kernel contrast function: HSIC
• Dependence measure:
  HSIC(P_UV, F) := ( sup_{f∈F} [E_UV f − E_U E_V f] )²
[Figure: dependence witness function and sample]
HSIC: empirical expression
• Empirical HSIC:
  HSIC := (1/m²) tr(KHLH)
  – K: Gram matrix for (u_1, . . . , u_m)
  – L: Gram matrix for (v_1, . . . , v_m)
  – Centering: H = I − (1/m) 1_m 1_m^⊤
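This translates directly into NumPy (a sketch, assuming Gaussian kernels; the bandwidth choice is mine):

import numpy as np

def gram_gauss(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for a 1-D sample x."""
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def hsic(u, v, sigma=1.0):
    """Empirical HSIC = tr(KHLH) / m^2, with H = I - (1/m) 1 1^T."""
    m = len(u)
    K, L = gram_gauss(u, sigma), gram_gauss(v, sigma)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m ** 2

rng = np.random.default_rng(5)
u = rng.normal(size=500)
print(hsic(u, rng.normal(size=500)))  # independent: close to zero
print(hsic(u, u ** 2))                # dependent: clearly larger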
Contrast functions: a small selection
Contrast functions
• Sum of expectations of a fixed nonlinearity
– Fast ICA, Infomax, Jade
• Sum of entropies/mutual information. . .
– . . . using fast, smoothed entropy estimates
– . . . using spacings/k-nn entropy estimates
• Kernel/characteristic function dependence measures
How do we optimize?
Optimization (Jacobi)
• For two signals, the rotation is expressed
  B = [ cos(θ)  −sin(θ)
        sin(θ)   cos(θ) ]
• Higher dimensions, e.g. for l = 3:
  B := [ cos(θ_z)  −sin(θ_z)  0       [ cos(θ_y)  0  −sin(θ_y)       [ 1  0         0
         sin(θ_z)   cos(θ_z)  0   ×     0         1   0         ×     0  cos(θ_x)  −sin(θ_x)
         0          0         1 ]       sin(θ_y)  0   cos(θ_y) ]      0  sin(θ_x)   cos(θ_x) ]
• Coordinate descent, exhaustive search, etc.
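For two whitened signals, the Jacobi approach reduces to a one-dimensional search over the rotation angle; a sketch (contrast choice and names are mine):

import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def jacobi_2d(X, contrast, n_grid=180):
    """Exhaustive search over rotation angles in [0, pi/2): the angle is
    only identified modulo pi/2 (permutation/sign indeterminacy)."""
    thetas = np.linspace(0, np.pi / 2, n_grid, endpoint=False)
    vals = [contrast(rot(t) @ X) for t in thetas]
    best = thetas[int(np.argmin(vals))]
    return rot(best), best

# Sub-Gaussian (uniform) sources: minimising sum_j E[y_j^4] works here,
# since mixing pushes each component's kurtosis toward zero from below
rng = np.random.default_rng(6)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, 5000))  # unit variance
X = rot(0.4) @ S
B, theta = jacobi_2d(X, lambda Y: np.sum(np.mean(Y ** 4, axis=1)))
print(theta)  # ~ pi/2 - 0.4, i.e. undoes the mixing up to permutation/sign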
Optimization (Newton)
• Unmixing matrix B satisfies B^⊤B = I
• Local parameterisation Ω about B: at iteration k,
  B_{k+1} = B_k exp(Ω),   Ω = −Ω^⊤
• How to choose the direction and size of Ω?
• Newton-like method: solve the linear system for Ω ∈ R^{m(m−1)/2}
  H_{B_k}(φ) Ω = −∇_{B_k}(φ)
• Approximate the Hessian as diagonal: Fast ICA [Shen and Huper, 2006]
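The multiplicative update itself is straightforward; here is a sketch of one step on the orthogonal group (the gradient and Hessian of the contrast are problem-specific and omitted; names mine):

import numpy as np
from scipy.linalg import expm

def skew_from_params(omega, l):
    """Build a skew-symmetric matrix from its l(l-1)/2 free parameters."""
    Om = np.zeros((l, l))
    Om[np.triu_indices(l, k=1)] = omega
    return Om - Om.T

def orthogonal_step(B, omega):
    """One multiplicative update B_{k+1} = B_k exp(Omega), Omega = -Omega^T."""
    return B @ expm(skew_from_params(omega, B.shape[0]))

B = np.eye(3)
B = orthogonal_step(B, np.array([0.1, -0.2, 0.05]))  # omega from the Newton solve
print(np.round(B.T @ B, 6))  # the iterate stays orthogonal: identity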
Gradient descent vs Newton
[Figure: 100 × Amari error vs. iteration: Newton converges faster than gradient descent]
What if we have time dependence?
• We can get extra information when the sources are not i.i.d.
• The mixture x(t) is now a stationary random process, depending on x(t − τ)
• Define the mixture covariances
  C_0 = E(x(t) x(t)^⊤),   C_τ = E(x(t) x(t − τ)^⊤)
  – C_τ independent of t (stationarity)
• Decorrelate at both lags:
  B C_0 B^⊤ = Λ,   B C_τ B^⊤ = Λ̃
  – Λ and Λ̃ diagonal
• Combining both requirements:
  B C_0 C_τ^{−1} = (Λ Λ̃^{−1}) B
• Greater number of delays: joint diagonalisation
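For a single delay, the two conditions can be solved with one generalized eigendecomposition, in the spirit of [Molgedey and Schuster, 1994]; a sketch (assumptions and names mine):

import numpy as np
from scipy.linalg import eigh

def second_order_unmix(X, tau=1):
    """Estimate B from C_0 and C_tau via the generalized symmetric
    eigenproblem C_tau v = lambda C_0 v (assumes distinct eigenvalues).
    X is l x T, one column per time step."""
    X = X - X.mean(axis=1, keepdims=True)
    T = X.shape[1]
    C0 = X @ X.T / T
    Ctau = X[:, :-tau] @ X[:, tau:].T / (T - tau)
    Ctau = (Ctau + Ctau.T) / 2                # symmetrise for stability
    _, V = eigh(Ctau, C0)                     # normalised so V^T C0 V = I
    return V.T                                # rows are unmixing directions

# Two AR(1) sources with different time constants, linearly mixed
rng = np.random.default_rng(7)
T = 20000
S = np.zeros((2, T))
for t in range(1, T):
    S[:, t] = np.array([0.9, 0.3]) * S[:, t - 1] + rng.normal(size=2)
A = rng.normal(size=(2, 2))
B = second_order_unmix(A @ S)
print(np.round(B @ A, 2))  # ~ scaled permutation if separation worked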
What’s the best method?
A basic benchmark
• l = 8 sources
• m = 40,000 samples
• Benchmark data from [Bach and Jordan, 2002]
• Average over 24 repetitions
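The results below are reported as 100 × Amari error. The slides do not define this index; for reference, here is a sketch of the standard Amari performance index of P = BA (my implementation, normalised to [0, 1]):

import numpy as np

def amari_error(P):
    """Standard Amari index: zero iff P is a scaled permutation matrix."""
    P = np.abs(np.asarray(P))
    l = P.shape[0]
    rows = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    cols = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (rows.sum() + cols.sum()) / (2 * l * (l - 1))

print(amari_error(np.eye(8)))        # perfect demixing: 0
print(amari_error(np.ones((8, 8))))  # complete mixing: 1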
A basic benchmark: results
Adaptive contrasts outperform fixed nonlinearities
[Figure: demixing quality (100 × Amari error) for Jade, KDICA, MICA, FKICA, QGD, RAD, MILCA; the adaptive contrasts achieve lower error]
A basic benchmark: computational cost
[Figure: demixing quality (100 × Amari error) and run times (seconds, log scale) for Jade, KDICA, MICA, FKICA, QGD, RAD, MILCA]
• Best runtime among adaptive contrasts: the fast entropy estimates
• Kernel methods: Newton outperforms gradient descent
• Spacings/k-nn entropy contrasts are slowest
High frequency perturbations
• Two sources, sinusoidal perturbations to Gaussian
• Random mixing angle.
• Results averaged over 25 datasets, m = 1000
[Figure: the perturbed source density p(s) and its spectrum F[p](f)]
High frequency perturbations
[Figure: 100 × Amari error vs. perturbation frequency for FKICA, MICA, KDICA, MILCA, RAD]
• Spacings/k-nn methods perform best (but slow)
• Fast entropy estimates: narrowest effective frequency range
• Fast kernel ICA performs in between (good performance/runtime tradeoff)
Outlier resistance
• Two sources, outliers added to both mixtures
[Figure: 100 × Amari error vs. number of outliers for FKICA, MICA, KDICA, MILCA, RAD]
• Kernel ICA performs best
• Fast entropy estimates: less good (KDICA initialized with the kernel ICA solution!)
ICA algorithm choice
• Choosing an ICA approach:
  – Fastest (by far): Fast ICA [Hyvarinen et al., 2001], Jade [Cardoso, 1998]
  – Good tradeoff between speed and performance: MICA [Pham, 2004]
  – Tricky cases (outliers, non-smooth sources): Fast KICA [Shen et al., 2007, 2009]
  – Small sample size: KGV very good [Bach and Jordan, 2002]
• Some further hints:
  – Use multiple restarts (the problem is non-convex)
  – Use an independence test to check the answer
• Comparing (usually fixed contrast) algorithms:
  – Is one approach “better” than another?
  – Example: number of sources l very large, samples m small (w.r.t. l), e.g. microarray data [Lee and Batzoglou, 2003]
Selected ICA references
• Start with Cardoso’s excellent introduction [Cardoso, 1998], and the book by
Hyvarinen et al. [Hyvarinen et al., 2001]
• Fast kernel ICA is described in [Shen et al., 2007, 2009]. Characteristic
function-based ICA is described in [Eriksson and Koivunen, 2003, Chen and Bickel, 2005].
For earlier kernel ICA methods, see [Bach and Jordan, 2002, Gretton et al., 2005]
• Mutual information/entropy based: [Pham, 2004, Learned-Miller and Fisher III, 2003,
Stogbauer et al., 2004, Chen, 2006]
• Classic algorithms for time series separation with second order methods
(not covered much in this talk): [Molgedey and Schuster, 1994, Belouchrani et al., 1997]
• An important paper for optimising over orthogonal matrices: [Edelman et al.,
1998]. The Newton-like method: [Huper and Trumpf, 2004].
References

F.R. Bach and M.I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines. A blind source separation technique using second order statistics. IEEE Transactions on Signal Processing, 45(2):434–444, 1997.

M. Bethge. Factorial coding of natural images: how effective are linear models in removing higher-order dependencies? Journal of the Optical Society of America A, 23(6):1253–1268, 2006.

V. Calhoun, T. Adali, L. Hansen, J. Larsen, and J. Pekar. ICA of functional MRI data: An overview. In Proc. 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA 2003), pages 281–288, 2003.

J.-F. Cardoso. Blind signal separation: Statistical principles. Proceedings of the IEEE, Special Issue on Blind Identification and Estimation, 86(10):2009–2025, 1998.

A. Chen. Fast kernel density independent component analysis. In Sixth International Conference on ICA and BSS, volume 3889, pages 24–31, Berlin/Heidelberg, 2006. Springer-Verlag.

A. Chen and P.J. Bickel. Consistent independent component analysis and prewhitening. IEEE Transactions on Signal Processing, 53(10):3625–3632, 2005.

A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

J. Eriksson and K. Koivunen. Characteristic-function based independent component analysis. Signal Processing, 83(10):2195–2208, 2003.

A. Gretton, R. Herbrich, A.J. Smola, O. Bousquet, and B. Scholkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.

K. Huper and J. Trumpf. Newton-like methods for numerical optimisation on manifolds. In Proceedings of the Thirty-eighth Asilomar Conference on Signals, Systems and Computers, pages 136–139, 2004.

A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.

T.-P. Jung, S. Makeig, C. Humphries, T.-W. Lee, M. McKeown, V. Iragui, and T. Sejnowski. Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37:163–178, 2000.

K. Kiviluoto and E. Oja. Independent component analysis for parallel financial time series. In Proc. Int. Conf. on Neural Information Processing (ICONIP), volume 2, pages 895–898, 1998.

E.G. Learned-Miller and J.W. Fisher III. ICA using spacings estimates of entropy. Journal of Machine Learning Research, 4:1271–1295, 2003.

S.I. Lee and S. Batzoglou. Application of independent component analysis to microarrays. Genome Biology, 4(11):R76, 2003.

L. Molgedey and H. Schuster. Separation of a mixture of independent signals using time delayed correlation. Physical Review Letters, 72(23):3634–3637, 1994.

D.-T. Pham. Fast algorithms for mutual information based independent component analysis. IEEE Transactions on Signal Processing, 52(10):2690–2700, 2004.

H. Shen and K. Huper. Newton-like methods for parallel independent component analysis. In MLSP 16, pages 283–288, Maynooth, Ireland, 2006.

H. Shen, S. Jegelka, and A. Gretton. Fast kernel ICA using an approximate Newton method. In AISTATS 11, pages 476–483. Microtome, 2007.

H. Shen, S. Jegelka, and A. Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 2009. In press.

H. Stogbauer, A. Kraskov, S. Astakhov, and P. Grassberger. Least dependent component analysis based on mutual information. Phys. Rev. E, 70(6):066123, 2004.