
Faculty of Mathematics, University of Zagreb, Graduate course 2011-2012.
"Blind source separation and independent component analysis"

Lecture II: Mathematical preliminaries

Ivica Kopriva
Ruđer Bošković Institute
e-mail: ikopriva@irb.hr, ikopriva@gmail.com

Course outline

• Motivation with illustration of applications (lecture I)
• Mathematical preliminaries with principal component analysis (PCA) (lecture II)
• Independent component analysis (ICA) for linear static problems: information-theoretic approaches (lecture III)
• ICA for linear static problems: algebraic approaches (lecture IV)
• ICA for linear static problems with noise (lecture V)
• Dependent component analysis (DCA) (lecture VI)

Course outline

• Underdetermined blind source separation (BSS) and sparse component analysis (SCA) (lecture VII/VIII)
• Nonnegative matrix factorization (NMF) for determined and underdetermined BSS problems (lecture VIII/IX)
• BSS from linear convolutive (dynamic) mixtures (lecture X/XI)
• Nonlinear BSS (lecture XI/XII)
• Tensor factorization (TF): BSS of multidimensional sources and feature extraction (lecture XIII/XIV)

Mathematical preliminaries

• Expectations, moments, cumulants, kurtosis
• Gradient and optimization: vector gradient, Jacobian, matrix gradient, natural and relative gradient, gradient descent, stochastic gradient descent (batch and online)
• Estimation theory
• Information theory: entropy, differential entropy, entropy of a transformation, mutual information, distributions with maximum entropy, negentropy, entropy approximation: Gram-Charlier expansion
• PCA and whitening (batch and online)

Mathematical preliminaries

The rth order moment of the scalar stochastic process z is defined as

$M_r(z) = E[z^r] = \int_{-\infty}^{\infty} z^r f(z)\,dz \approx \frac{1}{T}\sum_{t=1}^{T} z^r(t)$

where f(z) is the pdf of the random variable z and T represents the sample size. For real stochastic processes z1 and z2 the cross-moment is defined as

$M_{rp}(z_1,z_2) = E\left[z_1^r z_2^p\right] \approx \frac{1}{T}\sum_{t=1}^{T} z_1^r(t)\, z_2^p(t)$

The cross-moment can also be defined for nonzero time shifts as

$M_{r,z}(\tau_1,\tau_2,\ldots,\tau_{r-1}) = E\left[z_i(t)\, z_j(t+\tau_1)\, z_k(t+\tau_2)\cdots z_l(t+\tau_{r-1})\right] \approx \frac{1}{T}\sum_{t=1}^{T} z_i(t)\, z_j(t+\tau_1)\, z_k(t+\tau_2)\cdots z_l(t+\tau_{r-1})$
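
As a quick numerical illustration, the sample-mean estimators above are one-liners in MATLAB. This is a minimal sketch; the test signals, r, p and T are arbitrary choices, not from the slides.

MATLAB code:
T = 10000;                          % sample size
z = rand(1,T) - 0.5;                % uniform test signal on [-0.5, 0.5]
r = 4;
Mr = mean(z.^r);                    % M_r(z) ~ (1/T) sum_t z(t)^r
z1 = randn(1,T); z2 = randn(1,T);   % two real stochastic processes
p = 2;
Mrp = mean(z1.^r .* z2.^p);         % cross-moment M_rp(z1,z2) ~ E[z1^r z2^p]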

Mathematical preliminaries

The first characteristic function of the random process z is defined as

$\Psi_z(\omega) = E\left[e^{j\omega z}\right] = \int_{-\infty}^{\infty} e^{j\omega z} f(z)\,dz$

The power series expansion of $\Psi_z(\omega)$ is obtained as

$\Psi_z(\omega) = E\left[1 + j\omega z + \frac{(j\omega)^2}{2!}z^2 + \ldots + \frac{(j\omega)^n}{n!}z^n + \ldots\right] = 1 + j\omega E[z] + \frac{(j\omega)^2}{2!}E\left[z^2\right] + \ldots + \frac{(j\omega)^n}{n!}E\left[z^n\right] + \ldots$

Moments of the random process z are obtained from the derivatives at the origin:

$E\left[z^n\right] = \frac{1}{j^n}\left[\frac{d^n \Psi_z(\omega)}{d\omega^n}\right]_{\omega=0}$

Mathematical preliminaries

The second characteristic function of the random process z is defined as

$K_z(\omega) = \ln \Psi_z(\omega)$

The power series expansion of $K_z(\omega)$ is obtained as

$K_z(\omega) = C_1(z)(j\omega) + C_2(z)\frac{(j\omega)^2}{2!} + \ldots + C_n(z)\frac{(j\omega)^n}{n!} + \ldots$

where $C_1, C_2, \ldots, C_n$ are cumulants of the appropriate order, defined as

$C_n(z) = \frac{1}{j^n}\left[\frac{d^n K_z(\omega)}{d\omega^n}\right]_{\omega=0}$

Mathematical preliminaries

In practice cumulants, as well as moments, must be estimated from data. For that purpose the relations between cumulants and moments are important. It follows from the definitions of the first and second characteristic functions that

$\exp\left\{K_z(\omega)\right\} = \Psi_z(\omega)$

i.e.,

$\exp\left\{C_1(z)(j\omega)\right\}\exp\left\{C_2(z)\frac{(j\omega)^2}{2!}\right\}\cdots\exp\left\{C_n(z)\frac{(j\omega)^n}{n!}\right\}\cdots = 1 + j\omega E[z] + \frac{(j\omega)^2}{2!}E\left[z^2\right] + \ldots + \frac{(j\omega)^n}{n!}E\left[z^n\right] + \ldots$

After the exponential functions are expanded in Taylor series, the relations between cumulants and moments are obtained by equating coefficients of the same power of $j\omega$.

Mathematical preliminaries

Relations between moments and cumulants:1-3

$E[z] = C_1(z)$
$E\left[z^2\right] = C_2(z) + C_1^2(z)$
$E\left[z^3\right] = C_3(z) + 3C_2(z)C_1(z) + C_1^3(z)$
$E\left[z^4\right] = C_4(z) + 4C_3(z)C_1(z) + 3C_2^2(z) + 6C_2(z)C_1^2(z) + C_1^4(z)$

For a zero mean process the relations simplify to:

$E\left[z^2\right] = C_2(z)$
$E\left[z^3\right] = C_3(z)$
$E\left[z^4\right] = C_4(z) + 3C_2^2(z)$

1 K.S. Shanmugan, A.M. Breiphol, Random Signals: Detection, Estimation and Data Analysis, J. Wiley, 1988.
2 D.R. Brillinger, Time Series Data Analysis and Theory, McGraw-Hill, 1981.
3 P. McCullagh, Tensor Methods in Statistics, Chapman & Hall, 1995.

Mathematical preliminaries

Relations between cumulants and moments follow from the previous expressions as

$C_1(z) = E[z]$
$C_2(z) = E\left[z^2\right] - E^2[z]$
$C_3(z) = E\left[z^3\right] - 3E\left[z^2\right]E[z] + 2E^3[z]$
$C_4(z) = E\left[z^4\right] - 4E\left[z^3\right]E[z] - 3E^2\left[z^2\right] + 12E\left[z^2\right]E^2[z] - 6E^4[z]$

and for zero mean processes the relations simplify to

$C_2(z) = E\left[z^2\right]$
$C_3(z) = E\left[z^3\right]$
$C_4(z) = E\left[z^4\right] - 3E^2\left[z^2\right]$
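
The zero mean relations translate directly into sample estimators. A minimal MATLAB sketch (the Gaussian test signal is an arbitrary choice):

MATLAB code:
T = 100000;
z = randn(1,T);                     % zero mean Gaussian test signal
C2 = mean(z.^2);                    % C2(z) = E[z^2]
C3 = mean(z.^3);                    % C3(z) = E[z^3]
C4 = mean(z.^4) - 3*mean(z.^2)^2;   % C4(z) = E[z^4] - 3 E^2[z^2]
% for a Gaussian signal C3 and C4 should come out near zero (cf. [CP8] below)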

Mathematical preliminaries

As with cross-moments, it is possible to define cross-cumulants. They are useful for measuring statistical dependence between random processes. For zero mean processes, second order cross-cumulants are defined as

$C_z(i,j) = E\left[z_i(t)\, z_j(t+\tau)\right]$

and third order cross-cumulants as

$C_z(i,j,k) = E\left[z_i(t)\, z_j(t+\tau_1)\, z_k(t+\tau_2)\right]$

Several third order cross-cumulants can be computed, for example

$C^{12}_{z_iz_j}(\tau_1,\tau_2)$, $C^{111}_{z_iz_jz_k}(\tau_1,\tau_2)$, $C^{21}_{z_iz_j}(\tau_1,\tau_2)$, $C^{3}_{z_i}(\tau_1,\tau_2)$

Mathematical preliminaries

Fourth order cross-cumulants are defined as

$C_z(i,j,k,l) = \mathrm{cum}\left[z_i(t),\, z_j(t+\tau_1),\, z_k(t+\tau_2),\, z_l(t+\tau_3)\right]$
$= E\left[z_i(t)\, z_j(t+\tau_1)\, z_k(t+\tau_2)\, z_l(t+\tau_3)\right]$
$- E\left[z_i(t)\, z_j(t+\tau_1)\right] E\left[z_k(t+\tau_2)\, z_l(t+\tau_3)\right]$
$- E\left[z_i(t)\, z_k(t+\tau_2)\right] E\left[z_j(t+\tau_1)\, z_l(t+\tau_3)\right]$
$- E\left[z_i(t)\, z_l(t+\tau_3)\right] E\left[z_j(t+\tau_1)\, z_k(t+\tau_2)\right]$

Several fourth order cross-cumulants can be computed, for example

$C^{4}_{z_i}(\tau_1,\tau_2,\tau_3)$, $C^{13}_{z_iz_j}(\tau_1,\tau_2,\tau_3)$, $C^{31}_{z_iz_j}(\tau_1,\tau_2,\tau_3)$, $C^{22}_{z_iz_j}(\tau_1,\tau_2,\tau_3)$, $C^{1111}_{z_iz_jz_kz_l}(\tau_1,\tau_2,\tau_3)$

Mathematical preliminaries

If two random processes z1(t) and z2(t) are mutually independent then

$E\left[g(z_1)h(z_2)\right] = E\left[g(z_1)\right] E\left[h(z_2)\right]$

Proof:

$E\left[g(z_1)h(z_2)\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(z_1)\,h(z_2)\, p_{z_1z_2}(z_1,z_2)\, dz_1\,dz_2 = \int_{-\infty}^{\infty} g(z_1)\,p_{z_1}(z_1)\,dz_1 \int_{-\infty}^{\infty} h(z_2)\,p_{z_2}(z_2)\,dz_2 = E\left[g(z_1)\right] E\left[h(z_2)\right]$

where the factorization $p_{z_1z_2}(z_1,z_2) = p_{z_1}(z_1)\,p_{z_2}(z_2)$ holds by independence.
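
A quick numerical check of this factorization, as a sketch (g, h and the distributions are arbitrary illustrative choices):

MATLAB code:
T = 100000;
z1 = randn(1,T); z2 = rand(1,T);    % independent by construction
g = @(z) z.^2; h = @(z) cos(z);
lhs = mean(g(z1).*h(z2));           % E[g(z1)h(z2)]
rhs = mean(g(z1))*mean(h(z2));      % E[g(z1)] E[h(z2)]
fprintf('lhs = %.4f, rhs = %.4f\n', lhs, rhs);   % the two should agree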

Mathematical preliminaries

Operator properties of the cumulants. Cumulants have certain properties that make them very useful in various computations with stochastic processes. Most of these properties will be stated without proofs, which can be found in2,4.

[CP1] If $\{\alpha_i\}_{i=1}^{n}$ are constants and $\{z_i\}_{i=1}^{n}$ are random variables then

$\mathrm{cum}\left(\alpha_1 z_1, \alpha_2 z_2, \ldots, \alpha_n z_n\right) = \left(\prod_{i=1}^{n}\alpha_i\right)\mathrm{cum}\left(z_1, z_2, \ldots, z_n\right)$

[CP2] Cumulants are additive in their arguments:

$\mathrm{cum}\left(x+y, z_1, \ldots, z_n\right) = \mathrm{cum}\left(x, z_1, \ldots, z_n\right) + \mathrm{cum}\left(y, z_1, \ldots, z_n\right)$

4 J. Mendel, "Tutorial on Higher-Order Statistics (Spectra) in Signal Processing and System Theory: Theoretical Results and Some Applications," Proc. IEEE, vol. 79, no. 3, March 1991, pp. 278-305.

Mathematical preliminaries

[CP3] Cumulants are symmetric in their arguments, i.e.

$\mathrm{cum}\left(z_1, z_2, \ldots, z_k\right) = \mathrm{cum}\left(z_{i_1}, z_{i_2}, \ldots, z_{i_k}\right)$

where $(i_1, i_2, \ldots, i_k)$ is any permutation of $(1, 2, \ldots, k)$.

[CP4] If $\alpha$ is a constant then

$\mathrm{cum}\left(\alpha + z_1, z_2, \ldots, z_k\right) = \mathrm{cum}\left(z_1, z_2, \ldots, z_k\right)$

[CP5] If the random processes $(z_1, z_2, \ldots, z_k)$ and $(y_1, y_2, \ldots, y_k)$ are independent then

$\mathrm{cum}\left(y_1+z_1, y_2+z_2, \ldots, y_k+z_k\right) = \mathrm{cum}\left(y_1, y_2, \ldots, y_k\right) + \mathrm{cum}\left(z_1, z_2, \ldots, z_k\right)$

Mathematical preliminaries

[CP6] If any subset of the k random variables $\{z_i\}_{i=1}^{k}$ is statistically independent from the rest then

$\mathrm{cum}\left(z_1, z_2, \ldots, z_k\right) = 0$

[CP7] Cumulants of independent identically distributed (i.i.d.) random processes are delta functions, i.e.

$C_{k,z}(\tau_1, \tau_2, \ldots, \tau_{k-1}) = \gamma_{k,z}\,\delta(\tau_1)\,\delta(\tau_2)\cdots\delta(\tau_{k-1})$

where $\gamma_{k,z} = C_{k,z}$.

Mathematical preliminaries

[CP8] If the random process is Gaussian, $N(\mu,\sigma^2)$, then $C_{k,z} = 0$ for $k > 2$.

Proof. Let f(z) be the pdf of the normally distributed random process z,

$f(z) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(z-\mu)^2}{2\sigma^2}\right\}$

where $\mu = E(z) = C_1(z)$ is the mean value, or first order cumulant, of the random process z, and $\sigma^2 = E\left[z^2\right] - E^2[z] = C_2(z)$ is the variance, or second order cumulant, of the random process z.

According to the definition of the first characteristic function,

$\Psi_z(\omega) = E\left[e^{j\omega z}\right] = \int_{-\infty}^{\infty} f(z)\, e^{j\omega z}\, dz = \exp\left\{j\mu\omega + \frac{\sigma^2(j\omega)^2}{2}\right\} \times \underbrace{\int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(z-\hat\mu)^2}{2\sigma^2}\right\}dz}_{=1}$

Mathematical preliminaries

where $\hat\mu = \mu + j\sigma^2\omega$. By the definition of the second characteristic function,

$K_z(\omega) = \ln\Psi_z(\omega) = j\mu\omega + \frac{\sigma^2(j\omega)^2}{2}$

Comparing this with the Taylor series expansion of $K_z(\omega)$, it follows that $C_k(z) = 0$ for $k > 2$.

The complexity of cumulants of order higher than 4 is very high. An important question from the application point of view is which statistics of order higher than 2 should be used. Because symmetrically distributed random processes, which occur very often in engineering applications, have odd order statistics equal to zero, third order statistics are not used. Hence, the first statistics of order higher than 2 that are used are the fourth order statistics.

Mathematical preliminaries

Another natural question is: why and when are cumulants better than moments?

• Computation with cumulants is simpler than with moments.
• For independent random variables the cumulant of the sum equals the sum of the cumulants.
• Cross-cumulants of independent random processes are zero, which does not hold for moments.
• For normally distributed processes cumulants of order greater than 2 vanish, which does not hold for moments.
• Blind identification of LTI systems (MA, AR and ARMA) can be formulated using cumulants, provided that the input signal is a non-Gaussian i.i.d. random process.

Mathematical preliminaries

Density of a transformation. Assume that both x and y are n-dimensional random vectors related through the vector mapping y = g(x). It is assumed that the inverse mapping x = g-1(y) exists and is unique. The density py(y) is obtained from the density px(x) as follows5

$p_y(\mathbf{y}) = \frac{p_x\left(\mathbf{g}^{-1}(\mathbf{y})\right)}{\left|\det J\mathbf{g}\left(\mathbf{g}^{-1}(\mathbf{y})\right)\right|}$

where Jg is the Jacobian matrix representing the gradient of the vector-valued function g:

$J\mathbf{g}(\mathbf{x}) = \begin{bmatrix} \frac{\partial g_1(\mathbf{x})}{\partial x_1} & \frac{\partial g_2(\mathbf{x})}{\partial x_1} & \ldots & \frac{\partial g_n(\mathbf{x})}{\partial x_1} \\ \frac{\partial g_1(\mathbf{x})}{\partial x_2} & \frac{\partial g_2(\mathbf{x})}{\partial x_2} & \ldots & \frac{\partial g_n(\mathbf{x})}{\partial x_2} \\ \vdots & \vdots & & \vdots \\ \frac{\partial g_1(\mathbf{x})}{\partial x_n} & \frac{\partial g_2(\mathbf{x})}{\partial x_n} & \ldots & \frac{\partial g_n(\mathbf{x})}{\partial x_n} \end{bmatrix}$

For a linear nonsingular transformation y = Ax the relation between the densities becomes

$p_y(\mathbf{y}) = \frac{1}{\left|\det \mathbf{A}\right|}\, p_x(\mathbf{x})$

5 A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, 3rd ed., 1991.

Mathematical preliminaries

Kurtosis. The measure called kurtosis quantifies how far a stochastic process is from the Gaussian distribution. Kurtosis is defined as

$\kappa(z_i) = \frac{C_4(z_i)}{C_2^2(z_i)}$

where C4(zi) is the fourth order (FO) and C2(zi) the second order (SO) cumulant of the signal zi. For zero mean signals kurtosis becomes

$\kappa(z_i) = \frac{E\left[z_i^4\right]}{E^2\left[z_i^2\right]} - 3$

Mathematical preliminaries

According to cumulant property [CP8] the FO cumulant of a Gaussian process is zero. Signals with positive kurtosis are classified as super-Gaussian and signals with negative kurtosis as sub-Gaussian. Most communication signals are sub-Gaussian processes. Speech and music signals are super-Gaussian processes. Uniformly distributed noise is a sub-Gaussian process. Impulsive noise (with a Laplacian or Lorentz distribution) is a highly super-Gaussian process.
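
These classifications are easy to verify empirically. A minimal sketch (the noise models are illustrative choices):

MATLAB code:
T = 100000;
kurt = @(z) mean(z.^4)/mean(z.^2)^2 - 3;   % kurtosis of a zero mean signal
zu = rand(1,T) - 0.5;                      % uniform noise, zero mean
u = rand(1,T) - 0.5;
zl = -sign(u).*log(1 - 2*abs(u));          % Laplacian noise via inverse CDF
fprintf('uniform: %.2f, Laplacian: %.2f\n', kurt(zu), kurt(zl));
% uniform comes out negative (sub-Gaussian), Laplacian positive (super-Gaussian)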

Mathematical preliminaries

Gradients and optimization (learning) methods.

• Gradients are used in minimization or maximization of cost functions. This process is called learning.
• The solution of the ICA problem is also obtained as a solution of a minimization or maximization problem of a scalar cost function w.r.t. a matrix argument.
• A scalar function g with a vector argument is defined as g = g(w1, w2, …, wm) = g(w). The vector gradient w.r.t. w is defined as

$\frac{\partial g}{\partial \mathbf{w}} = \left(\frac{\partial g}{\partial w_1}, \ldots, \frac{\partial g}{\partial w_m}\right)^{T}$

Another commonly used notation for the vector gradient is $\nabla g$ or $\nabla_{\mathbf{w}} g$.

Mathematical preliminaries

Gradients and optimization (learning) methods.

A vector valued function is defined as

$\mathbf{g}(\mathbf{w}) = \begin{pmatrix} g_1(\mathbf{w}) \\ \vdots \\ g_n(\mathbf{w}) \end{pmatrix}$

The gradient of a vector valued function is called the Jacobian matrix of g and is defined as

$J\mathbf{g}(\mathbf{w}) = \begin{bmatrix} \frac{\partial g_1(\mathbf{w})}{\partial w_1} & \frac{\partial g_2(\mathbf{w})}{\partial w_1} & \ldots & \frac{\partial g_n(\mathbf{w})}{\partial w_1} \\ \frac{\partial g_1(\mathbf{w})}{\partial w_2} & \frac{\partial g_2(\mathbf{w})}{\partial w_2} & \ldots & \frac{\partial g_n(\mathbf{w})}{\partial w_2} \\ \vdots & \vdots & & \vdots \\ \frac{\partial g_1(\mathbf{w})}{\partial w_m} & \frac{\partial g_2(\mathbf{w})}{\partial w_m} & \ldots & \frac{\partial g_n(\mathbf{w})}{\partial w_m} \end{bmatrix}$

Mathematical preliminaries

Gradients and optimization (learning) methods.

In solving the ICA problem a scalar-valued function g with matrix argument W is defined as g = g(W) = g(w11, …, wij, …, wmn). The matrix gradient is defined as

$\frac{\partial g}{\partial \mathbf{W}} = \begin{pmatrix} \frac{\partial g}{\partial w_{11}} & \ldots & \frac{\partial g}{\partial w_{1n}} \\ \vdots & & \vdots \\ \frac{\partial g}{\partial w_{m1}} & \ldots & \frac{\partial g}{\partial w_{mn}} \end{pmatrix}$

Mathematical preliminaries

Example: matrix gradient of the determinant of a matrix (needed in some ICA algorithms).

For an invertible square matrix W the gradient of the determinant is obtained as

$\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = \det \mathbf{W}\left(\mathbf{W}^{-1}\right)^{T}$

Proof. The following definition of the inverse matrix is used:

$\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}}\,\mathrm{adj}(\mathbf{W})$

where adj(W) stands for the adjoint of W and is defined as

$\mathrm{adj}(\mathbf{W}) = \begin{pmatrix} W_{11} & \ldots & W_{n1} \\ \vdots & & \vdots \\ W_{1n} & \ldots & W_{nn} \end{pmatrix}$

and the scalars Wij are called cofactors.

Mathematical preliminaries

The determinant can be expressed in terms of cofactors as

$\det \mathbf{W} = \sum_{k=1}^{n} w_{ik} W_{ik}$

It follows that

$\frac{\partial \det \mathbf{W}}{\partial w_{ij}} = W_{ij} \;\Rightarrow\; \frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = \left(\mathrm{adj}(\mathbf{W})\right)^{T} = \det \mathbf{W}\left(\mathbf{W}^{-1}\right)^{T}$

The previous result also implies

$\frac{\partial \log\left|\det \mathbf{W}\right|}{\partial \mathbf{W}} = \frac{1}{\det \mathbf{W}}\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = \left(\mathbf{W}^{-1}\right)^{T}$
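
The log-determinant gradient is easy to check numerically by finite differences. A minimal sketch (W and the step size are arbitrary):

MATLAB code:
W = [2 1; 1 3]; h = 1e-6;
G = zeros(2);
for i = 1:2
  for j = 1:2
    Wp = W; Wp(i,j) = Wp(i,j) + h;         % perturb one entry
    G(i,j) = (log(abs(det(Wp))) - log(abs(det(W))))/h;
  end
end
disp(G); disp(inv(W)');                    % the two matrices should agree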

Mathematical preliminaries

Natural or relative gradient.6,7 The parameter space of square matrices has a Riemannian metric structure, and the Euclidean gradient does not point into the steepest direction of the cost function J(W). It has to be corrected in order to obtain the natural or relative gradient

$\frac{\partial J}{\partial \mathbf{W}}\mathbf{W}^{T}\mathbf{W}$

i.e. the correction term is given by WTW.

6 S. Amari, "Natural gradient works efficiently in learning," Neural Computation 10(2), pp. 251-276, 1998.
7 J. F. Cardoso and B. Laheld, "Equivariant adaptive source separation," IEEE Trans. Signal Processing 44(12), pp. 3017-3030, 1996.

Mathematical preliminaries

Proof. We shall follow the derivation given by Cardoso in7. Write the Taylor series of J(W + δW):

$J(\mathbf{W} + \delta\mathbf{W}) = J(\mathbf{W}) + \mathrm{trace}\left[\left(\frac{\partial J}{\partial \mathbf{W}}\right)^{T}\delta\mathbf{W}\right] + \ldots$

Let us require that the displacement δW is always proportional to W itself, δW = DW. Then

$J(\mathbf{W} + \mathbf{D}\mathbf{W}) = J(\mathbf{W}) + \mathrm{trace}\left[\left(\frac{\partial J}{\partial \mathbf{W}}\right)^{T}\mathbf{D}\mathbf{W}\right] + \ldots$

Mathematical preliminaries

Using the property of the trace operator trace(M1M2) = trace(M2M1) we can write

$\mathrm{trace}\left[\left(\frac{\partial J}{\partial \mathbf{W}}\right)^{T}\mathbf{D}\mathbf{W}\right] = \mathrm{trace}\left[\mathbf{W}\left(\frac{\partial J}{\partial \mathbf{W}}\right)^{T}\mathbf{D}\right]$

We seek the largest decrement in the value of J(W + DW) - J(W), which is obtained when the term $\mathrm{trace}\left[\mathbf{W}\left(\frac{\partial J}{\partial \mathbf{W}}\right)^{T}\mathbf{D}\right]$ is minimized. That happens when D is proportional to $-\frac{\partial J}{\partial \mathbf{W}}\mathbf{W}^{T}$. Because δW = DW, we obtain the gradient descent learning rule

$\Delta\mathbf{W} \propto -\frac{\partial J}{\partial \mathbf{W}}\mathbf{W}^{T}\mathbf{W}$

implying that the correction term is WTW.

Mathematical preliminaries

Exercise. Let the scalar function with matrix argument be defined as f(A) = (y - Ax)T(y - Ax), where y and x are vectors.

The Euclidean gradient is given by: ∂f/∂A = -2(y - Ax)xT.

The Riemannian (relative) gradient is given by: (∂f/∂A)ATA = -2(y - Ax)xTATA.

Euclidean gradient based learning equation: A(k+1) = A(k) - µ(∂f/∂A).

Riemannian gradient based learning equation: A(k+1) = A(k) - µ(∂f/∂A)ATA.

Evaluate the convergence properties of these two learning equations starting from the initial condition A0 = I. Assume the true values: A = [2 1; 1 2], x = [1 1]T, y = Ax = [3 3]T.
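
A minimal MATLAB sketch of this exercise (the learning gain µ and the iteration count are arbitrary choices):

MATLAB code:
x = [1; 1]; y = [3; 3];            % true A = [2 1; 1 2], y = A*x
Ae = eye(2); An = eye(2); mu = 0.01;
for k = 1:200
  Ge = -2*(y - Ae*x)*x';           % Euclidean gradient df/dA
  Ae = Ae - mu*Ge;
  Gn = -2*(y - An*x)*x'*(An'*An);  % relative gradient (df/dA)*A'*A
  An = An - mu*Gn;
end
disp([Ae*x An*x]);                 % both should approach y = [3; 3]
% note that y = Ax alone does not determine A uniquely, so the iterates need
% not reach [2 1; 1 2]; compare the residuals and the convergence speed instead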

Mathematical preliminaries

Gradient descent and stochastic gradient descent. ICA, like many other statistical and neural network techniques, is data driven, i.e. we need observation data (x) in order to solve the problem. A typical cost function has the form

$J(\mathbf{W}) = E\left\{g(\mathbf{W}, \mathbf{x})\right\}$

where the expectation is defined w.r.t. some unknown density f(x). The steepest descent learning rule becomes

$\mathbf{W}(t) = \mathbf{W}(t-1) - \mu(t)\left.\frac{\partial}{\partial \mathbf{W}} E\left\{g(\mathbf{W}, \mathbf{x}(t))\right\}\right|_{\mathbf{W}(t-1)}$

In practice the expectation has to be approximated by the sample mean of the function g over the sample x(1), x(2), …, x(T). The algorithm where the entire training set is used at every iteration step is called batch learning.

Mathematical preliminaries

Gradient descent and stochastic gradient descent. Sometimes not all the observations are known in advance and only the latest x(t) can be used. This is called on-line or adaptive learning. The on-line learning rule is obtained by dropping the expectation operator from the batch learning rule, i.e.

$\mathbf{W}(t) = \mathbf{W}(t-1) - \mu(t)\left.\frac{\partial}{\partial \mathbf{W}} g(\mathbf{W}, \mathbf{x})\right|_{\mathbf{W}(t-1)}$

The parameter µ(t) is called the learning gain and regulates the smoothness and speed of convergence. On-line and batch algorithms converge toward the same solution, which represents a fixed point of the first order autonomous differential equation

$\frac{d\mathbf{W}}{dt} = -E\left\{\frac{\partial}{\partial \mathbf{W}} g(\mathbf{W}, \mathbf{x})\right\}$

The learning rate µ(t) should be chosen as

$\mu(t) = \frac{\beta}{t + \beta}$

where β is an appropriate constant (e.g. β = 100).

Mathematical preliminaries

Information theory: entropy, negentropy and mutual information.

The entropy of a random variable can be interpreted as the degree of information that observation of the variable gives. For a discrete random variable X the entropy H is defined as

$H(X) = -\sum_i P(X = a_i)\log P(X = a_i) = \sum_i f\left(P(X = a_i)\right), \qquad f(p) = -p\log p$

Entropy can be generalized to continuous random variables and vectors, in which case it is called the differential entropy:

$H(x) = -\int p_x(\xi)\log p_x(\xi)\, d\xi = -E\left[\log p_x(x)\right]$

Unlike discrete entropy, differential entropy can be negative. Entropy is small when the variable is not very random, i.e. when it is contained in some limited intervals with high probabilities.

Mathematical preliminaries

Maximum entropy distributions: the finite support case.

Uniform distribution on the interval [0, a]. The density is given by

$p_x(\xi) = \begin{cases} 1/a & \text{for } 0 \le \xi \le a \\ 0 & \text{otherwise} \end{cases}$

The differential entropy is evaluated as

$H(x) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\, d\xi = \log a$

Entropy is small (in the sense of negative numbers, i.e. large in absolute value) when a is small. In the limit,

$\lim_{a \to 0} H(x) = -\infty$

The uniform distribution is the maximum entropy distribution among distributions with finite support.

Mathematical preliminaries

Maximum entropy distributions: the infinite support case.

Which distribution is compatible with the measurements ci = E{Fi(x)},

$\int p(\xi)\, F_i(\xi)\, d\xi = c_i, \qquad i = 1, \ldots, m$

and makes the minimum number of assumptions on the data? The maximum entropy distribution. It has been shown in5 that the density p0(ξ) which satisfies the previous constraints and has maximum entropy is of the form

$p_0(\xi) = A\exp\left(\sum_i a_i F_i(\xi)\right)$

For a set of random variables that take all values on the real line and have zero mean and fixed variance, say 1, the maximum entropy distribution takes the form

$p_0(\xi) = A\exp\left(a_1\xi + a_2\xi^2\right)$

which is recognized as a Gaussian variable. The Gaussian has the largest entropy among all random variables of finite (unit) variance with infinite support.

5 A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, 3rd ed., 1991.

Mathematical preliminaries

Entropy of a transformation. Consider an invertible transformation of the random vector x,

$\mathbf{y} = \mathbf{f}(\mathbf{x})$

The relation between the densities is given by

$p_y(\boldsymbol{\eta}) = p_x\left(\mathbf{f}^{-1}(\boldsymbol{\eta})\right)\left|\det J\mathbf{f}\left(\mathbf{f}^{-1}(\boldsymbol{\eta})\right)\right|^{-1}$

Expressing entropy as an expectation we obtain

$H(\mathbf{y}) = -E\left\{\log p_y(\mathbf{y})\right\} = -E\left\{\log\left[p_x(\mathbf{x})\left|\det J\mathbf{f}(\mathbf{x})\right|^{-1}\right]\right\} = E\left\{\log\left|\det J\mathbf{f}(\mathbf{x})\right|\right\} - E\left\{\log p_x(\mathbf{x})\right\} = H(\mathbf{x}) + E\left\{\log\left|\det J\mathbf{f}(\mathbf{x})\right|\right\}$

Entropy is thus changed by the transformation. For the linear transform y = Ax we obtain

$H(\mathbf{y}) = H(\mathbf{x}) + \log\left|\det \mathbf{A}\right|$

Differential entropy is not scale invariant: H(αx) = H(x) + log|α|. The scale must be fixed before differential entropy is measured.

Mathematical preliminaries

Mutual information is a measure of the information that members of a set of random variables have about the other random variables in the set (vector). One measure of mutual information is the distance between the joint density and the product of the marginal densities, measured by the Kullback-Leibler divergence:

$MI(\mathbf{y}) = D\left(p(\mathbf{y}),\, \prod_{i=1}^{N} p_i(y_i)\right) = \int p(\mathbf{y})\log\frac{p(\mathbf{y})}{\prod_{i=1}^{N} p_i(y_i)}\, d\mathbf{y} = \sum_{i=1}^{N} H(y_i) - H(\mathbf{y})$

Using Jensen's inequality8 it can be shown that MI(y) ≥ 0, and

$MI(\mathbf{y}) = 0 \;\Leftrightarrow\; p(\mathbf{y}) = \prod_{i=1}^{N} p_i(y_i)$

Mutual information can be used as a measure of statistical (in)dependence.

8 T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, 1991.

Mathematical preliminaries

Negentropy. The fact that the Gaussian variable has maximum entropy makes it possible to use entropy as a measure of non-Gaussianity. The measure called negentropy is defined to be zero for a Gaussian variable and positive otherwise:

$J(\mathbf{x}) = H(\mathbf{x}_{gauss}) - H(\mathbf{x})$

where xgauss is a Gaussian random vector with the same covariance matrix Σ as x. The entropy of a Gaussian random vector can be evaluated as

$H(\mathbf{x}_{gauss}) = \frac{1}{2}\log\left|\det\boldsymbol{\Sigma}\right| + \frac{N}{2}\left[1 + \log 2\pi\right]$

Negentropy has the additional property of being invariant under an invertible linear transformation y = Ax:

$J(\mathbf{A}\mathbf{x}) = \frac{1}{2}\log\left|\det\left(\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^{T}\right)\right| + \frac{N}{2}\left[1 + \log 2\pi\right] - \left(H(\mathbf{x}) + \log\left|\det\mathbf{A}\right|\right)$
$= \frac{1}{2}\log\left|\det\boldsymbol{\Sigma}\right| + \log\left|\det\mathbf{A}\right| + \frac{N}{2}\left[1 + \log 2\pi\right] - H(\mathbf{x}) - \log\left|\det\mathbf{A}\right|$
$= \frac{1}{2}\log\left|\det\boldsymbol{\Sigma}\right| + \frac{N}{2}\left[1 + \log 2\pi\right] - H(\mathbf{x}) = H(\mathbf{x}_{gauss}) - H(\mathbf{x}) = J(\mathbf{x})$

Mathematical preliminaries

Approximation of entropy by cumulants. Entropy and negentropy are important measures of non-Gaussianity and thus very important for ICA. However, computation of entropy involves evaluation of an integral over the density function. This is computationally very difficult and would limit the use of entropy/negentropy in algorithm design.

To obtain computationally tractable algorithms, the density function used in the entropy integral is approximated using either the Edgeworth or the Gram-Charlier expansion, which are Taylor-series-like expansions around the Gaussian density.

The Gram-Charlier expansion of a standardized random variable x with non-Gaussian density is obtained as

$p_x(\xi) \approx \hat p_x(\xi) = \varphi(\xi)\left(1 + C_3(x)\frac{H_3(\xi)}{3!} + C_4(x)\frac{H_4(\xi)}{4!} + \ldots\right)$

where C3(x) and C4(x) are cumulants of order 3 and 4, φ(ξ) is the Gaussian density, and H3(ξ), H4(ξ) are Hermite polynomials defined by

$H_i(\xi) = \frac{(-1)^i}{\varphi(\xi)}\frac{\partial^i \varphi(\xi)}{\partial \xi^i}$

Mathematical preliminaries

Entropy can now be estimated from

$H(x) \approx -\int \hat p_x(\xi)\log\hat p_x(\xi)\, d\xi$

Assuming a small deviation from Gaussianity, expanding $\log(1+\varepsilon) \approx \varepsilon - \varepsilon^2/2$ and using the orthogonality of the Hermite polynomials with respect to $\varphi(\xi)$, we obtain

$H(x) \approx -\int \varphi(\xi)\log\varphi(\xi)\, d\xi - \frac{C_3^2(x)}{2\cdot 3!} - \frac{C_4^2(x)}{2\cdot 4!}$

and a computationally simple approximation of negentropy is obtained as

$J(x) \approx \frac{1}{12}E\left\{x^3\right\}^2 + \frac{1}{48}\mathrm{kurt}(x)^2$
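
A minimal sketch evaluating this approximation on data (the test signal is an arbitrary non-Gaussian choice):

MATLAB code:
T = 100000;
x = sign(randn(1,T)).*randn(1,T).^2;   % some non-Gaussian test signal
x = (x - mean(x))/std(x);              % standardize: zero mean, unit variance
kx = mean(x.^4) - 3;                   % kurt(x) for a standardized signal
Japprox = mean(x.^3)^2/12 + kx^2/48;   % negentropy approximation
fprintf('negentropy approx: %.4f\n', Japprox);   % near 0 for a Gaussian x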

Mathematical preliminaries

Principal Component Analysis (PCA) and whitening (batch and online). PCA is basically a decorrelation based transform used in multivariate data analysis. In connection with ICA it is very often a useful preprocessing step, used in the whitening transformation after which the multivariate data become uncorrelated with unit variance. The PCA and/or whitening transform is designed on the basis of the eigenvector decomposition of the sample data covariance matrix

$\mathbf{R}_{xx} \approx \frac{1}{K}\sum_{k=1}^{K}\mathbf{x}(k)\,\mathbf{x}(k)^{T}$

It is assumed that the data x are zero mean. If not, this is achieved by x ← x - E{x}. The eigendecomposition of Rxx is obtained as

$\mathbf{R}_{xx} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{T}$

where E is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of Rxx.

Mathematical preliminaries

The batch form of the PCA/whitening transform is obtained as

$\mathbf{z} = \mathbf{V}\mathbf{x} = \boldsymbol{\Lambda}^{-1/2}\mathbf{E}^{T}\mathbf{x}$

and one verifies

$E\left[\mathbf{z}\mathbf{z}^{T}\right] = \boldsymbol{\Lambda}^{-1/2}\mathbf{E}^{T}E\left[\mathbf{x}\mathbf{x}^{T}\right]\mathbf{E}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\mathbf{E}^{T}\mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{T}\mathbf{E}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{-1/2} = \mathbf{I}$

An adaptive whitening transform is obtained as

$\mathbf{z}_k = \mathbf{V}_k\mathbf{x}_k$
$\mathbf{V}_{k+1} = \mathbf{V}_k - \mu_k\left[\mathbf{z}_k\mathbf{z}_k^{T} - \mathbf{I}\right]\mathbf{V}_k$

where µ is a small learning gain and V0 = I.
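
A minimal sketch of the batch whitening transform (the sources and the mixing matrix are arbitrary choices):

MATLAB code:
T = 10000;
s = [2*randn(1,T); 3*randn(1,T)];  % two uncorrelated zero mean sources
x = [1 1; 1 2]*s;                  % correlated mixtures
x = x - mean(x,2);                 % remove the mean
Rxx = (x*x')/T;                    % sample covariance matrix
[E,L] = eig(Rxx);                  % eigendecomposition Rxx = E*L*E'
V = L^(-1/2)*E';                   % whitening transform
z = V*x;
disp((z*z')/T);                    % approximately the identity matrix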

Mathematical preliminaries

Scatter plots of two uncorrelated Gaussian signals (left); two correlated signals obtained as linear combinations of the uncorrelated Gaussian signals (center); the two signals after the PCA/whitening transform (right).

[Figure panels: x1 = N(0,4), x2 = N(0,9) (left); z1 = x1 + x2, z2 = x1 + 2x2 (center); y = Λ-1/2ETz, z = [z1; z2] (right).]

Mathematical preliminaries

[Figure: source signals s1, s2; mixtures x1 = 2s1 + s2, x2 = s1 + s2; separated outputs y1 ≅ s1 (?), y2 ≅ s2 (?).]

What is ICA?

Imagine a situation in which two microphones record weighted sums of two signals emitted by a speaker and background noise:

x1 = a11s1 + a12s2
x2 = a21s1 + a22s2

The problem is to estimate the speech signal (s1) and the noise signal (s2) from the observations x1 and x2.

If the mixing coefficients a11, a12, a21 and a22 were known, the problem would be solvable by simple matrix inversion.

ICA makes it possible to estimate the speech signal (s1) and the noise signal (s2) without knowing the mixing coefficients a11, a12, a21 and a22. This is why the problem of recovering the source signals s1 and s2 is called the blind source separation problem.
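
To make the known-mixing-matrix case concrete, a minimal sketch (the values of A and the source models are illustrative):

MATLAB code:
T = 1000;
s = [randn(1,T); rand(1,T)-0.5];   % stand-ins for speech and noise
A = [0.8 0.3; 0.4 0.9];            % known mixing coefficients a_ij
x = A*s;                           % observations x1, x2
s_hat = A\x;                       % separation by matrix inversion
disp(norm(s - s_hat));             % essentially zero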

What is ICA?9

ICA is the most frequently used method to solve the BSS problem. ICA can be considered as a theory for multichannel blind signal recovery requiring a minimum of a priori information.

Problem: x = As or x = A*s or x = f(As) or x = f(A*s)

Goal: find s based on x only (A is unknown).

W can be found such that:

y ≅ s = Wx or y ≅ s = W*x → y ≅ PΛs

9 Ch. Jutten, J. Herault, "Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture," Sig. Proc., 24(1):1-10, 1991.

What is ICA?

[Block diagram: sources s1, s2, …, sN pass through the mixing environment to the sensors, giving observations x1, x2, …, xN with x = As; the source separation algorithm W produces outputs y1, y2, …, yN with y = Wx.]

Speech from noise separation

[Figure: waveforms of the observed mixtures x and the separated outputs y.]

When does ICA work !?

• The source signals si(t) must be statistically independent: $p(\mathbf{s}) = \prod_{i=1}^{N} p_i(s_i)$.
• The source signals si(t), except possibly one, must be non-Gaussian: $C_n(s_i) \neq 0$ for some $n > 2$.
• The mixing matrix A must be nonsingular and square: $\mathbf{W} \cong \mathbf{A}^{-1}$.
• Knowledge of the physical interpretation of the mixing matrix is of crucial importance.

When does ICA work !?

Ambiguities of ICA.

a) Variances (energies) of the independent components cannot be determined. This is called the scaling indeterminacy. The reason is that, both s and A being unknown, any scalar multiplier in one of the sources can always be canceled by dividing the corresponding column of A by the same multiplier:

$\mathbf{x} = \sum_i \left(\frac{1}{\alpha_i}\mathbf{a}_i\right)\left(\alpha_i s_i\right)$

b) The order of the independent components cannot be determined. This is called the permutation indeterminacy. The reason is that the components of the source vector s and the columns of the mixing matrix A can be freely exchanged such that

x = AP-1Ps

where P is a permutation matrix, Ps is a new source vector with the original components in a different order, and AP-1 is a new unknown mixing matrix.

When does ICA work !?

To illustrate the ICA model in statistical terms we consider two independent uniformly distributed signals s1 and s2. The scatter plot in the left picture shows that there is no redundancy between them, i.e. no knowledge about s2 can be gained from s1. The right picture shows mixed signals obtained according to x = As where A = [5 10; 10 2]. Obviously there is dependence between x1 and x2: knowing the maximum or minimum of x1 makes it possible to know x2, so A could be estimated. But what if the source signals have different statistics?

[Figure: scatter plots of the source signals (left) and the mixed signals (right).]

When does ICA work !?

Consider two independent source signals s1 and s2 generated by truck and tank engines. The scatter plot in the left picture shows that there is again no redundancy between them, i.e. no knowledge about s2 can be gained from s1. But the distribution is different, i.e. sparse. The right picture shows mixed signals obtained according to x = As where A = [5 10; 10 2]. Obviously there is dependence between x1 and x2, but the edges are at different positions this time. Estimating A from the scatter plot would be very unreliable. ICA can handle different signal statistics without knowing them in advance.

[Figure: scatter plots of the source signals (left) and the mixed signals (right).]

When does ICA work !?

Whitening is only half of ICA. The whitening transform decorrelates the signals; if the signals are non-Gaussian it does not make them independent. The whitening transform is usually a useful first processing step in ICA. The second, rotation stage, achieved by a unitary matrix, can be obtained by ICA exploiting the non-Gaussianity of the signals.

[Figure: scatter plots of the source signals (left), the mixed signals (center) and the whitened signals (right).]

When does ICA work !?

Why are Gaussian variables forbidden? Suppose two independent components s1 and s2 are Gaussian. Their joint density is given by

$p(s_1, s_2) = \frac{1}{2\pi}\exp\left(-\frac{s_1^2 + s_2^2}{2}\right) = \frac{1}{2\pi}\exp\left(-\frac{\|\mathbf{s}\|^2}{2}\right)$

Assume that the source signals are mixed with an orthogonal mixing matrix, i.e. A-1 = AT. The joint density of the mixtures is given by

$p(x_1, x_2) = \frac{1}{2\pi}\exp\left(-\frac{\left\|\mathbf{A}^{T}\mathbf{x}\right\|^2}{2}\right)\left|\det\mathbf{A}^{T}\right| = \frac{1}{2\pi}\exp\left(-\frac{\|\mathbf{x}\|^2}{2}\right)$

since $\left\|\mathbf{A}^{T}\mathbf{x}\right\|^2 = \|\mathbf{x}\|^2$ and $\left|\det\mathbf{A}^{T}\right| = 1$.

The original and mixed distributions are identical, so there is no way we could infer the mixing matrix from the mixtures.

When does ICA work !?

Why are Gaussian variables forbidden? Consider a mixture of two Gaussian signals s1 = N(0,4) and s2 = N(0,9). After the whitening stage the scatter plot is the same as for the source signals. There is no redundancy left after the whitening stage, so there is nothing for ICA to do.

[Figure: scatter plots of the source signals (left), the mixed signals (center) and the whitened signals (right).]

When does ICA work !?

PCA applied to blind image separation:

MATLAB code:
Rx = cov(X');              % estimate of the data covariance matrix
[E,D] = eig(Rx);           % eigen-decomposition of the data covariance matrix
Z = E'*X;                  % PCA transform
z1 = reshape(Z(1,:),P,Q);  % transforming vector into image
figure(1); imagesc(z1);    % show first PCA image
z2 = reshape(Z(2,:),P,Q);  % transforming vector into image
figure(2); imagesc(z2);    % show second PCA image

[Images: the two PCA extracted images z1 and z2.]

Histograms of source, mixed and PCA extracted images

[Figure: histograms of the source image (left), the mixed images (center) and the PCA extracted images (right).]