Dynamics of Learning VQ and Neural Gas
Aree Witoelar, Michael Biehl
Mathematics and Computing Science, University of Groningen, Netherlands
in collaboration with Barbara Hammer (Clausthal), Anarta Ghosh (Groningen)


Page 1:

Dynamics of Learning VQ and Neural Gas

Aree Witoelar, Michael Biehl
Mathematics and Computing Science, University of Groningen, Netherlands

in collaboration with Barbara Hammer (Clausthal), Anarta Ghosh (Groningen)

Page 2:

Dagstuhl Seminar, 25.03.2007

Outline

Vector Quantization (VQ)

Analysis of VQ Dynamics

Learning Vector Quantization (LVQ)

Summary

Page 3:

Vector Quantization

Objective: representation of (many) data with (few) prototype vectors

Assign data ξμ to nearest prototype vector wj

(by a distance measure, e.g. Euclidean)

grouping data into clusters e.g. for classification

E(W) = Σμ=1..P minj d(wj, ξμ)

each data point contributes its distance to the nearest prototype

Find optimal set W for lowest quantization error
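For concreteness, a minimal numpy sketch of this quantization error, assuming squared Euclidean distance; the function and array names are illustrative, not from the talk:

```python
import numpy as np

def quantization_error(data, prototypes):
    """E(W): each data point contributes its (squared Euclidean)
    distance to the nearest prototype."""
    # pairwise squared distances, shape (P, K)
    d = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    # keep only the distance to the closest prototype for each point
    return d.min(axis=1).sum()

# tiny usage example with random placeholder data
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))       # P = 100 points in N = 5 dimensions
prototypes = rng.normal(size=(3, 5))   # K = 3 prototypes
print(quantization_error(data, prototypes))
```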

Page 4:

Example: Winner Takes All (WTA)

• initialize K prototype vectors

• present a single example

• identify the closest prototype, i.e. the so-called winner
• move the winner even closer towards the example

• stochastic gradient descent with respect to a cost function

• prototypes at areas with high density of data
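A minimal sketch of a single WTA learning step along these lines (assuming squared Euclidean distance; the function name and learning-rate handling are illustrative, not from the talk):

```python
import numpy as np

def wta_step(prototypes, xi, eta):
    """One online Winner-Takes-All update: only the prototype closest
    to the presented example xi is moved towards it."""
    d = ((prototypes - xi) ** 2).sum(axis=1)   # distances to all prototypes
    winner = int(np.argmin(d))                 # index of the winner
    prototypes[winner] += eta * (xi - prototypes[winner])
    return winner
```

Repeating this step over a long random sequence of examples is the stochastic gradient descent on the quantization error mentioned above.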

Page 5:

Problems

Winner Takes All: sensitive to initialization.

“Winner takes most”: update according to “rank”, e.g. Neural Gas. Less sensitive to initialization?

Page 6:

(L)VQ algorithms

• intuitive

• fast, powerful algorithms

• flexible

• limited theoretical background w.r.t. convergence speed, robustness to initial conditions, etc.

Analysis of VQ Dynamics

• exact mathematical description in very high dimensions

• study of typical learning behavior

Page 7:

Model: two Gaussian clusters of high dimensional data

Random vectors ξ ∈ ℝN according to

P(ξ) = Σσ=±1 pσ P(ξ | σ),   P(ξ | σ) ∝ exp( −(ξ − ℓ Bσ)² / (2 υσ) )

i.e. each class σ forms an isotropic Gaussian cluster with mean ℓ Bσ (ℓ: separation, Bσ: unit vector) and variance υσ.

prior prob.: p+, p-

p+ + p- = 1

[Figure: projections of the data onto the (B+, B−) plane show two clusters with priors p+ and p−; the data are separable in this projection but not in projections onto other planes.]

cluster centers: B+, B- ∈ ℝN

variance: υ+, υ-

separation ℓ

separable only in two dimensions: a simple model, but not trivial

classes: σ = {+1,-1}
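A sketch of how such data could be generated numerically, assuming orthonormal center vectors B± and isotropic Gaussian clusters of variance υ± around ℓ·B±; the function and parameter names are illustrative:

```python
import numpy as np

def generate_data(P, N, ell=1.0, p_plus=0.6, var_plus=1.5, var_minus=1.0, seed=0):
    """Draw P examples {xi, sigma} from the two-cluster model in N dimensions."""
    rng = np.random.default_rng(seed)
    B_plus = np.zeros(N); B_plus[0] = 1.0      # orthonormal cluster center directions
    B_minus = np.zeros(N); B_minus[1] = 1.0
    sigma = rng.choice([+1, -1], size=P, p=[p_plus, 1.0 - p_plus])   # class labels
    centers = np.where(sigma[:, None] == +1, ell * B_plus, ell * B_minus)
    variances = np.where(sigma == +1, var_plus, var_minus)
    xi = centers + np.sqrt(variances)[:, None] * rng.normal(size=(P, N))
    return xi, sigma
```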

Page 8:

Online learning

sequence of independent random data {ξμ, σμ}, μ = 1, 2, ..., P

update of the prototype vectors ws ∈ ℝN:

  ws(μ) = ws(μ−1) + (η/N) fs[ranks, cs, σμ, ...] (ξμ − ws(μ−1))

• η/N: learning rate, step size
• fs[...]: strength and direction of the update; it depends on the rank of the prototype (ranks ∈ {1, ..., K}), the prototype class cs ∈ {+1, −1}, the data class σμ, the identity of the “winner”, etc.
• (ξμ − ws(μ−1)): moves the prototype towards the current data point

fs[...] describes the algorithm used.

Page 9:

1. Define a few characteristic quantities of the system:

  Rsσ(μ) = ws(μ) · Bσ   (projections onto the cluster centers)
  Qst(μ) = ws(μ) · wt(μ)   (lengths and overlaps of the prototypes)

  with σ ∈ {+1, −1} and s, t ∈ {1, ..., K}.

2. Derive recursion relations of these quantities for a new input. The random vector ξμ enters only through its projections hs(μ) = ws(μ−1) · ξμ and bσ(μ) = Bσ · ξμ:

  Rsσ(μ) − Rsσ(μ−1) = (η/N) fs[...] ( bσ(μ) − Rsσ(μ−1) )

  Qst(μ) − Qst(μ−1) = (η/N) [ fs[...] ( ht(μ) − Qst(μ−1) ) + ft[...] ( hs(μ) − Qst(μ−1) ) ] + (η/N)² fs[...] ft[...] ( ξμ · ξμ )

  where the last term remains of order 1/N because ξμ · ξμ = O(N).

3. Calculate the averages of these recursions over the random input data.
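In a Monte Carlo simulation the same characteristic quantities can be measured directly from the prototypes; a small sketch with illustrative names (prototypes as a K x N array, B_plus/B_minus unit vectors):

```python
import numpy as np

def order_parameters(prototypes, B_plus, B_minus):
    """R[s, sigma] = w_s . B_sigma  (projections onto the cluster centers)
       Q[s, t]     = w_s . w_t      (lengths and overlaps of the prototypes)"""
    B = np.stack([B_plus, B_minus], axis=1)    # shape (N, 2)
    R = prototypes @ B                         # shape (K, 2)
    Q = prototypes @ prototypes.T              # shape (K, K)
    return R, Q
```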

Page 10:

In the thermodynamic limit N → ∞ ...

• the characteristic quantities Rsσ, Qst self-average with respect to the random sequence of data (fluctuations vanish)

• the projections hs and bσ become correlated Gaussian quantities, completely specified in terms of their first and second conditional moments:

  ⟨hs⟩σ = ℓ Rsσ   ⟨bτ⟩σ = ℓ if τ = σ, 0 else

  ⟨hs ht⟩σ − ⟨hs⟩σ ⟨ht⟩σ = υσ Qst
  ⟨hs bτ⟩σ − ⟨hs⟩σ ⟨bτ⟩σ = υσ Rsτ
  ⟨bρ bτ⟩σ − ⟨bρ⟩σ ⟨bτ⟩σ = υσ if ρ = τ, 0 else

• define a continuous learning time t = μ/N;  μ: discrete (1, 2, ..., P),  t: continuous

Page 11:

4. Derive ordinary differential equations in the continuous learning time t:

  dRsσ/dt = η ( ⟨bσ fs⟩ − Rsσ ⟨fs⟩ )

  dQst/dt = η ( ⟨ht fs⟩ − Qst ⟨fs⟩ + ⟨hs ft⟩ − Qst ⟨ft⟩ ) + η² Σσ pσ υσ ⟨fs ft⟩σ

  (averages ⟨·⟩ over the Gaussian density of the projections hs, bσ)

5. Solve for Rsσ(t), Qst(t):
• dynamics and asymptotic behavior (t → ∞)
• quantization/generalization error
• sensitivity to initial conditions, learning rates, structure of the data
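Step 5 is typically carried out numerically. The concrete right-hand sides depend on the averages ⟨·⟩ for the chosen fs (worked out in the reference cited below); purely as a structural sketch, a simple Euler integrator with placeholder right-hand-side functions dR_dt and dQ_dt that the reader would have to supply:

```python
import numpy as np

def integrate_odes(R0, Q0, dR_dt, dQ_dt, t_max, dt=0.01):
    """Euler integration of dR/dt = dR_dt(R, Q), dQ/dt = dQ_dt(R, Q)
    from t = 0 to t = t_max; returns the trajectory of (t, R, Q)."""
    R, Q = R0.copy(), Q0.copy()
    trajectory = [(0.0, R.copy(), Q.copy())]
    steps = int(round(t_max / dt))
    for k in range(1, steps + 1):
        R, Q = R + dt * dR_dt(R, Q), Q + dt * dQ_dt(R, Q)
        trajectory.append((k * dt, R.copy(), Q.copy()))
    return trajectory
```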

Page 12:

Results: VQ, 2 prototypes

[Figure: order parameters Q11, Q22, Q12 and R1+, R1−, R2+, R2− as functions of t = μ/N.]

WTA update: only the winner ws, i.e. the prototype with the smallest distance ds = d(ws, ξμ), is moved,

  ws(μ) = ws(μ−1) + (η/N) (ξμ − ws(μ−1))

while all other prototypes remain unchanged.

Numerical integration of the ODEs

(ws(0) ≈ 0, p+ = 0.6, ℓ = 1.0, υ+ = 1.5, υ− = 1.0, η = 0.01)

[Figure: characteristic quantities (left) and quantization error E(W) (right) as functions of t.]

Page 13:

Projections of the prototypes onto the (B+, B−) plane at t = 50, for 2 and for 3 prototypes (p+ > p−).

[Figure: prototype projections RS+ vs. RS− for both cases.]

Two prototypes move to the stronger cluster.

Page 14:

Neural Gas: a winner-takes-most algorithm (3 prototypes)

  ws(μ) = ws(μ−1) + (η/N) · exp( −ranks / λ(t) ) / C(λ(t)) · (ξμ − ws(μ−1))

the update strength decreases exponentially with the rank of the prototype

λ(t) is large initially and is decreased over time (here λi = 2, λf = 10⁻²); for λ(t) → 0 the update becomes identical to WTA.

[Figure: prototype projections RS+ vs. RS− at t = 0 and t = 50, and quantization error E(W) as a function of t.]
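A sketch of one Neural Gas step of this form, with rank 0 for the winner and a simple sum normalization standing in for C(λ) (an assumption; the talk's exact normalization is not reproduced here):

```python
import numpy as np

def neural_gas_step(prototypes, xi, eta, lam):
    """Winner-takes-most update: every prototype moves towards xi, with a
    strength that decays exponentially with its distance rank."""
    d = ((prototypes - xi) ** 2).sum(axis=1)
    ranks = np.argsort(np.argsort(d))            # 0 = winner, 1 = runner-up, ...
    f = np.exp(-ranks / lam)
    f /= f.sum()                                 # simple normalization (assumption)
    prototypes += eta * f[:, None] * (xi - prototypes)
```

Annealing lam from λi down to λf over the course of learning reproduces the behaviour described above; for λ → 0 only the winner receives an appreciable update and the rule reduces to WTA.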

Page 15:

Sensitivity to initialization

[Figure: prototype projections RS+ vs. RS− at t = 0 and at t = 50 for Neural Gas and for WTA.]

Neural Gas:
• more robust w.r.t. initialization

WTA:
• (eventually) reaches the minimum of E(W)
• depends on initialization: possibly long learning times

[Figure: E(W) as a function of t, showing a “plateau” where ∇HVQ ≈ 0.]

Page 16:

Learning Vector Quantization (LVQ)

Objective: classification of data using prototype vectors

Find optimal set W for lowest generalization error

Assign data {ξ, σ}, ξ ∈ ℝN, to the nearest prototype vector (distance measure, e.g. Euclidean). The prototypes now carry class labels: W = {ws, cs}, ws ∈ ℝN, cs ∈ {+1, −1}.

A data point is misclassified if the class of its nearest prototype wj differs from its own label:

  g(cj, σ) = 1 if cj ≠ σ, 0 else

Page 17:

LVQ1: only the winner ws is updated, towards the data if the classes agree and away from it otherwise,

  ws(μ) = ws(μ−1) + (η/N) cs σμ (ξμ − ws(μ−1))   (ws the winner, cs σμ = ±1)

[Figure: prototype projections RS+ vs. RS− for two prototypes with c = {+1, −1}, and for three prototypes with c = {+1, +1, −1} or c = {+1, −1, −1}.]

To which class should the 3rd prototype be added?

no cost function related to generalization error
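A sketch of a single LVQ1 step as described above (squared Euclidean distance; prototype_labels is assumed to be an array holding the cs ∈ {+1, −1}):

```python
import numpy as np

def lvq1_step(prototypes, prototype_labels, xi, sigma, eta):
    """LVQ1: the winner is moved towards xi if its class label matches the
    data label sigma, and away from xi otherwise."""
    d = ((prototypes - xi) ** 2).sum(axis=1)
    s = int(np.argmin(d))                                  # winner
    direction = 1.0 if prototype_labels[s] == sigma else -1.0
    prototypes[s] += eta * direction * (xi - prototypes[s])
```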

Page 18:

Generalization error

(p+ = 0.6, p− = 0.4, υ+ = 1.5, υ− = 1.0)

[Figure: generalization error εg as a function of t.]

  εg = Σσ=±1 pσ ⟨ g(cj, σ) ⟩σ   (wj: nearest prototype)

i.e. the probability that a new data point is misclassified by its nearest prototype.
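Empirically, εg can be estimated by nearest-prototype classification of freshly drawn test data; a minimal sketch (this Monte Carlo estimate illustrates the definition, it is not the analytical computation used in the talk):

```python
import numpy as np

def generalization_error(prototypes, prototype_labels, xi, sigma):
    """Fraction of the data (xi, sigma) whose nearest prototype
    carries the wrong class label."""
    d = ((xi[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    predicted = np.asarray(prototype_labels)[np.argmin(d, axis=1)]
    return float(np.mean(predicted != sigma))
```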

Page 19:

Optimal decision boundary

[Figure: data density and decision boundaries in the (B+, B−) plane, p+ > p−.]

• equal variance (υ+ = υ−): linear decision boundary
• unequal variance (υ+ > υ−): K = 2 vs. optimal with K = 3

more prototypes: better approximation to the optimal decision boundary, the (hyper)plane where p+ P(ξ | σ = +1) = p− P(ξ | σ = −1)

Page 20:

Asymptotic εg

υ+ > υ− (υ+ = 0.81, υ− = 0.25)

[Figure: asymptotic generalization error εg(t → ∞) as a function of p+.]

c = {+1, +1, −1}:
• Optimal: K = 3 better
• LVQ1: K = 3 better

c = {+1, −1, −1}:
• Optimal: K = 3 equal to K = 2
• LVQ1: K = 3 worse

• best: more prototypes on the class with the larger variance
• more prototypes are not always better for LVQ1

Page 21:

Summary

dynamics of (Learning) Vector Quantization for high-dimensional data

Neural Gas: more robust w.r.t. initialization than WTA
LVQ1: more prototypes are not always better

Outlook

study different algorithms, e.g. LVQ+/-, LFM, RSLVQ
more complex models
multi-prototype, multi-class problems

Reference: M. Biehl, A. Ghosh, and B. Hammer. Dynamics and Generalization Ability of LVQ Algorithms. Journal of Machine Learning Research 8:323-360 (2007). http://jmlr.csail.mit.edu/papers/v8/biehl07a.html

Page 22:

Questions?


Page 24:

Central Limit Theorem

• Let x1, x2, …, xN be independent random numbers drawn from an arbitrary probability distribution with finite mean and variance.

• The distribution of the average of the xj approaches a normal distribution as N becomes large.

Example: a non-normal distribution p(xj). The distribution of the average (1/N) Σj=1..N xj approaches a Gaussian as N grows.

[Figure: distribution of the average for N = 1, 2, 5, 50.]
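A quick numerical illustration of this statement, using a uniform (clearly non-normal) distribution as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1, 2, 5, 50):
    # many independent averages of N uniform random numbers
    averages = rng.uniform(0, 1, size=(100_000, N)).mean(axis=1)
    # the spread shrinks like 1/sqrt(N) and the histogram of the
    # averages approaches a Gaussian shape as N grows
    print(f"N={N:2d}  mean={averages.mean():.3f}  std={averages.std():.3f}")
```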

Page 25:

Self Averaging

Monte Carlo simulations over 100 independent runs

Fluctuations decrease with a larger number of degrees of freedom N.

As N → ∞, the fluctuations vanish (the variance becomes zero).
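The same effect can be checked in a small self-contained experiment: run independent WTA simulations of the cluster model for increasing dimension N and watch the run-to-run variance of an order parameter (here Q11 = w1 · w1) shrink. This is a rough illustration with arbitrarily chosen parameters, not the exact setup behind the figure:

```python
import numpy as np

def run_once(N, t_max=20.0, eta=0.5, p_plus=0.6, ell=1.0, seed=0):
    """One WTA learning run in N dimensions; returns Q_11 = w_1 . w_1
    at learning time t = t_max (i.e. after P = t_max * N examples)."""
    rng = np.random.default_rng(seed)
    w = 1e-3 * rng.normal(size=(2, N))                   # two prototypes near the origin
    B = np.zeros((2, N)); B[0, 0] = 1.0; B[1, 1] = 1.0   # orthonormal cluster centers
    for _ in range(int(t_max * N)):
        s = 0 if rng.random() < p_plus else 1
        xi = ell * B[s] + rng.normal(size=N)             # unit-variance clusters for simplicity
        winner = np.argmin(((w - xi) ** 2).sum(axis=1))
        w[winner] += (eta / N) * (xi - w[winner])        # WTA update, step size eta/N
    return w[0] @ w[0]

for N in (50, 200, 800):
    q = [run_once(N, seed=r) for r in range(10)]         # 10 independent runs per N
    print(f"N={N:4d}  var(Q_11) = {np.var(q):.2e}")
```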

Page 26:

“LVQ+/-”: update the correct and the incorrect winner,

  wj(μ) = wj(μ−1) + (η/N) cj σμ (ξμ − wj(μ−1)),   j ∈ {s, t}

where ds = min{dk} with cs = σμ (closest prototype of the correct class, moved towards the data) and dt = min{dk} with ct ≠ σμ (closest prototype of a wrong class, moved away from the data).

strongly divergent!

p+ >> p- : strong repulsion by stronger class

to overcome divergence: e.g. early stopping (difficult in practice)

stop at εg(t) = εg,min

[Figure: εg(t) as a function of t.]
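A sketch of a single LVQ+/- step as described on this slide (squared Euclidean distance; prototype_labels is assumed to be a numpy array of the cj, with both classes represented among the prototypes):

```python
import numpy as np

def lvq_plus_minus_step(prototypes, prototype_labels, xi, sigma, eta):
    """LVQ+/-: move the closest prototype of the correct class towards xi
    and the closest prototype of a wrong class away from xi."""
    d = ((prototypes - xi) ** 2).sum(axis=1)
    correct = prototype_labels == sigma
    s = int(np.flatnonzero(correct)[np.argmin(d[correct])])     # correct winner
    t = int(np.flatnonzero(~correct)[np.argmin(d[~correct])])   # incorrect winner
    prototypes[s] += eta * (xi - prototypes[s])
    prototypes[t] -= eta * (xi - prototypes[t])
```

In practice one would monitor εg(t) and stop early, since, as noted above, the unmodified dynamics are strongly divergent.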

Page 27:

Comparison of LVQ1 and LVQ+/-

c = {+1, +1, −1}

• υ+ = υ− = 1.0: LVQ1 outperforms LVQ+/- with early stopping
• υ+ = 0.81, υ− = 0.25: LVQ+/- with early stopping outperforms LVQ1 in a certain p+ interval

[Figure: asymptotic εg as a function of p+ for both parameter settings.]

LVQ+/- performance depends on initial conditions