inverse regression approach to (robust) non-linear high-to-low … · 2019-12-20 · inverse...

Inverse regression approach to (robust) non-linear high-to-low-dimensional mapping

Emeline Perthame

Joint work with Florence Forbes

INRIA, team MISTIS, Grenoble

LMNO, Caen

October 27, 2016

1 / 25

Outline

1. Non linear mapping problem

2. GLLiM/SLLiM: inverse regression approach

3. Estimation of parameters

4. Results and conclusion

2 / 25

Outline

1. Non linear mapping problem

3 / 25

A non linear mapping problem

• A non linear mapping problem

$y = (y_1, \dots, y_D)^\top \;\xrightarrow{\; g(y) \;}\; (x_1, \dots, x_L)^\top = x$

• Prediction of X from Y through a non linear regression function g

$E(X \mid Y = y) = g(y)$

with $Y \in \mathbb{R}^D$, $X \in \mathbb{R}^L$, $D \gg L$

4 / 25

A non linear mapping problem

• Application: Ω mission on Mars → launch of a spectrometer around Mars

• Problem: Retrieving physical properties from hyperspectral images

− Y: spectrum (D=184)

− X: composition of the ground (L=3)

Mars Express - Omega (2004) [http://geops.geol.u-psud.fr/]

[Figure: example spectrum, reflectance (vertical axis) against wavelength (horizontal axis); labels: prop. of dust, prop. of CO2 ice, prop. of water ice]

5 / 25

Some approaches

• Difficulty: D large → curse of dimensionality

• Solutions: via dimensionality reduction

− Reduce the dimension of y before regression, e.g. PCA on y

→ Risk: poor prediction of x

− Take x into account: PLS, SIR, Kernel SIR, PC-based methods

→ Two-step approaches, not expressed as a single optimization problem

→ Our approach: inverse regression to reduce dimension

6 / 25

Outline

2. GLLiM/SLLiM: inverse regression approach

7 / 25

Proposed Method: An inverse regression strategy

• $x \in \mathbb{R}^L$: low-dimensional space,

• $y \in \mathbb{R}^D$: high-dimensional space,

• $(y, x)$ are realizations of $(Y, X) \sim p(Y, X; \theta)$, with $\theta$ the parameters

Inverse conditional density: $p(Y \mid X; \theta)$

• Y is a noisy function of X

• Modeled via mixtures → tractable estimation of $\theta$

Forward conditional density: $p(X \mid Y; \theta^*)$, with $\theta^* = f(\theta)$

→ High-to-low prediction, e.g. $\hat{x} = E[X \mid Y = y; \theta^*]$

8 / 25

Student Locally-linear Mapping (SLLiM)

A piecewise affine model:

• Introduce a missing variable Z → Z = k ⇔ Y is the image of X by an affine transformation

$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k X + b_k + E_k)$

Definition of SLLiM

$p(Y \mid X, Z = k; \theta) = \mathcal{S}(Y; A_k X + b_k, \Sigma_k, \alpha_k^y, \gamma_k^y)$

• Affine transformations are local: mixture of K Student laws

$p(X \mid Z = k; \theta) = \mathcal{S}(X; c_k, \Gamma_k, \alpha_k, 1)$

$p(Z = k; \theta) = \pi_k$

• The set of all model parameters is:

$\theta = \{\pi_k, c_k, \Gamma_k, A_k, b_k, \Sigma_k, \alpha_k\}_{k=1}^{K}$
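The generative reading of this model can be sketched for the scalar case L = D = 1. This is only an illustration under stated assumptions: it uses the Gaussian scale mixture view of the Student law, with one Gamma weight u shared by X and the noise within a component (in line with a joint Student density for (X, Y)). The function name `sample_sllim` and every parameter value are hypothetical.

```python
import math
import random

def sample_sllim(n, pi, c, Gamma, a, b, sigma, alpha, seed=0):
    """Draw n triples (k, x, y) from a scalar piecewise affine Student model.

    pi, c, Gamma, a, b, sigma, alpha are lists of length K (illustrative names).
    Assumption: one Gamma(alpha_k, 1) weight u scales both variances.
    """
    rng = random.Random(seed)
    K = len(pi)
    samples = []
    for _ in range(n):
        k = rng.choices(range(K), weights=pi)[0]             # Z = k with probability pi_k
        u = rng.gammavariate(alpha[k], 1.0)                  # latent weight U (shape alpha_k, scale 1)
        x = c[k] + rng.gauss(0.0, math.sqrt(Gamma[k] / u))   # X | Z = k, heavy-tailed around c_k
        e = rng.gauss(0.0, math.sqrt(sigma[k] / u))          # noise term E_k
        y = a[k] * x + b[k] + e                              # Y = A_k X + b_k + E_k
        samples.append((k, x, y))
    return samples

# Two local affine maps with made-up parameters.
pairs = sample_sllim(5, pi=[0.5, 0.5], c=[-2.0, 2.0], Gamma=[1.0, 1.0],
                     a=[1.0, -1.0], b=[0.0, 3.0], sigma=[0.1, 0.1], alpha=[2.0, 2.0])
```

Small values of u produce the occasional far-away point, which is how the heavy tails (and hence the tolerance to outliers) enter the model.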

9 / 25

Why a Student mixture ?

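• Dealing with outliers → Generalized Student distribution for the joint density of (X, Y)

$\mathcal{S}_M(y; \mu, \Sigma, \alpha, \gamma) = \frac{\Gamma(\alpha + M/2)}{|\Sigma|^{1/2}\, \Gamma(\alpha)\, (2\pi\gamma)^{M/2}} \left[1 + \delta(y, \mu, \Sigma)/(2\gamma)\right]^{-(\alpha + M/2)}$

• Gaussian scale mixture representation (using a weight variable U distributed according to a Gamma distribution)

$\mathcal{S}_M(y; \mu, \Sigma, \alpha, \gamma) = \int_0^{\infty} \mathcal{N}_M(y; \mu, \Sigma/u)\, \mathcal{G}(u; \alpha, \gamma)\, du$

• Parameter estimation is tractable by an EM algorithm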
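The equivalence between the closed-form density and its scale mixture representation can be checked numerically in one dimension (M = 1). The sketch below assumes that δ is the squared Mahalanobis distance and that G(u; α, γ) is a Gamma density with shape α and rate γ; the function names and the midpoint quadrature are illustrative choices, not part of the talk.

```python
import math

def student_pdf(y, mu, var, alpha, gamma):
    """Generalized Student density S_1(y; mu, var, alpha, gamma) for M = 1."""
    delta = (y - mu) ** 2 / var                      # squared Mahalanobis distance (assumed)
    log_norm = (math.lgamma(alpha + 0.5) - math.lgamma(alpha)
                - 0.5 * math.log(2.0 * math.pi * gamma * var))
    return math.exp(log_norm - (alpha + 0.5) * math.log1p(delta / (2.0 * gamma)))

def scale_mixture_pdf(y, mu, var, alpha, gamma, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the integral over u of N(y; mu, var/u) G(u; alpha, gamma)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        normal = math.sqrt(u / (2.0 * math.pi * var)) * math.exp(-0.5 * u * (y - mu) ** 2 / var)
        gamma_pdf = math.exp(alpha * math.log(gamma) + (alpha - 1.0) * math.log(u)
                             - gamma * u - math.lgamma(alpha))
        total += normal * gamma_pdf * h
    return total

closed_form = student_pdf(1.5, 0.0, 1.0, alpha=2.0, gamma=1.0)
numerical = scale_mixture_pdf(1.5, 0.0, 1.0, alpha=2.0, gamma=1.0)
print(closed_form, numerical)   # the two values should agree to several decimals
```

The quadrature is only meant for this check with α ≥ 1; for small α the Gamma density is singular at u = 0 and a more careful integration scheme would be needed.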

[Figure: density (vertical axis) against x (horizontal axis) for a Gaussian and a Student distribution with α = 0.1]

10 / 25

Low-to-high (Inverse) Regression

• If X and Y are both observed

− The parameter vector θ can be estimated in closed form using an EM inference procedure

− This yields the inverse conditional density, which is a Student mixture:

$p(Y \mid X; \theta) = \sum_{k=1}^{K} \frac{\pi_k\, \mathcal{S}(X; c_k, \Gamma_k, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\, \mathcal{S}(X; c_j, \Gamma_j, \alpha_j, 1)}\, \mathcal{S}(Y; A_k X + b_k, \Sigma_k, \alpha_k^y, \gamma_k^y)$

• Both conditional densities (inverse and forward) are Student mixtures parameterized by θ. This gives:

− A low-to-high inverse regression function:

$E[Y \mid X = x; \theta] = \sum_{k=1}^{K} \frac{\pi_k\, \mathcal{S}(x; c_k, \Gamma_k, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j\, \mathcal{S}(x; c_j, \Gamma_j, \alpha_j, 1)}\, (A_k x + b_k).$

11 / 25

High-to-low (Forward) Regression

• The forward conditional density is a Student mixture as well:

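$p(X \mid Y; \theta^*) = \sum_{k=1}^{K} \frac{\pi_k^*\, \mathcal{S}(Y; c_k^*, \Gamma_k^*, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j^*\, \mathcal{S}(Y; c_j^*, \Gamma_j^*, \alpha_j, 1)}\, \mathcal{S}(X; A_k^* Y + b_k^*, \Sigma_k^*, \alpha_k^x, \gamma_k^x)$

• The forward parameter vector θ∗ has an analytic expression as a function of θ

• Both conditional densities (inverse and forward) are Student mixtures parameterized by θ. This gives:

− A high-to-low forward regression function:

$E[X \mid Y = y; \theta^*] = \sum_{k=1}^{K} \frac{\pi_k^*\, \mathcal{S}(y; c_k^*, \Gamma_k^*, \alpha_k, 1)}{\sum_{j=1}^{K} \pi_j^*\, \mathcal{S}(y; c_j^*, \Gamma_j^*, \alpha_j, 1)}\, (A_k^* y + b_k^*).$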
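The forward regression function is a responsibility-weighted average of K local affine predictions. A minimal sketch for scalar data (L = D = 1) follows; it assumes the forward parameters are already available, takes δ as the squared Mahalanobis distance, and uses hypothetical names (`forward_predict`, `pi_star`, ...) and made-up values.

```python
import math

def student_pdf(y, mu, var, alpha, gamma=1.0):
    """Scalar generalized Student density S_1(y; mu, var, alpha, gamma)."""
    delta = (y - mu) ** 2 / var
    log_norm = (math.lgamma(alpha + 0.5) - math.lgamma(alpha)
                - 0.5 * math.log(2.0 * math.pi * gamma * var))
    return math.exp(log_norm - (alpha + 0.5) * math.log1p(delta / (2.0 * gamma)))

def forward_predict(y, pi_star, c_star, Gamma_star, A_star, b_star, alpha):
    """E[X | Y = y]: mixture weights times the local affine maps A*_k y + b*_k."""
    dens = [p * student_pdf(y, c, G, al)
            for p, c, G, al in zip(pi_star, c_star, Gamma_star, alpha)]
    total = sum(dens)
    return sum(d / total * (A * y + b) for d, A, b in zip(dens, A_star, b_star))

# Two well-separated components with made-up forward parameters.
x_hat = forward_predict(-5.0, pi_star=[0.5, 0.5], c_star=[-5.0, 5.0],
                        Gamma_star=[0.5, 0.5], A_star=[1.0, -1.0],
                        b_star=[0.0, 0.0], alpha=[2.0, 2.0])
```

Near the centre of one component its weight is close to 1, so the prediction reduces to that component's affine map; between components the prediction blends the local maps.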

12 / 25

The forward parameter vector θ∗ from θ

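$c_k^* = A_k c_k + b_k$,

$\Gamma_k^* = \Sigma_k + A_k \Gamma_k A_k^\top$,

$A_k^* = \Sigma_k^* A_k^\top \Sigma_k^{-1}$,

$b_k^* = \Sigma_k^* (\Gamma_k^{-1} c_k - A_k^\top \Sigma_k^{-1} b_k)$,

$\Sigma_k^* = (\Gamma_k^{-1} + A_k^\top \Sigma_k^{-1} A_k)^{-1}$.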
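These five formulas translate directly into matrix code. The sketch below handles a single component k with NumPy; the function name `forward_params` and the random test values are assumptions made for illustration.

```python
import numpy as np

def forward_params(c, Gamma, A, b, Sigma):
    """Map one component's inverse parameters (c, Gamma, A, b, Sigma) to the forward ones."""
    Gamma_inv = np.linalg.inv(Gamma)
    Sigma_inv = np.linalg.inv(Sigma)
    c_star = A @ c + b                                     # c*_k
    Gamma_star = Sigma + A @ Gamma @ A.T                   # Gamma*_k
    Sigma_star = np.linalg.inv(Gamma_inv + A.T @ Sigma_inv @ A)   # Sigma*_k
    A_star = Sigma_star @ A.T @ Sigma_inv                  # A*_k
    b_star = Sigma_star @ (Gamma_inv @ c - A.T @ Sigma_inv @ b)   # b*_k
    return c_star, Gamma_star, A_star, b_star, Sigma_star

# Made-up parameters with L = 2 and D = 4.
rng = np.random.default_rng(0)
L, D = 2, 4
M = rng.normal(size=(L, L))
Gamma = M @ M.T + np.eye(L)                       # symmetric positive definite
Sigma = np.diag(rng.uniform(0.5, 1.5, size=D))    # diagonal noise covariance
A, b, c = rng.normal(size=(D, L)), rng.normal(size=D), rng.normal(size=L)
c_star, Gamma_star, A_star, b_star, Sigma_star = forward_params(c, Gamma, A, b, Sigma)
```

Only L × L matrices and the (possibly diagonal) Σ_k are inverted here, so no full D × D inversion is needed when Σ_k is diagonal. The output can be cross-checked against the usual conditioning identities for a joint covariance, for instance that A*_k equals Γ_k A_kᵀ (Γ*_k)⁻¹.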

13 / 25

A joint model approach to reduce the number of parameters

• Joint model

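$p(X = x, Y = y \mid Z = k) = \mathcal{S}_{L+D}\left(\begin{bmatrix} x \\ y \end{bmatrix}; m_k, V_k, \alpha_k, 1\right)$

with

$m_k = \begin{bmatrix} c_k \\ A_k c_k + b_k \end{bmatrix}$ and $V_k = \begin{bmatrix} \Gamma_k & \Gamma_k A_k^\top \\ A_k \Gamma_k & \Sigma_k + A_k \Gamma_k A_k^\top \end{bmatrix}$

• Reduce the number of parameters to estimate

− Forward strategy + Γk diagonal

∗ number of parameters = $\frac{1}{2} D(D - 1) + DL + 2L + D$

∗ D = 500, L = 2 → 126 254 parameters

− Inverse strategy + Σk diagonal

∗ number of parameters = $\frac{1}{2} L(L - 1) + DL + 2D + L$

∗ D = 500, L = 2 → 2 003 parameters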
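The two counts can be reproduced with a few lines of arithmetic. The function names below are illustrative; the formulas are the ones stated on this slide.

```python
def n_params_forward(D, L):
    """Parameter count for the forward strategy with diagonal Gamma_k (slide formula)."""
    return D * (D - 1) // 2 + D * L + 2 * L + D

def n_params_inverse(D, L):
    """Parameter count for the inverse strategy with diagonal Sigma_k (slide formula)."""
    return L * (L - 1) // 2 + D * L + 2 * D + L

print(n_params_forward(500, 2))   # 126254
print(n_params_inverse(500, 2))   # 2003
```

The quadratic term moves from D to L when the regression is inverted, which is where the saving comes from when D is much larger than L.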

14 / 25

Extension to partially observed responses

• Incorporate a latent component into the low-dimensional variable:

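$X = \begin{bmatrix} T \\ W \end{bmatrix}$, where $T \in \mathbb{R}^{L_t}$ is observed and $W \in \mathbb{R}^{L_w}$ is latent ($L = L_t + L_w$)

• Example on Mars data: lighting? temperature? grain size?

• Observed pairs $(y_n, T_n)$, $n = 1, \dots, N$ ($T \in \mathbb{R}^{L_t}$)

• Additional latent variable W ($W \in \mathbb{R}^{L_w}$)

• Assuming the independence of T and W given Z:

$p(X = (T, W)^\top \mid Z = k) = \mathcal{S}_L((T, W)^\top; c_k, \Gamma_k, \alpha_k, 1)$

with $c_k = \begin{bmatrix} c_k^t \\ 0 \end{bmatrix}$, $\Gamma_k = \begin{bmatrix} \Gamma_k^t & 0 \\ 0 & I_{L_w} \end{bmatrix}$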

15 / 25

Extension to partially observed responses

• Extension of SLLiM to more general covariance structure

• With $A_k = [A_k^t \;\; A_k^w]$,

$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k^t T + A_k^w W + b_k + E_k)$

can be rewritten as

$Y = \sum_{k=1}^{K} \mathbb{I}(Z = k)\,(A_k^t T + b_k + E'_k)$

with $\mathrm{Var}(E'_k) \propto \Sigma_k + A_k^w A_k^{w\top}$

− Diagonal $\Sigma_k$ → factor analysis with at most $L_w$ factors

− A compromise between full $O(D^2)$ and diagonal $O(D)$ covariances
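The rewriting step amounts to absorbing the latent part into the noise term. A short derivation, assuming that W and E_k are uncorrelated given Z = k and share the same scale factor, and using the zero mean and identity scale matrix of W from the block structure of c_k and Γ_k (the proportionality constant depends on the Student weights and is left unspecified):

```latex
\begin{aligned}
E'_k &= A_k^w W + E_k, \qquad E[W \mid Z = k] = 0 \;\Rightarrow\; b_k \text{ is unchanged},\\
\mathrm{Var}(E'_k) &= A_k^w \,\mathrm{Var}(W \mid Z = k)\, A_k^{w\top} + \mathrm{Var}(E_k)
\;\propto\; A_k^w A_k^{w\top} + \Sigma_k .
\end{aligned}
```

With a diagonal $\Sigma_k$, this covariance has $D + D L_w$ free entries, compared with $D(D+1)/2$ for a full covariance and $D$ for a purely diagonal one.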

16 / 25

Outline

3. Estimation of parameters

17 / 25

Estimation of $\theta = (c_k, \Gamma_k, A_k, b_k, \Sigma_k, \pi_k, \alpha_k)_{1 \le k \le K}$ by EM algorithm

• E-step

− Update posterior probabilities

(EZ) $p(Z = k \mid t, y, \theta^{(i)})$ → “SMM-like”

(EW) $p(W \mid Z = k, t, y, \theta^{(i)})$ → Probabilistic PCA or Factor Analysis like

(EU) $E(U \mid Z = k, t, y, \theta^{(i)})$ → Down-weighting of extreme/atypical values in the estimators → more robust

• M-step

(MX) $(\pi_k, c_k, \Gamma_k)$ → “SMM-like”

(MY|X) $(A_k, b_k, \Sigma_k)$ → Hybrid between linear regression and PPCA/FA

$A_k = Y_k X_k^\top \left( \begin{bmatrix} 0 & 0 \\ 0 & S_k^w \end{bmatrix} + X_k X_k^\top \right)^{-1}$

(Mα) $\alpha_k$ → Not in closed form but standard (specific to Student)
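The down-weighting effect of the (EU) step can be illustrated on a much simpler problem than the full model: a single component, scalar data, known variance, and only the location being updated. Under the Gamma scale mixture representation, the posterior of U given an observation is again a Gamma law, with mean (α + M/2) / (γ + δ/2); the sketch uses this with M = 1. The data and names are made up, and this is not the full SLLiM update, which also conditions on Z = k and on (t, y).

```python
def student_weights(values, mu, var, alpha, gamma=1.0):
    """E[U | y] for scalar data: (alpha + 1/2) / (gamma + delta / 2)."""
    return [(alpha + 0.5) / (gamma + 0.5 * (v - mu) ** 2 / var) for v in values]

def weighted_mean(values, weights):
    """Location update: observations with a small weight contribute little."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

data = [1.0, 1.2, 0.8, 1.1, 0.9, 15.0]    # the last point is an outlier (made-up data)
plain_mean = sum(data) / len(data)        # pulled toward the outlier
mu = plain_mean
for _ in range(20):                       # alternate weight and location updates
    w = student_weights(data, mu, var=0.05, alpha=2.0)
    mu = weighted_mean(data, w)

print(plain_mean, mu)                     # mu ends up close to the bulk of the data
```

The outlier receives a weight far smaller than the other points, so it barely moves the estimate; a Gaussian model corresponds to equal weights and returns the plain mean.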

18 / 25

Outline

4. Results and conclusion

19 / 25

Application L = D = 1

• RATP → Subway in Paris

• Measurement of air quality at Chatelet station, line 4

• March 2015 → N = 341 measurements

• Prediction of NO (L=1) from NO2 (D=1)

→ Robustness of SLLiM

[Figure: scatter plot of NO (vertical axis) against NO2 (horizontal axis)]

20 / 25

Application L = D = 1 / SLLiM compared to GLLiM

[Figure: two panels of NO (vertical axis) against NO2 (horizontal axis); legend: GLLiM, SLLiM]

→ Illustration of robustness of the proposed model

21 / 25

Application L = D = 1 / SLLiM compared to GLLiM

[Figure: NRMSE (vertical axis) against K from 1 to 10 (horizontal axis); legend: GLLiM, SLLiM, GLLiM-WO, SLLiM-WO]

→ SLLiM achieves better prediction rates than GLLiM on complete data

→ SLLiM becomes equivalent to GLLiM when outliers are removed

22 / 25

Other applications and augmented version of SLLiM

• Application when D ≫ L

− Hyperspectral data on Mars (D=184, L=2, N=6983)

→ Comparison with other non linear regression methods

Table: Mars data: average NRMSE, with standard deviations in parentheses, for the proportions of CO2 ice and dust over 100 runs.

Method         Prop. of CO2 ice   Prop. of dust
SLLiM (K=10)   0.168 (0.019)      0.145 (0.020)
GLLiM (K=10)   0.180 (0.023)      0.155 (0.023)
MARS           0.173 (0.016)      0.160 (0.021)
SIR            0.243 (0.025)      0.157 (0.016)
RVM            0.299 (0.021)      0.275 (0.034)
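The slides do not spell out how the NRMSE is computed. One common definition, assumed here purely for illustration, normalizes the squared prediction error by the squared deviation of the observations from a reference mean, so that a value of 1 corresponds to always predicting that mean. The function name and the toy values are hypothetical.

```python
import math

def nrmse(predicted, observed, reference_mean=None):
    """Normalized RMSE: sqrt(sum (pred - obs)^2 / sum (obs - mean)^2) (assumed definition)."""
    if reference_mean is None:
        reference_mean = sum(observed) / len(observed)   # default: mean of the observations
    num = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    den = sum((o - reference_mean) ** 2 for o in observed)
    return math.sqrt(num / den)

observed = [0.1, 0.2, 0.3, 0.4]             # made-up proportions
print(nrmse([0.25] * 4, observed))          # constant mean predictor
print(nrmse(observed, observed))            # perfect predictor
```

Under this definition, lower is better, and values below 1 indicate an improvement over the constant predictor.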

23 / 25

Results - Application to hyperspectral image analysis

[Figure: panels for GLLiM, SLLiM and Splines; rows: proportion of CO2 ice, proportion of dust]

24 / 25

Conclusion and future work

• Mixture model used for prediction

• Addition of latent variables for partially observed responses

• Selection of K and Lw

− K fixed? Or selected by BIC?

− Lw selected by BIC?

Thank you for your attention! Any questions?

25 / 25
