머피's 머신러닝: latent linear model

ML study7 번째

12.1 Factor analysis

• 이 전 장에서는 latent variable z = {1,2,..,K} 표현력의 한계An alternative is to use a vector of real-valued latent variables,zi R∈

• where W is a D×L matrix, known as the factor loading matrix, and Ψ is a D×D covariance matrix.

• We take Ψ to be diagonal, since the whole point of the model is to “force” zi to explain the correlation, rather than “baking it in” to the observation’s covariance.

• The special case in which Ψ=σ2I is called probabilistic principal components analysis or PPCA.

• The reason for this name will become apparent later.

12.1.1 FA is a low rank parameterization of an MVN

• FA can be thought of as a way of specifying a joint density model on x using a small number of parameters.

12.1 Factor analysis

• The generative process, where L=1, D=2 and Ψ is diagonal, is illustrated in Figure 12.1.

• We take an isotropic Gaussian “spray can” and slide it along the 1d line defined by wzi +μ.

• This induces an ellongated (and hence correlated) Gaussian in 2d.

12.1.2 Inference of the latent factors

• latent factors z will reveal something interesting about the data.

xi(D 차원 ) 를 넣어서 L 차원으로 매핑시킬 수 잇음training set 을 D 차원에서 L 차원으로 차원 축소

12.1.2 Inference of the latent factors

• Example

• D =11 차원 ( 엔진크기 , 실린더 수 , 가격 ,...), N =328 개의 example( 자동차 종류 ), L = 2

• 각 피쳐 ( 엔진크기 , 실린더 수 ,.. 11 개 ) 의 유닛 벡터 e1=(1,0,...,0), e2=(0,1,0,...,0) 를 저차원 공간에 사영한 것이 파란색 선 (biplot 이라고 함 )

• biplot 가까이 있는 빨간색점 ( 차량 ) 이 그 특성을 잘 가지고 있는 차

training set 을 D 차원에서 L 차원으로 차원 축소 ( 빨간색 점 )

12.1.3 Unidentifiability

• Just like with mixture models, FA is also unidentifiable

• LDA 처럼 매번 분석시마다 , z( 토픽 ) 의 순서가 바뀜• 분석 능력에는 영향을 주진 않지만 , 해석 능력에 영향을 줌

• 해결 방법 • Forcing W to be orthonormal Perhaps the cleanest solution to the identifiability problem is to force W to be or-

thonormal, and to order the columns by decreasing variance of the corresponding latent factors. This is the ap-proach adopted by PCA, which we will discuss in Section 12.2.

• orthonormal 하다는 것은 벡터들이 서로 직교한다 • 직교성을 유지하려면 ,

12.1.4 Mixtures of factor analysers

• let [the k’th linear subspace of dimensionality Lk]] be represented by Wk, for k=1:K.

• Suppose we have a latent indicator qi {1,...,K} specifying which subspace we should use to generate the data.∈

• We then sample zi from a Gaussian prior and pass it through the Wk matrix (where k=qi), and add noise.

각 데이터 Xi 가 k 개의 FA 에서 나왔다는 모델(GMM 과 비슷 )

12.1.5 EM for factor analysis models

Expected log likelihood

ESS(Expected Sufficient Statistics)

12.1.5 EM for factor analysis models

• E- step

• M-step

12.2 Principal components analysis (PCA)

• Consider the FA model where we constrain Ψ=σ2I, and W to be orthonormal.

• It can be shown (Tipping and Bishop 1999) that, as σ2 →0, this model reduces to classical (nonprobabilistic)principal components analysis( PCA),

• The version where σ2 > 0 is known as probabilistic PCA(PPCA)

proof sketch

• reconstruction error 를 줄이는 W 를 구하는 것 = z 로 사영되는 데이터의 분산이 최대가 되는 W 를 구하는 것• z 로 사영되는 데이터의 분산이 최대가 되는 W 를 lagrange multiplier 최적화로 구해본다• z 로 사영되는 데이터의 분산이 최대가 되는 W 를 구해봤더니 데이터의 empirical covariance matrix 의 [ 첫번째 ,

두번째 , 세번쨰 .. eigenvector]

proof of PCA

• wj R∈ D to denote the j’th principal direction

• xi R∈ D to denote the i’th high-dimensional observation,

• zi R∈ L to denote the i’th low-dimensional representation

• Let us start by estimating the best 1d solution,w1 R∈ D, and the corresponding projected points˜z1 R∈ N.

• So the optimal reconstruction weights are obtained by orthogonally projecting the data onto the first principal di-rection

proof of PCAx 가 z = wx 로 사영된 데이터 포인트의 분산

목적함수가 reconstruction error 를 최소화하는 것에서 사영된 점들의 분산을 최대화하는 것으로 바뀌었다

direction that maximizes the variance is an eigenvector of the covariance matrix.

proof of PCA

Optimizing wrt w1 and z1 gives the same solution as before.

The proof continues in this way. (Formally one can use induction.)

12.2.3 Singular value decomposition (SVD)

• PCA 는 SVD 와 밀접한 관계가 있다• SVD 를 돌리면 , PCA 의 해 W 를 구할 수 있다• PCA 는 결국 truncated SVD approximation 와 같다

thin SVD

SVD: example

sigular value 한개 , 두개 , 세개 쓴 근사치

SVD: example

12.2.3 Singular value decomposition (SVD)

PCA 의 해 W 는 XTX 의 eigenvectors 와 같으므로 , W=Vsvd 를 돌리면 pca 의 해가 나온다

PCA 는 결국 truncated SVD approximation 와 같다

12.2.4 Probabilistic PCA

• x 의 평균은 0, Ψ=σ2I 이고 W 가 orthogonal 한 FA 를 생각하자 .

MLE 로 구하면 ,

12.2.5 EM algorithm for PCA

• PCA 에서 Estep 은 latent 변수 Z 를 추론해 내는 것이고 FA EM 에서 etep 에서의 posterior 의 평균을 쓴다

확률모델이 아니라 공분산 없다고 침

X 가 W 가 span 하는 공간에 사영된 것

행렬 표현


• linear regression 업데이트 수식과 매우 닯았죠 • linear regression 이 데이터 점이 span 하는 열공간에 y 를 사영시키는 기하학적 의미 = 예측치와 y 의 에러

최소화 (7.3.2)

• // E-step 은 W 의 열벡터가 span 하는 열공간에 X 를 사영시키는 것

Wt-1


• M-step

• linear regression 에서 답 y 가 벡터인 경우의 linear regression

• 사영된 zi 를 피쳐벡터 , xi 를 답으로 하는 multi-output linear regression 이다• 파란색 막대에 사영된 zi 를 파란색 막대 (W) 를 돌려서 답 x( 초록색 점 ) 과의 에러가 최소화 되는 막대 방향을

찾는다 .

multi-output linear regression (Equation 7.89)


• EM 의 장점• EM can be faster• EM can be implemented in an online fashion, i.e., we can update our estimate of W

as the data streams in.

12.3.1 Model selection for FA/PPCA12.3.2 Model selection for PCA

Conclusion

• FA 는 정규분포의 x 을 (D*D paramters), 더 작은 parameter 갯수 (D*L) 로 표현한다 .

• PCA 는 FA 의 special 케이스이다• PCA 문제 의 해 W 는 Z 로 사영되는 데이터의 분산이 최대가 되게 하고

가장 큰

eigenvalue 에 대응하는 eigenvectors 이다 • SVD (X = USV’) 에서 V 는 X 의 공분산 행렬의 eigenvectors 이다 . 그러므로 W=V

머피's 머신러닝: latent linear model

Documents