Principal Component Analysis With Missing Data and Outliers
Haifeng Chen
Electrical and Computer Engineering Department, Rutgers University, Piscataway, NJ 08854
1 Introduction
Principal component analysis (PCA) [10] is a well established technique for dimensionality reduction,
and a chapter on the subject may be found in numerous texts on multivariate analysis. Examples of its
many applications include data compression, image processing, visualisation, exploratory data analysis,
pattern recognition and time series prediction. The popularity of PCA comes from three important prop-
erties. First, it is the optimal (in terms of mean squared error) linear scheme for compressing a set of high
dimensional vectors into a set of lower dimensional vectors and then reconstructing. Second, the model
parameters can be computed directly from the data – for example by diagonalizing the sample covari-
ance. Third, compression and decompression are easy operations to perform given the model parameters
– they require only matrix multiplications.
Despite these attractive features, however, PCA models have several shortcomings. One is that naive
methods for finding the principal component directions have trouble with high dimensional data or large
numbers of data points. Consider attempting to diagonalize the sample covariance matrix of N vectors in a space of d dimensions when N and d are several hundred or several thousand. Difficulties can arise both in the form of computational complexity and also data scarcity. Computing the sample covariance itself is very costly, requiring O(N d²) operations. In general it is best to avoid computing the sample covariance explicitly.
Another shortcoming of standard approaches to PCA is that it is not obvious how to deal properly
with incomplete data sets, in which some of the points are missing. Currently the incomplete points are
either discarded or completed using a variety of interpolation methods. However, such approaches are
no longer valid when a significant portion of the measurement matrix is unknown.
Typically, the training data for PCA is pre-processed in some way. But in some realistic problems where the amount of training data is huge, it becomes impractical to manually verify that all the data is 'good'. In general, training data may contain some errors from the underlying data generation method. We view these error points as "outliers". However, the standard PCA algorithm is based on the assumption that the data have not been spoiled by outliers. When outliers are present, a robust version of PCA has to be developed.
To address these drawbacks of standard PCA, many methods have been proposed in the fields of statistics, computer engineering, neural networks, etc. The purpose of this project is to give an overview of those methods and to perform some experiments showing how the improved PCA algorithms deal with missing data and outliers in high dimensional data sets. In Section 2, a brief introduction to standard PCA is presented. To deal with high dimensional data, we describe an EM algorithm for calculating principal components in Section 3. Section 4 presents PCA for data sets containing missing points. In Section 5, we give a detailed description of current robust PCA algorithms. Some experimental results are provided in Section 6.
2 Principal component analysis (PCA)
The most common derivation of PCA is in terms of a standardized linear projection which maximizes the variance in the projected space [10]. For a set of observed d-dimensional data vectors t_n, n = 1, ..., N, the q principal axes w_j, j = 1, ..., q, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors w_j are given by the q dominant eigenvectors (i.e. those with the largest associated eigenvalues λ_j) of the sample covariance matrix

    S = (1/N) Σ_{n=1}^{N} (t_n − μ)(t_n − μ)^T,    (1)

where μ is the data sample mean, such that

    S w_j = λ_j w_j.    (2)

The q principal components of an observed vector t_n are given by the vector

    x_n = W^T (t_n − μ),    (3)

where W = (w_1, w_2, ..., w_q). The variables x_n are then uncorrelated, such that their covariance matrix (1/N) Σ_{n=1}^{N} x_n x_n^T is diagonal with elements λ_j.

A complementary property of PCA, and the one most closely related to the original discussion of [17], is that, of all orthogonal linear projections (3), the principal component projection minimizes the squared reconstruction error Σ_{n=1}^{N} ||t_n − t̂_n||², where the optimal linear reconstruction of t_n is given by

    t̂_n = W x_n + μ.    (4)
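As a concrete illustration of equations (1)-(4), the following is a minimal sketch in Python/NumPy; the function name `pca` and all variable names are our own choices, not part of the cited formulation:

```python
import numpy as np

def pca(T, q):
    """Standard PCA via eigendecomposition of the sample covariance.

    T : (N, d) array of observed vectors t_n (one per row).
    q : number of principal axes to retain.
    Returns the mean mu, the (d, q) basis W, and the top eigenvalues lam.
    """
    mu = T.mean(axis=0)
    Tc = T - mu                        # centre the data
    S = Tc.T @ Tc / len(T)             # sample covariance, eq. (1)
    lam, V = np.linalg.eigh(S)         # eigenpairs, ascending order
    order = np.argsort(lam)[::-1][:q]  # q dominant eigenvectors, eq. (2)
    return mu, V[:, order], lam[order]

# principal components (eq. 3) and optimal reconstruction (eq. 4)
rng = np.random.default_rng(0)
T = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
mu, W, lam = pca(T, q=2)
X = (T - mu) @ W          # x_n = W^T (t_n - mu)
T_hat = X @ W.T + mu      # t_hat_n = W x_n + mu
```

Note that compression (eq. 3) and decompression (eq. 4) are indeed just matrix multiplications, as claimed in the introduction.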
3 EM Algorithm for PCA
In this section, we present a version of the expectation maximization (EM) algorithm [18] for learning the principal components of a data set. The algorithm does not require computing the sample covariance, and it can deal with high dimensional data more efficiently than traditional PCA. In Section 3.1 a probabilistic model for PCA is given. Based on that model, the EM algorithm is presented in Section 3.2, where its advantages are also discussed.
3.1 Probabilistic Model of PCA
Principal component analysis can be viewed as a limiting case of a particular class of linear Gaussian models. The goal of such models is to capture the covariance structure of an observed d-dimensional variable t using fewer than the d(d + 1)/2 free parameters required in a full covariance matrix. Linear Gaussian models do this by assuming that t was produced as a linear transformation of some q-dimensional latent variable x plus additive Gaussian noise. Denoting the transformation by the d × q matrix W, and the d-dimensional noise vector by ε (with covariance matrix Ψ), the generative model can be written as

    t = W x + ε.    (5)

Conventionally, x ~ N(0, I); that is, the latent variables are defined to be independent and Gaussian with unit variance. By additionally specifying the error, or noise, model to be likewise Gaussian, ε ~ N(0, Ψ), equation (5) induces a corresponding Gaussian distribution for the observations,

    t ~ N(0, W W^T + Ψ).    (6)

In order to save parameters over the direct covariance representation in d-dimensional space, it is necessary to choose q < d and also to restrict the covariance structure of the Gaussian noise ε by constraining the matrix Ψ. For example, if the shape of the noise distribution is restricted to be axis aligned (its covariance matrix is diagonal), the model is known as factor analysis.

For the case of isotropic noise Ψ = σ²I, equation (5) implies a probability distribution over t-space for a given x of the form

    p(t | x) = (2πσ²)^{−d/2} exp( −(1/2σ²) ||t − W x||² ).    (7)
Using Bayes' rule, the posterior distribution of the latent variables x given the observed t may be calculated:

    p(x | t) = (2π)^{−q/2} |σ^{−2} M|^{1/2} exp( −(1/2) (x − M^{−1} W^T t)^T (σ^{−2} M) (x − M^{−1} W^T t) ),    (8)

where the posterior covariance matrix is given by

    σ² M^{−1} = σ² ( σ² I + W^T W )^{−1},    (9)

and M = σ² I + W^T W is a q × q matrix.
3.2 EM Algorithm for PCA
Principal component analysis is a limiting case of the linear Gaussian model as the covariance of the noise ε becomes infinitesimally small and equal in all directions. Mathematically, PCA is obtained by taking the limit Ψ = lim_{σ²→0} σ²I. This has the effect of making the likelihood of a point t dominated solely by the squared distance between it and its reconstruction W x. The directions of the columns of W which minimize this error are known as the principal axes. Inference now reduces to simple least squares projection:

    p(x | t) = N( (W^T W)^{−1} W^T t, 0 ) = δ( x − (W^T W)^{−1} W^T t ).    (10)

Since the noise has become infinitesimal, the posterior over states collapses to a single point and the covariance becomes zero.

The key observation of this note is that even though the principal components can be computed explicitly, there is still an EM algorithm for learning them [18]. We can use formula (10) as the e-step to estimate the unknown states and then use (5) in the m-step to choose W. The algorithm is

    e-step:  X = (W^T W)^{−1} W^T T

    m-step:  W_new = T X^T (X X^T)^{−1}

where T is a d × N matrix of all the observed data and X is a q × N matrix of the unknown states. The columns of W will span the space of the first q principal axes. To compute the corresponding eigenvectors and eigenvalues explicitly, the data can be projected into this q-dimensional subspace and an ordered orthogonal basis for the covariance in the subspace can be constructed. Notice that the algorithm can be performed online using only a single data point at a time, so its storage requirements are only O(dq) + O(q²).
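The e-step and m-step above can be sketched in a few lines of NumPy; the function name `em_pca`, the random initialization, and the fixed iteration count are our own illustrative choices:

```python
import numpy as np

def em_pca(T, q, n_iter=100):
    """EM algorithm for PCA, following the e-/m-steps above (a sketch).

    T : (d, N) matrix of (centred) observations, one column per point.
    q : number of principal axes.
    Returns W, a (d, q) matrix whose columns span the principal subspace.
    """
    d, N = T.shape
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d, q))          # random initial subspace guess
    for _ in range(n_iter):
        # e-step: least-squares projection onto the current subspace
        X = np.linalg.solve(W.T @ W, W.T @ T)   # X = (W^T W)^-1 W^T T
        # m-step: subspace minimizing the squared reconstruction error
        W = (T @ X.T) @ np.linalg.inv(X @ X.T)  # W = T X^T (X X^T)^-1
    return W
```

As noted in the text, the converged W spans the principal subspace but its columns are not the eigenvectors themselves; an orthogonalization followed by a small PCA within the q-dimensional subspace recovers them.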
The intuition behind the algorithm is as follows: guess an orientation for the principal subspace. Fix
the guessed subspace and project the data into it to give the values of the hidden states : . Now
fix the values of the hidden states and choose the subspace orientation which minimizes the squared
reconstruction errors of the data points. For a simple two-dimensional example, we can give
a physical analogy. Imagine that we have a rod pinned at the origin which is free to rotate. Pick an
orientation for the rod. Holding the rod still, project every data point onto the rod, and attach each
projected point to its original point with a spring. Now release the rod. Repeat. The direction of the rod
represents our guess of the principal component of the dataset. The energy stored in the springs is the
reconstruction error we are trying to minimize.
In [18], it is shown that the EM algorithm always reaches a local maximum of the likelihood. Furthermore, Tipping and Bishop have shown [21] that the only stable local extremum is the global maximum, at which the true principal subspace is found; so the algorithm converges to the correct result.
The EM learning algorithm for PCA amounts to an iterative procedure for finding the subspace spanned by the q leading eigenvectors without explicit computation of the sample covariance. It is attractive for small q because its complexity is limited by O(dNq) per iteration and so depends only linearly on both the dimensionality of the data and the number of points. Methods that explicitly compute the sample covariance matrix have complexities limited by O(N d²). The EM algorithm therefore scales more favorably in cases where q is small and both d and N are large. For high dimensional data such as images, the EM algorithm is much more efficient than the traditional PCA algorithm.
4 PCA with Missing Data
During the e-step of the EM algorithm, we compute the hidden states : by projecting the observed
data into the current subspace. This minimizes the model error given the observed data and the model
parameters. Unfortunately, the data matrix is sometimes incomplete in practice. When the percentage
of missing data is very small, it is possible to replace the missing elements with the mean or an extreme
value, which is a common strategy in multivariate statistics [6]. However, such an approach is no longer
valid when a significant portion of the measurement matrix is unknown. It is not unusual for a large
portion of the matrix to be unobservable. For example, in the computer vision field, consider modeling a dodecahedron (a 12-faced polyhedron) from a sequence of segmented images. Assume that we have tracked the 12 faces over four nonsingular views. The segmented range images provide plane coordinates { p_ij | i = 1, ..., 4, j = 1, ..., 12 }, where p = (n^T, d)^T represents a plane equation with surface normal n and normal distance d to the origin. Then we may form the 16 × 12 measurement matrix

    Y = [ y_ij ],   y_ij = p_ij if face j is visible in view i,  y_ij = *  otherwise,    (11)

where every * indicates an unobservable face, since only six faces are visible from each nonsingular
view. For such data, principal component analysis with missing data (PCAMD) has to be used. Instead of estimating only x as the value which minimizes the squared distance between the point and its reconstruction, PCAMD generalizes the e-step to:

generalized e-step: For each (possibly incomplete) point t, find the unique pair of points x* and t* (such that x* lies in the current principal subspace and t* lies in the subspace defined by the known information about t) which minimize the norm ||W x* − t*||. Set the corresponding column of X to x* and the corresponding column of T to t*.

If t is complete, then t* = t and x* is found exactly as before. If not, then x* and t* are the solution to a least squares problem and can be found by, for example, QR factorization of a particular constraint matrix.
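The generalized e-step for a single point can be sketched as a masked least-squares solve; the function name, the boolean-mask interface, and the use of `lstsq` (rather than the QR-based solver mentioned above) are our own illustrative choices:

```python
import numpy as np

def generalized_e_step(t, observed, W):
    """Sketch of the generalized e-step for one (possibly incomplete) point.

    t        : (d,) data vector; entries at unobserved positions are ignored.
    observed : (d,) boolean mask marking the known entries of t.
    W        : (d, q) current principal subspace basis.
    Returns x_star (subspace coefficients) and t_star (the completed point).
    """
    Wo = W[observed]                       # rows of W at the observed entries
    # least-squares fit of the subspace to the known entries only
    x_star, *_ = np.linalg.lstsq(Wo, t[observed], rcond=None)
    t_star = t.copy()
    t_star[~observed] = (W @ x_star)[~observed]  # fill unknowns from the model
    return x_star, t_star
```

When `observed` is all True this reduces exactly to the ordinary least-squares projection of the complete-data e-step.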
In the above generalized EM algorithm, we still assume the measurements have already been centered. But in the case of missing data, especially when a significant portion of the measurement matrix is unknown, the average of the data may not be a very reliable estimate of the mean. Instead of using the centered data, some methods treat the mean as extra parameters of the optimization, such as Wiberg's method [22] described in the next section.
4.1 Wiberg’s Method
Suppose the d × N measurement matrix Y has rank r. If the data is complete and the measurement matrix filled, the problem of principal component analysis is to determine Û, Ŝ, V̂ such that

    || Y − m 1^T − Û Ŝ V̂^T ||²    (12)

is minimized, where Û and V̂ are d × r and N × r matrices with orthogonal columns, Ŝ = diag(σ_i) is an r × r diagonal matrix, m is the maximum likelihood estimate of the mean vector, and 1^T = (1, 1, ..., 1) is an N-tuple with all ones. The solution of this problem is essentially the SVD of the centered (or registered) data matrix Y − m 1^T.
If data is incomplete, we have the following minimization problem:

    min φ = (1/2) Σ_{(i,j)∈I} ( y_ij − m_i − u_i^T v_j )²,    (13)

    I = { (i, j) : y_ij is observed, 1 ≤ i ≤ d, 1 ≤ j ≤ N },

where u_i and v_j are column vectors defined by

    [ u_1, u_2, ..., u_d ]^T = Û Ŝ^{1/2}    (14)

and

    [ v_1, v_2, ..., v_N ]^T = V̂ Ŝ^{1/2}.    (15)

It is trivially true that a d × N matrix of rank r has at most r(d + N − r) independent elements, as can be seen from its LU decomposition. Hence, a necessary condition for (13) to have a unique solution is that the number of observable elements in Y, denoted p, satisfies p ≥ r(d + N − r). To sufficiently determine problem (13), more constraints are needed to normalize either the left matrix Û or the right matrix V̂.

If we write the observed part of the measurement matrix Y as a p-dimensional vector ỹ, the minimization problem can be written as

    min φ = (1/2) r̃^T r̃,    (16)

where

    r̃ = ỹ − m̃ − F ṽ = ỹ − G ũ    (17)

and

    ṽ = [ v_1^T, ..., v_N^T ]^T,   ũ = [ ũ_1^T, ..., ũ_d^T ]^T,   ũ_i = [ u_i^T, m_i ]^T,    (18)

where m̃ is a p-vector carrying the mean estimates corresponding to the entries of ỹ, F is a p × Nr matrix built entirely from the values of ũ, and G is a p × (r + 1)d matrix built entirely from the values of ṽ. To solve the minimization problem stated by (16), the derivative (with respect to ṽ and ũ) should be zero, i.e.,

    ∇φ = [ F^T F ṽ − F^T ( ỹ − m̃ ) ;  G^T G ũ − G^T ỹ ] = 0.    (19)

Obviously (19) is nonlinear, because F is a function of ũ and G is a function of ṽ. In theory, any appropriate nonlinear optimization method can be applied to solve it. However, the dimension is so high in practice that [22] used the following algorithm to solve it:
- For given ũ, we can build the matrix F and the vector m̃. Then ṽ is updated by solving a linear least-squares problem:

    ṽ = F^+ ( ỹ − m̃ ),    (20)

where F^+ is the pseudo-inverse of F.

- For given ṽ, we can also build the matrix G. Then ũ is updated by

    ũ = G^+ ỹ,    (21)

where G^+ is the pseudo-inverse of G.
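The alternation in (20)-(21) can be illustrated by a simplified row/column form of the same idea; this is a sketch in the spirit of Wiberg's method, not the exact algorithm of [22] (the function name, initialization, and per-row/per-column least-squares formulation are our own):

```python
import numpy as np

def wiberg_like_als(Y, mask, r, n_iter=50):
    """Alternating least squares for rank-r factorization with missing data.

    Alternately solves least-squares problems for the right factor V and for
    the left factor U augmented with the mean m, using observed entries only.

    Y    : (d, N) measurement matrix (unobserved entries are arbitrary).
    mask : (d, N) boolean, True where y_ij is observed.
    r    : target rank.
    """
    d, N = Y.shape
    rng = np.random.default_rng(0)
    U = rng.normal(size=(d, r))
    m = np.zeros(d)
    V = np.zeros((N, r))
    for _ in range(n_iter):
        # update each v_j from the observed entries of column j (cf. eq. 20)
        for j in range(N):
            o = mask[:, j]
            V[j], *_ = np.linalg.lstsq(U[o], Y[o, j] - m[o], rcond=None)
        # update each (u_i, m_i) from the observed entries of row i (cf. eq. 21)
        for i in range(d):
            o = mask[i, :]
            A = np.hstack([V[o], np.ones((o.sum(), 1))])  # [v_j^T, 1] rows
            sol, *_ = np.linalg.lstsq(A, Y[i, o], rcond=None)
            U[i], m[i] = sol[:r], sol[r]
    return U, V, m
```

Note that, as in Wiberg's formulation, the mean is estimated jointly with the factors rather than removed by pre-centering.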
5 PCA with Outliers
All the PCA algorithms mentioned before are based on the assumption that the data have not been spoiled by outliers. In practice, real data often contain outliers, and usually they are not easy to separate from the data set. In Section 2, we showed that traditional PCA constructs the rank-q subspace approximation to zero-mean training data that is optimal in a least-squares sense. It is commonly known that least squares techniques are not robust, in the sense that outlying measurements can arbitrarily skew the solution from the desired solution [11]. Solving this drawback of the original PCA is still an active research direction. Several methods have been proposed in the fields of statistics, neural networks, computer engineering, etc., but they all have certain limitations.
5.1 Robust PCA by Robustifying the Covariance Matrix
To cope with outliers, the most commonly used approaches in statistics [4][11][19] replace the standard estimate of the covariance matrix, S, with a robust estimator of the covariance matrix, S*. This formulation weights the mean and the outer products which form the covariance matrix. Calculating the eigenvalues and eigenvectors of this robust covariance matrix gives principal components that are robust to sample outliers. The mean and the robust covariance matrix can be calculated as

    μ = Σ_{n=1}^{N} w₁(d_n²) t_n / Σ_{n=1}^{N} w₁(d_n²),    (22)

    S* = Σ_{n=1}^{N} w₂(d_n²) (t_n − μ)(t_n − μ)^T / ( Σ_{n=1}^{N} w₂(d_n²) − 1 ),    (23)

where w₁(d_n²) and w₂(d_n²) are scalar weights, which are a function of the Mahalanobis distance

    d_n² = (t_n − μ)^T S*^{−1} (t_n − μ),    (24)

and S* is iteratively estimated. Numerous possible weight functions have been proposed (e.g. Huber's weighting coefficients [11], or w₂(d²) = (w₁(d²))² [4]). These approaches, however, weight entire data samples and are not appropriate for cases where only a few individual elements are corrupted by outliers. A related approach would be to robustly estimate each element of the covariance matrix separately, but this is not guaranteed to result in a positive definite matrix [4].

These methods, based on robust estimation of the full covariance matrix, are computationally impractical for high dimensional data such as images. Note that just computing the covariance matrix requires O(N d²) operations. Also, in some practical applications it is difficult to gather sufficient training data to guarantee that the covariance matrix is full rank.
5.2 Robust PCA by Projection Pursuit
Li and Chen [12] proposed a solution based on projection pursuit (PP). Dealing with high dimensional
data, PP searches for low dimensional projections that maximize (minimize) an objective function called
projection index. By working in the low dimensional projections, it manages to avoid the difficulty
caused by sparseness of the high-dimensional data.
Principal component analysis is actually a special PP procedure. Let t be a d-dimensional random vector with covariance Σ, and let F_a be the distribution function of a^T t, where a is a d-vector. Denote the eigenvalues of Σ by λ₁, ..., λ_d. Recall that the first principal component is the projection of t onto a certain direction; that is,

    σ( F_{a₁} ) = max_{||a||=1} σ( F_a ) = max_{||a||=1} ( a^T Σ a )^{1/2}.    (25)

It is well known that σ²( F_{a₁} ) = a₁^T Σ a₁ is the largest eigenvalue λ₁ and that a₁ is the associated eigenvector. In the subsequent steps, each new direction is constrained to be orthogonal to all previous directions. For example, the second principal component a₂ is determined by

    σ( F_{a₂} ) = max_{||a||=1, a ⊥ a₁} σ( F_a ).    (26)

When the measurement matrix contains outliers, [12] used a robust scale estimator instead of the standard scale estimator (the standard deviation) to deal with the outliers.
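The idea of maximizing a robust scale over projection directions can be illustrated crudely as follows; the brute-force random search and the choice of the MAD as the robust scale are our own illustrative simplifications (practical PP algorithms search far more cleverly, e.g. over directions defined by the data):

```python
import numpy as np

def first_robust_axis(T, n_candidates=2000, seed=0):
    """Projection-pursuit sketch for the first robust principal axis.

    Searches random unit directions and keeps the one maximizing a robust
    scale (the median absolute deviation) of the projected data, in place
    of the standard deviation of eq. (25).

    T : (N, d) data matrix.
    """
    rng = np.random.default_rng(seed)
    d = T.shape[1]
    best_a, best_s = None, -np.inf
    for _ in range(n_candidates):
        a = rng.normal(size=d)
        a /= np.linalg.norm(a)
        z = T @ a
        s = np.median(np.abs(z - np.median(z)))  # MAD: robust scale of F_a
        if s > best_s:
            best_a, best_s = a, s
    return best_a
```

Because only one-dimensional projections are ever examined, the sparseness of high-dimensional data never enters the scale estimation, which is exactly the advantage of PP cited above.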
Ammann proposed a similar idea for robust PCA using projection pursuit [1]. In his approach, the projection pursuit estimation of the eigenvectors of the covariance matrix can be expressed as follows. Determine the last principal axis a_d by minimizing

    Σ_{j=1}^{N} ρ( t_j^T a_d )    (27)

subject to the constraint ||a_d|| = 1, where t_j denotes the j-th measurement vector. Then, for k = d − 1, ..., 1, determine a_k to minimize

    Σ_{j=1}^{N} ρ( t_j^T a_k )    (28)

subject to the constraints ||a_k|| = 1 and a_k^T a_j = 0, k + 1 ≤ j ≤ d. Here ρ(·) is a robust loss function that bounds the influence of outliers. Ordinary eigenvectors are obtained by setting ρ(x) = ||x||².

5.3 Robust PCA by Self-Organizing Neural Networks
The solution of standard PCA is obtained after all the data have been collected and the sample covariance matrix S has been calculated, i.e., the approach works in batch mode. When a new sample t' is added, we have to recalculate the corresponding new covariance matrix

    S' = ( N S + t' t'^T ) / ( N + 1 ),    (29)

and then all the computations for solving (2) are repeated by solving

    S' w = λ w.    (30)

Such an approach is not suitable for real applications where data arrive incrementally or in an online fashion.

The problem can be solved by a number of existing self-organizing rules for PCA [15][16][23]. The commonly used rules are as follows:

    w(k+1) = w(k) + η(k) ( t y − w(k) y² ),    (31)

    w(k+1) = w(k) + η(k) ( t y − ( w(k) / ( w(k)^T w(k) ) ) y² ),    (32)

    w(k+1) = w(k) + η(k) [ y ( t − û ) + ( y − y' ) t ],    (33)

where y = w(k)^T t, û = y w(k), y' = w(k)^T û, and η(k) is the learning rate, which decreases to zero as k → ∞ while satisfying certain conditions, e.g.,

    Σ_k η(k) = ∞,   Σ_k η(k)² < ∞.    (34)
Each of the three rules will converge to the principal component vector w almost surely under some mild conditions, which are studied in detail in [15][16][23]. By regarding w(k) as the weight vector (i.e., the vector consisting of synapses) of a linear neuron with output y = w(k)^T t, all three rules can be considered modifications of the well-known Hebbian rule

    w(k+1) = w(k) + η(k) y t    (35)

for self-organizing the synapses of a neuron.
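Rule (31), Oja's rule, can be sketched directly; the function name, the particular learning-rate schedule (chosen to satisfy the conditions in (34)), and the epoch structure are our own illustrative choices:

```python
import numpy as np

def oja_rule(T, eta0=0.005, n_epochs=20, seed=0):
    """Online estimation of the first principal axis using rule (31):
    w <- w + eta * (t*y - w*y^2), with y = w^T t.

    T : (N, d) stream of zero-mean samples, one row per observation.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=T.shape[1])
    w /= np.linalg.norm(w)
    k = 0
    for _ in range(n_epochs):
        for t in T:
            k += 1
            eta = eta0 / (1.0 + k / 500.0)  # decaying rate, cf. eq. (34)
            y = w @ t                        # neuron output y = w^T t
            w = w + eta * (t * y - w * y * y)
    return w / np.linalg.norm(w)
```

In contrast to the batch recomputation of (29)-(30), each new sample costs only O(d), which is what makes these rules attractive for online settings.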
From the viewpoint of statistical physics, all these rules (31)(32)(33) are connected to certain energy functions. For example, rule (33) is an adaptive rule for minimizing the following energy function in a gradient descent manner:

    E(W) = Σ_{n=1}^{N} e( r_n )
         = Σ_{n=1}^{N} || t_n − W x_n ||²
         = Σ_{n=1}^{N} || t_n − W W^T t_n ||²
         = Σ_{n=1}^{N} Σ_{j=1}^{d} ( t_jn − Σ_{k=1}^{q} w_jk x_kn )²,    (36)

where x_n = W^T t_n are the linear coefficients obtained by projecting the training data onto the principal subspace, r_n = t_n − W W^T t_n is the reconstruction error vector, and e( r_n ) = r_n^T r_n is the reconstruction error of t_n.

In the case of outliers, Xu and Yuille [13] have proposed an algorithm that generalizes the energy function (36) by introducing additional binary variables that are zero when a data sample is considered an outlier. They minimize

    E_XY(W, V) = Σ_{n=1}^{N} [ V_n || t_n − W W^T t_n ||² + η ( 1 − V_n ) ]
               = Σ_{n=1}^{N} [ V_n Σ_{j=1}^{d} ( t_jn − Σ_{k=1}^{q} w_jk x_kn )² + η ( 1 − V_n ) ],    (37)

where each V_n in V = [ V₁, V₂, ..., V_N ] is a binary random variable. If V_n = 1 the sample t_n is taken into consideration; otherwise it is equivalent to discarding t_n as an outlier. The second term in (37) is a penalty term, or prior, that discourages the trivial solution where all V_n are zero. Given W, if the energy e( r_n ) of r_n = t_n − W W^T t_n is smaller than a threshold η, then the algorithm prefers to set V_n = 1, considering the sample t_n an inlier, and V_n = 0 if it is greater than or equal to η. Minimization of (37) involves a combination of discrete and continuous optimization problems, and Xu and Yuille [13] derive
a mean field approximation to the problem which, after marginalizing the binary variables, can be solved by minimizing

    E_XY(W) = − Σ_{n=1}^{N} (1/β) F_XY( r_n, β, η ),    (38)

where F_XY( r_n, β, η ) = log( 1 + e^{ −β ( ||r_n||² − η ) } ) is a function that is related to robust statistical estimators [2]. The parameter β can be varied as an annealing parameter in an attempt to avoid local minima.
Based on this reformulation of the energy function, we can derive corresponding robust versions of the adaptive self-organizing rules (31)(32)(33). For example, rule (32) changes into

    w(k+1) = w(k) + η(k) [ 1 / ( 1 + e^{ β ( ||r||² − η ) } ) ] ( t y − ( w(k) / ( w(k)^T w(k) ) ) y² ),    (39)

and rule (33) changes into

    w(k+1) = w(k) + η(k) [ 1 / ( 1 + e^{ β ( ||r||² − η ) } ) ] [ y ( t − û ) + ( y − y' ) t ].    (40)

Finally, the converged vector w is taken as the resulting principal component vector, with the effects of outliers suppressed. In addition, a byproduct can easily be obtained:

    V_n = 1 if e( r_n ) < η,   V_n = 0 otherwise,    (41)

which indicates whether t_n is an inlier ( V_n = 1 ) or an outlier ( V_n = 0 ).

5.4 Robust PCA by Weighted SVD
The approach of robust PCA by neural networks is of limited application in some practical problems, as it rejects an entire data measurement as an outlier. In some applications, outliers typically correspond to small groups of points in the measurement vector, and we seek a method that is robust to this type of outlier yet does not reject the good points in the data samples. Gabriel and Zamir [8] give a partial solution. They propose a weighted Singular Value Decomposition (SVD) technique that can be used to construct the principal subspace. In their approach, they minimize

    E(W, X) = Σ_{n=1}^{N} Σ_{j=1}^{d} s_jn ( t_jn − w_j^T x_n )²,    (42)

where w_j is a column vector containing the elements of the j-th row of W. This effectively puts a weight s_jn on every point in the training data. In related work, Greenacre [9] gives a partial solution to the problem of factorizing matrices with known weighting data by introducing the Generalized Singular Value Decomposition (GSVD). This approach applies when the known weights in (42) are separable; that is, one weight for each row and one for each column: s_jn = s_j s_n. The basic idea is to first whiten the data using the weights, perform SVD, and then un-whiten the bases. The benefit of this approach is that it takes advantage of efficient implementations of the SVD algorithm. The disadvantages are that the weights must somehow already be known and that individual point outliers are not allowed.
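The whiten/SVD/un-whiten idea for separable weights can be sketched as follows; the function name and interface are our own, and this illustrates only the separable special case described above:

```python
import numpy as np

def separable_weighted_svd(T, s_row, s_col, q):
    """Sketch of the separable-weights factorization: whiten by row and
    column weights, take the SVD, then un-whiten the bases.

    Assumes s_jn = s_row[j] * s_col[n].
    T : (d, N) data matrix; s_row : (d,) weights; s_col : (N,) weights.
    Returns W (d, q) and X (q, N) with T ~ W @ X in the weighted norm.
    """
    Dr = np.sqrt(s_row)[:, None]          # row whitening factors
    Dc = np.sqrt(s_col)[None, :]          # column whitening factors
    U, sv, Vt = np.linalg.svd(Dr * T * Dc, full_matrices=False)
    # un-whiten to obtain bases for the original (unweighted) coordinates
    W = U[:, :q] / Dr
    X = (sv[:q, None] * Vt[:q]) / Dc
    return W, X
```

With all weights equal to one this reduces to the ordinary truncated SVD, which is why the approach inherits the efficiency of standard SVD implementations.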
In the general robust case, where the weights are unknown and there may be a different weight at every point in every training sample, there is no such solution that leverages SVD [8][9], and one must solve the minimization problem with "criss-cross regressions", which involve iteratively computing dyadic (rank 1) fits using weighted least squares. The approach alternates between solving for w_j or x_n while the other is fixed; this is similar to the EM approach we discussed before, but without a probabilistic interpretation. In this spirit, Gabriel and Odoroff [8] note how the quadratic formulation in (36) is not robust to outliers and propose making the rank 1 fitting process in (42) robust. They propose a number of methods to make the criss-cross regressions robust, but they apply the approach to very low dimensional data, and their optimization methods do not scale well to very high dimensional data such as images. In related work, Croux and Filzmoser [5] use a similar idea to construct a robust matrix factorization based on a weighted L₁ norm.
5.5 De la Torre and Black's Algorithm
In the computer vision field, PCA is a popular technique for parameterizing shape, appearance, and
motion [3][20][14]. Learned PCA representations have proven useful for solving problems such as face
and object recognition, tracking, detection, and background modeling [20][14]. Typically, the training
data for PCA is pre-processed in some way (e.g. faces are aligned [14]) or is generated by some other
vision algorithm (e.g. optical flow is computed from training data [3]). As automated learning methods
are applied to more realistic problems, and the amount of training data increases, it becomes impractical
to manually verify that all the data is good. In general, training data may contain undesirable artifacts
due to occlusion (e.g. a hand in front of a face), illumination (e.g. specular reflections), image noise
(e.g. from scanning archival data), or errors from the underlying data generation method (e.g. incorrect
optical flow vectors). We view these artifacts as statistical outliers.
Due to the high dimensionality of image data, we cannot rely on the calculation of a robust covariance matrix to get the principal components. The projection based approach also suffers from a high computational cost. The approach of Xu and Yuille described in the previous section suffers from three main problems. First, a single bad pixel value can make an image lie far enough from the subspace that the entire sample is treated as an outlier (i.e. V_n = 0) and has no influence on the estimate of W. Second, Xu and Yuille use a least squares projection of the data t_n for computing the distance to the subspace; that is, the coefficients that reconstruct the data t_n are x_n = W^T t_n. These reconstruction coefficients can be arbitrarily biased by an outlier. Finally, a binary outlier process is used which either completely rejects or includes a sample.
To make robust PCA work efficiently for image data, De la Torre and Black [7] proposed a more general analog outlier process that has computational advantages and provides a connection to robust M-estimation. To address these issues they reformulate (37) as

    E(W, X, μ, L) = Σ_{n=1}^{N} Σ_{j=1}^{d} [ L_jn ( ẽ_jn² / σ_j² ) + P( L_jn ) ],    (43)

where 0 ≤ L_jn ≤ 1 is now an analog outlier process that depends on both images and pixel locations, and P( L_jn ) is a penalty function. The error is ẽ_jn = t_jn − μ_j − Σ_{k=1}^{q} w_jk x_kn, and σ = [ σ₁, σ₂, ..., σ_d ]^T specifies a scale parameter for each of the d pixel locations.
Observe that they explicitly solve for the mean μ in the estimation process. In the least-squares formulation the mean can be computed in closed form and subtracted from each column of the data matrix T. In the robust case, outliers are defined with respect to the error in the reconstructed images, which includes the mean; the mean can no longer be computed by a simple averaging procedure, and instead it is estimated (robustly) analogously to the other bases. Also, recall that PCA assumes an isotropic noise model. In the formulation here they allow the noise to vary for every row (pixel) of the data ( e_jn ~ N(0, σ_j²) ).
Exploiting the relationship between outlier processes and robust statistics [2], minimizing (43) is equivalent to minimizing the following robust energy function:

    E(W, X, μ, σ) = Σ_{n=1}^{N} e( t_n − μ − W x_n, σ )
                  = Σ_{n=1}^{N} Σ_{j=1}^{d} ρ( t_jn − μ_j − Σ_{k=1}^{q} w_jk x_kn, σ_j )    (44)

for a particular class of robust ρ-functions [2]. The robust magnitude of a vector x is defined as the sum of the robust error values for each component, that is,

    e( x, σ ) = Σ_{j=1}^{d} ρ( x_j, σ_j ).    (45)

[7] uses the Geman-McClure error function, given by

    ρ( x, σ_j ) = x² / ( x² + σ_j² ),    (46)

where σ_j is a scale parameter that controls the convexity of the robust function and determines the inlier/outlier separation. Unlike some other ρ-functions, (46) is twice differentiable, which is useful for optimization methods based on gradient descent.
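The Geman-McClure function, its influence function (the derivative appearing later in eq. (49)), and the resulting IRLS weight of eq. (48) are small enough to state directly; the function names are our own:

```python
import numpy as np

def rho(x, sigma):
    """Geman-McClure robust error, eq. (46)."""
    return x**2 / (x**2 + sigma**2)

def psi(x, sigma):
    """Influence function: derivative of rho w.r.t. x, cf. eq. (49)."""
    return 2 * x * sigma**2 / (x**2 + sigma**2)**2

def irls_weight(x, sigma):
    """IRLS weight m = psi(x, sigma) / x, cf. eq. (48)."""
    return 2 * sigma**2 / (x**2 + sigma**2)**2
```

Note that rho saturates at 1 for large residuals and the weight decays toward zero, which is precisely how gross pixel outliers lose their influence on the fit.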
While many optimization methods exist, it is useful to formulate the minimization of equation (44) as a weighted least squares problem and solve it using iteratively reweighted least squares (IRLS). Define the residual error in matrix notation as

    Ẽ = T − μ 1^T − W X.    (47)

Then, for a given σ, a matrix M ∈ R^{d×N} can be defined such that it contains positive weights for each pixel and each image. M is recalculated at each iteration as a function of the previous residuals ẽ_jn = t_jn − μ_j − Σ_{k=1}^{q} w_jk x_kn, and it is related to the influence of pixels on the solution. Each element m_jn of M will be equal to

    m_jn = ψ( ẽ_jn, σ_j ) / ẽ_jn,    (48)

where

    ψ( ẽ_jn, σ_j ) = ∂ρ( ẽ_jn, σ_j ) / ∂ẽ_jn = 2 ẽ_jn σ_j² / ( ẽ_jn² + σ_j² )²    (49)
for the Geman-McClure è -function. For an iteration of IRLS, (44) can be transformed into a weighted
least-squares problem and rewritten as:÷ � ; �S�¯��¬w�(0 ��7 �fø Þ ( )* �,+.- �/0�21l¬É1 ; : � � 6 0 � �/0�21l¬$1 ; : � � (50)( b*� +.- �/ � 1 ¹ � ) 1Ö� 6 � � � 6 0 � �/ � 1 ¹ � ) 1Ö� 6 � � � (51)
where the $W_i \in \Re^{d \times d} = \mathrm{diag}(w_i)$ are diagonal matrices containing the positive weighting coefficients for data sample $i$ (recall that $w_i$ is the $i$th column of $W$), and the $W^p \in \Re^{n \times n} = \mathrm{diag}(w^p)$ are diagonal matrices containing the weighting factors of the $p$th pixel over the whole training set. Note the symmetry of (51), where, recall, $d_i$ denotes the $i$th column of the data matrix $D$ and $d^p$ is a column vector containing its $p$th row. Observe that (51) has non-unique solutions since, for any invertible linear transformation $T$, the pair $(BT,\ T^{-1}C)$ gives the same solution (i.e. the reconstruction from the subspace is the same). This ambiguity can be resolved by imposing orthogonality on the bases, $B^T B = I$ (e.g. with Gram-Schmidt orthogonalization). To find a solution of $E(B, C, \mu, W)_{irls}$, we differentiate (51) w.r.t. $c_i$ and $\mu$, and w.r.t. $b^p$, to obtain necessary, but not sufficient, conditions for a minimum. These conditions yield the following coupled system of equations:
$$\mu = \Big(\sum_{i=1}^{n} W_i\Big)^{-1} \sum_{i=1}^{n} W_i (d_i - B c_i), \qquad (52)$$
$$(B^T W_i B)\, c_i = B^T W_i (d_i - \mu), \quad i = 1, \ldots, n, \qquad (53)$$
$$(C W^p C^T)\, b^p = C W^p (d^p - \mu_p 1_n), \quad p = 1, \ldots, d. \qquad (54)$$
Given these parameter updates, an approximate algorithm for minimizing equation (44) can employ a two-step method that minimizes $E(B, C, \mu, W)_{irls}$ using alternated least squares (ALS). Summarizing, the whole IRLS procedure works as follows.
1. Given an initial basis $B^{(0)}$ and initial coefficients $C^{(0)}$, the initial error $E^{(0)}$ can be calculated by (47).
2. The weighting matrix $W^{(1)}$ is computed by (48) and used to successively alternate between minimizing with respect to $c_i^{(1)}$, $b^{p(1)}$ and $\mu^{(1)}$ in closed form using equations (54), (53) and (52).
3. Once $c_i^{(1)}$, $b^{p(1)}$ and $\mu^{(1)}$ have converged, recompute the error $E^{(1)}$, calculate the new weighting matrix $W^{(2)}$, and proceed as in step 2 until the algorithm converges.
It is worth noting that there are several possible ways to update the parameters more efficiently than with these closed-form solutions.
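The procedure above can be sketched in numpy as follows. This is an illustrative reimplementation under simplifying assumptions (a single scalar scale sigma shared by all pixels, fixed iteration counts in place of convergence tests, a hypothetical function name), not the implementation of [7]:

```python
import numpy as np

def robust_pca_irls(D, k, sigma=1.0, n_outer=10, n_inner=5, seed=0):
    """IRLS robust subspace fit: alternate closed-form weighted updates of the
    mean mu, coefficients C and basis B, recomputing Geman-McClure weights
    from the residuals after each inner loop."""
    d, n = D.shape
    rng = np.random.default_rng(seed)
    B = np.linalg.qr(rng.standard_normal((d, k)))[0]  # random orthonormal init
    C = B.T @ D
    mu = np.zeros(d)
    for _ in range(n_outer):
        E = D - mu[:, None] - B @ C                   # residuals, eq. (47)
        W = 2.0 * sigma**2 / (E**2 + sigma**2)**2     # weights, eqs. (48)-(49)
        for _ in range(n_inner):                      # alternated weighted LS
            R = D - B @ C
            mu = (W * R).sum(axis=1) / W.sum(axis=1)  # mean update, eq. (52)
            Dc = D - mu[:, None]
            for i in range(n):                        # coefficients, eq. (53)
                Wi = W[:, i]
                C[:, i] = np.linalg.solve(B.T @ (Wi[:, None] * B),
                                          B.T @ (Wi * Dc[:, i]))
            for p in range(d):                        # basis rows, eq. (54)
                Wp = W[p]
                B[p] = np.linalg.solve(C @ (Wp[:, None] * C.T),
                                       C @ (Wp * Dc[p]))
            Q, Rq = np.linalg.qr(B)                   # keep B orthonormal;
            B, C = Q, Rq @ C                          # reconstruction unchanged
    return B, C, mu
```

In practice the scale sigma would be estimated from the data or annealed across iterations; it is held fixed here for brevity.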
6 Experimental Results
Experiments were performed to test some of the algorithms discussed in the previous sections. In Section 6.1, we use 2-dimensional and 40-dimensional data separately to show the efficiency of the EM algorithm. In Section 6.2, we use 40-dimensional data in which 20% of the entries are missing and show how Wiberg's method handles such incomplete data. The same kind of 40-dimensional data are used in the experiment of Section 6.3, but some entries are corrupted by outliers. For such data, we compare the results of robust PCA with those of standard PCA. Another experiment with real images is also provided in that section.
6.1 Testing the EM algorithm
First we use 2D synthetic Gaussian data to test the EM algorithm introduced in Section 3.2 (Figure 1). The data and the initial principal axis are shown in Figure 1(a). The first and second iterations of the principal axis are shown in Figure 1(b)(c). Compared with the result of standard PCA
Figure 1: The EM based PCA for 2D data. (a) The data and initial value of the principal axis; (b) The first iteration; (c) The second iteration; (d) The data and the principal axis by standard PCA.
(Figure 1(d)), we find that the EM algorithm converges to the correct solution in only two iterations, which is very efficient.
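The iteration being tested can be sketched compactly. A minimal numpy sketch of EM for PCA in the zero-noise limit (after Roweis [18]); the function name `em_pca` and the toy data are our own, not code from the experiments:

```python
import numpy as np

def em_pca(D, k, n_iter=10, seed=0):
    """EM for PCA in the zero-noise limit.
    E-step: solve for the coefficients given the basis.
    M-step: solve for the basis given the coefficients.
    D is d x n and assumed to be zero-mean."""
    B = np.random.default_rng(seed).standard_normal((D.shape[0], k))
    for _ in range(n_iter):
        C = np.linalg.solve(B.T @ B, B.T @ D)   # E-step
        B = D @ C.T @ np.linalg.inv(C @ C.T)    # M-step
    return np.linalg.qr(B)[0]                   # orthonormal basis of the span

# 2D example: elongated Gaussian cloud, leading axis along the first coordinate
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 500)) * np.array([[5.0], [1.0]])
X = X - X.mean(axis=1, keepdims=True)
axis = em_pca(X, 1)
print(axis.ravel())  # close to (+/-1, 0)
```

Each step only involves solving small k x k systems, which is what makes the method attractive for high-dimensional data: the sample covariance is never formed.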
In the second example, 10 data vectors were used for the PCA algorithm. Each vector contains 40-dimensional data sampled from a shifted sinusoid. The whole data set is plotted in Figure 2(a), in which each sinusoid corresponds to one data vector. Figures 2(b)(c) show the results of standard PCA: the two principal axes found by standard PCA are shown in Figure 2(b), and the signals reconstructed from those two principal components in Figure 2(c). Figures 2(d)(e) give the principal axes and reconstructed signals after the first iteration of the EM algorithm, and Figures 2(f)(g) those after the fourth iteration.
6.2 PCA with missing data
Here we again use a set of vectors formed by 10 shifted sinusoid functions. After randomly removing 20% of the data points, the data set is as shown in Figure 3(a). Obviously, standard PCA cannot handle such data because some of the pixels are unknown. We use Wiberg's algorithm to extract the two principal axes and reconstruct the data from them. Figures 3(b)(c) show
the results at the third iteration of the algorithm, and Figures 3(d)(e) the results at the fifth iteration. Note that the functions representing the estimated principal axes get smoother with every iteration. Finally, at the seventh iteration, we obtain very smooth principal axes and a perfect reconstruction of the input vectors, shown in Figures 3(f)(g).
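Wiberg's algorithm itself is more involved; as a simpler illustration of the same idea (fitting a subspace using only the observed entries), here is an alternating least-squares sketch. The function name and the 0/1 mask convention are our own assumptions:

```python
import numpy as np

def pca_missing_als(D, M, k, n_iter=50):
    """Alternating least squares on the observed entries only.
    M is a 0/1 mask (1 = observed). Solve each coefficient column from the
    observed pixels of that sample, then each basis row from the samples
    where that pixel is observed; initialize from the zero-filled data."""
    d, n = D.shape
    B = np.linalg.svd(np.where(M > 0, D, 0.0), full_matrices=False)[0][:, :k]
    C = np.zeros((k, n))
    for _ in range(n_iter):
        for i in range(n):                       # coefficients of sample i
            obs = M[:, i] > 0
            C[:, i] = np.linalg.lstsq(B[obs], D[obs, i], rcond=None)[0]
        for p in range(d):                       # basis row for pixel p
            obs = M[p] > 0
            B[p] = np.linalg.lstsq(C[:, obs].T, D[p, obs], rcond=None)[0]
    return B, C
```

For low-rank data with enough observed entries per row and column, this alternation drives the fit on the observed entries toward zero; Wiberg's method additionally exploits the structure of the problem for more reliable convergence.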
6.3 PCA with outliers
Although several robust PCA methods were described in Section 5, we use Torre and Black's algorithm introduced in Section 5.5 to show that robust PCA performs better than traditional PCA in the presence of outliers.
In the first experiment, we again use data sampled from sinusoid functions, but with 10% of the elements contaminated by outliers (Figure 4(a)). Figures 4(b)(c) depict the two principal axes and the reconstructed signals obtained by standard PCA; Figures 4(d)(e) depict those obtained by robust PCA after 30 iterations. Clearly, robust PCA gives a much more reliable reconstruction than standard PCA.
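The sensitivity exploited in this experiment can be seen directly: under a least-squares criterion, a single gross outlier can dominate the sample covariance and swing the leading principal axis. A small self-contained demonstration with illustrative numbers (not the data of Figure 4):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200)) * np.array([[5.0], [1.0]])  # axis along e1
X = X - X.mean(axis=1, keepdims=True)
u_clean = np.linalg.svd(X)[0][:, 0]   # leading principal axis of clean data

Xo = X.copy()
Xo[:, 0] = [0.0, 400.0]               # a single gross outlier along e2
Xo = Xo - Xo.mean(axis=1, keepdims=True)
u_out = np.linalg.svd(Xo)[0][:, 0]    # leading axis after contamination

# The clean axis is close to (+/-1, 0); one outlier swings it toward e2.
print(np.abs(u_clean), np.abs(u_out))
```

A robust rho-function bounds the contribution of such a point, which is why the robust fit in Figure 4 recovers the underlying signal.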
In the second experiment, we use a collection of images gathered from a static camera over the course of a day as the training set for PCA (from 'http://web.salleurl.edu/~ftorre/'). There are changes in the illumination of the static background, and 45% of the images contain people in different locations. Our purpose is to build a model of the background using PCA: we treat the people in the images as outliers and use PCA to extract the background model. The left column of Figure 5 shows examples of the training images. The middle column shows the reconstruction of each illustrated training image using standard PCA, and the right column the reconstruction obtained with the same number of robust PCA basis vectors. We find that robust PCA is able to capture the illumination changes while ignoring the people. Once we obtain the desired background model, which accounts for illumination variation, we can use it for applications such as person detection and tracking.
References
[1] L. P. Ammann. Robust singular value decompositions: A new approach to projection pursuit. J. of
Amer. Stat. Assoc., 88(422):505–514, 1993.
[2] M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International J. of Computer Vision, 19(1):57–91, 1996.
[3] M. J. Black, Y. Yacoob, A. Jepson, and D. J. Fleet. Learning parameterized models of image motion. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, volume I, pages 561–567, 1997.
[4] N. A. Campbell. Robust procedures in multivariate analysis I : Robust covariance estimation.
Applied Statistics, 29(3):231–237, 1980.
[5] C. Croux and P. Filzmoser. Robust factorization of a data matrix. In Proc. in Computational Statistics (COMPSTAT), pages 245–249, 1998.
[6] Y. Dodge. Analysis of Experiments with Missing Data. Wiley, 1985.
[7] F. de la Torre and M. J. Black. Robust principal component analysis for computer vision. In 8th International Conference on Computer Vision, volume I, pages 362–369, Vancouver, Canada, July 2001.
[8] K. R. Gabriel and S. Zamir. Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21:489–498, 1979.
[9] M. J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press : London,
1984.
[10] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of
Educational Psychology, 24:417–441, 1933.
[11] P. J. Huber. Robust Statistics. New York:Wiley, first edition, 1981.
[12] G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal components: Primary theory and Monte Carlo. J. of Amer. Stat. Assoc., 80(391):759–766, 1985.
[13] L. Xu and A. L. Yuille. Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Trans. Neural Networks, 6(1):131–143, 1995.
[14] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Trans. Pattern Anal. Machine Intell., 19(7):696–710, 1997.
[15] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol., 16:267–273,
1982.
[16] E. Oja and J. Karhunen. On stochastic approximation of eigenvectors and eigenvalues of the ex-
pectation of a random matrix. J. Math. Anal. Appl., 106:69–84, 1985.
[17] K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 6:559–572, 1901.
[18] S. Roweis. EM algorithms for PCA and SPCA. In Neural Information Processing Systems, pages 626–632, 1997.
[19] F. H. Ruymgaart. A robust principal component analysis. Journal of Multivariate Analysis, 11:485–497, 1981.
[20] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proc. European Conf. on Computer Vision, volume I, pages 484–498, 1998.
[21] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University, September 1999.
[22] T. Wiberg. Computation of principal components when data is missing. In Proc. Second Symp.
Computational Statistics, pages 229–236, 1976.
[23] L. Xu. Least mean square error reconstruction for self-organizing neural nets. Neural Networks,
6:627–648, 1993.
Figure 2: The EM based PCA for 40 dimensional data. (a) Input data; (b) Two principal axes found by the standard PCA; (c) The reconstructed signals by the standard PCA; (d) Two principal axes found in the first iteration by EM based PCA; (e) The reconstructed signals in the first iteration; (f) Two principal axes found in the fourth iteration; (g) The reconstructed signals in the fourth iteration.
Figure 3: PCA for the incomplete data set. (a) Input data, some pixels are missing; (b) Two principal axes found in the third iteration; (c) The reconstructed signals in the third iteration; (d) Two principal axes found in the fifth iteration; (e) The reconstructed signals in the fifth iteration; (f) Two principal axes found in the seventh iteration; (g) The reconstructed signals in the seventh iteration.
Figure 4: Robust PCA. (a) Input data; (b) Two principal axes found by standard PCA; (c) The reconstructed signals by standard PCA; (d) Two principal axes found by robust PCA; (e) The reconstructed signals by robust PCA.
Figure 5: Robust PCA for the image data. (a) Some of the original data; (b) PCA reconstruction; (c) Robust PCA reconstruction.