A Review and Implementation of Principal Component Analysis
Taweh Beysolow II
Professor Moretti
Fordham University
Abstract
In this experiment, we examine the famous iris data set and perform principal
component analysis on the data. We want to identify the principal components that
explain the most variance within the data set. Furthermore, we discuss the
application of principal component analysis in conjunction with other data analysis
techniques. All computations were performed in Python, and all data is loaded from the
UCI machine learning repository. Rather than simply using a built-in PCA function, we
implement principal component analysis by manually performing each step, with
assistance from packages for Eigen-decomposition. In conclusion, we find that the first two
principal components, out of four in total, explain roughly 96% of the variance in the data.
I. What is Principal Component Analysis?
Principal component analysis (PCA) is an orthogonal linear transformation of data, in
which the transformed data is projected onto a new coordinate system. The transformed
data is arranged so that the first coordinate lies in the direction of greatest
variance, and each subsequent coordinate captures the greatest remaining variance, in
decreasing order. These coordinates themselves are the principal components of the
data. The primary purpose of principal component analysis is “to reduce the
dimensionality of a data set consisting of a large number of interrelated variables, while
retaining as much as possible of the variation present in the data set.” (Wood, pg. 2)
II. Notation
$x$ = vector of $p$ random variables; $\alpha_k$ = vector of $p$ constants

$\alpha_k' x = \sum_{j=1}^{p} \alpha_{kj} x_j$

$\Sigma$ = covariance matrix for $x$, replaced in practice by $S$, the sample covariance matrix

$\lambda_k$ = eigenvalue corresponding to the $k$-th principal component, $k = 1, 2, \ldots, p$
III. Derivation of Principal Component Analysis
Our goal is to find the linear function $\alpha_k' x$ of the random variables in the vector $x$,
with the vector of constants $\alpha_k$, that has maximum variance. This linear
function produces our principal components. However, the principal components
must be in order of decreasing variance, and each principal component must be
uncorrelated with the others.
Objective:
Maximize $\mathrm{Var}(\alpha_k' x) = \alpha_k' \Sigma \alpha_k$
We use constrained optimization, as without a constraint the value
of $\alpha_k$ could be made arbitrarily large. As such, we shall choose the following normalization
constraint:
$\alpha_k' \alpha_k = 1$
This brings us to the concept of Lagrange multipliers, which shall be the method
by which we achieve this constrained optimization.
Lagrange Multipliers in PCA
The Lagrange Multiplier method is a tool “for constrained optimization of
differentiable functions, especially for nonlinear constrained optimization.”(Huijuan,
pg.1) In particular, this is helpful for finding local maxima and minima of a respective
function subject to a given constraint. Within the context of the experiment, the Lagrange
multipliers are applied as follows:
$\alpha_k' \Sigma \alpha_k - \lambda(\alpha_k' \alpha_k - 1)$

$\frac{d}{d\alpha_k}\left[\alpha_k' \Sigma \alpha_k - \lambda(\alpha_k' \alpha_k - 1)\right] = 0$

$\Sigma \alpha_k - \lambda \alpha_k = 0$

$\Sigma \alpha_k = \lambda_k \alpha_k$
The final equation yields the eigenvector $\alpha_k$ with its corresponding
eigenvalue $\lambda_k$.
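This relationship can be checked numerically. Below is a minimal sketch (assuming NumPy; the data matrix `X` is a hypothetical illustration, not the iris data) confirming that each eigenvector of a sample covariance matrix $S$ satisfies $S\alpha = \lambda\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # hypothetical data: 100 observations, 3 variables
S = np.cov(X, rowvar=False)            # 3x3 sample covariance matrix

# eigh is appropriate here because a covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(S)

# Each column alpha of `eigenvectors` satisfies S @ alpha = lambda * alpha
for lam, alpha in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(S @ alpha, lam * alpha)
```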
What are Eigenvalues and Eigenvectors?
An eigenvalue is a number derived from a square matrix, which corresponds to a
specific eigenvector of that same matrix. Together, they “provide the
Eigen-decomposition of a matrix.” (Abdi, pg. 1) Put plainly, the Eigen-decomposition
of a matrix expresses the matrix in terms of its eigenvectors and their
corresponding eigenvalues. Eigen-decomposition is important because it is a “method by
which we can find the maximum (or minimum) of functions involving matrices.” (Abdi,
pg. 1) In this context, it is the method by which we find the principal components in
order of decreasing variance.
Eigen-decomposition
$Au = \lambda u$

$(A - \lambda I)u = 0$

where

$A$ = a square matrix,

$u$ = an eigenvector of $A$ (a vector whose direction is unchanged when multiplied by $A$; only its length is scaled by $\lambda$)

Assume that

$A = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$; therefore

$u_1 = \begin{pmatrix} 3 \\ 2 \end{pmatrix}, \quad \lambda_1 = 4$

$u_2 = \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \quad \lambda_2 = -1$

For most applications, the eigenvectors are normalized to unit length:

$u'u = 1$
Furthermore, the eigenvectors of $A$ are collected as the columns of a matrix $U$,
and the eigenvalues are stored on the diagonal of a diagonal matrix $\Lambda$.
Thus, we rewrite the first equation
accordingly:
$AU = U\Lambda$

$A = U \Lambda U^{-1} = \begin{pmatrix} 3 & -1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 4 & 0 \\ 0 & -1 \end{pmatrix} \frac{1}{10}\begin{pmatrix} 2 & 2 \\ -4 & 6 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$
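The worked example above can be verified in a few lines of NumPy (a sketch; note that `numpy.linalg.eig` may return the eigenpairs in a different order and with unit-normalized eigenvectors):

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# The unnormalized eigenvectors [3, 2] and [-1, 1] satisfy Au = lambda * u
assert np.allclose(A @ np.array([3.0, 2.0]), 4 * np.array([3.0, 2.0]))
assert np.allclose(A @ np.array([-1.0, 1.0]), -1 * np.array([-1.0, 1.0]))

# Eigen-decomposition: A = U Lam U^{-1}, with eigenvalues 4 and -1
eigenvalues, U = np.linalg.eig(A)
Lam = np.diag(eigenvalues)
assert np.allclose(U @ Lam @ np.linalg.inv(U), A)
assert np.allclose(sorted(eigenvalues), [-1.0, 4.0])
```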
Moving forward, as we have mentioned, our objective is to maximize $\lambda_k$, with
the eigenvectors ordered by decreasing eigenvalue. If $\lambda_1$ is the largest eigenvalue, then the first
principal component is defined by

$\Sigma \alpha_1 = \lambda_1 \alpha_1$

In general, we define $\alpha_k' x$ as the k-th principal component of $x$, and
the variance of a given principal component is given by its corresponding eigenvalue. We shall
now demonstrate this process when k = 2 and when k > 2.
2nd and K-th Principal Component
The second principal component maximizes the variance subject to being
uncorrelated with the first principal component. The non-correlation constraint is
expressed as the following:

$\mathrm{cov}(\alpha_1' x, \alpha_2' x) = \alpha_1' \Sigma \alpha_2 = \alpha_2' \Sigma \alpha_1 = \alpha_2' \lambda_1 \alpha_1 = \lambda_1 \alpha_2' \alpha_1 = \lambda_1 \alpha_1' \alpha_2 = 0$

$\alpha_2' \Sigma \alpha_2 - \lambda(\alpha_2' \alpha_2 - 1) - \phi \alpha_2' \alpha_1$

$\frac{d}{d\alpha_2}\left[\alpha_2' \Sigma \alpha_2 - \lambda(\alpha_2' \alpha_2 - 1) - \phi \alpha_2' \alpha_1\right] = 0$

$\Sigma \alpha_2 - \lambda \alpha_2 - \phi \alpha_1 = 0$

Left-multiplying by $\alpha_1'$ gives

$\alpha_1' \Sigma \alpha_2 - \lambda \alpha_1' \alpha_2 - \phi \alpha_1' \alpha_1 = 0$

$0 - 0 - \phi \cdot 1 = 0$

$\phi = 0$

$\Sigma \alpha_2 - \lambda \alpha_2 = 0$
This process can be repeated up to k = p, yielding principal components for each
of the p random variables.
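In practice, all p components are obtained at once by eigendecomposing S and sorting the eigenpairs by decreasing eigenvalue. A minimal sketch (assuming NumPy; `X` here is an arbitrary stand-in data matrix):

```python
import numpy as np

def principal_components(X):
    """Return eigenvalues and eigenvectors of the covariance of X,
    sorted in decreasing-variance order (column k = k-th component)."""
    S = np.cov(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    order = np.argsort(eigenvalues)[::-1]       # largest eigenvalue first
    return eigenvalues[order], eigenvectors[:, order]

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                    # hypothetical 4-variable data
lams, alphas = principal_components(X)

# Variances come out in decreasing order
assert all(lams[k] >= lams[k + 1] for k in range(len(lams) - 1))
# The components are mutually orthogonal: alpha_j' alpha_k = 0 for j != k
assert np.allclose(alphas.T @ alphas, np.eye(4))
```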
IV. Data
For this experiment, we shall be using Ronald Fisher’s Iris flower data set, originally
collected by Edgar Anderson to study the variation of three iris species. Our objective is
to determine which principal components explain the most variance in this data set.
There are a total of 150 observations, 50 of each of the three species of flower. The
species and observed variables are:
Species
• Iris-Setosa
• Iris-Virginica
• Iris-Versicolor
Variables
• Sepal Length
• Sepal Width
• Petal Length
• Petal Width
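The UCI file `iris.data` is a plain CSV with the four measurements followed by the species label. A sketch of reading it with pandas (the three rows below are illustrative measurements in the file's format, standing in for the full 150-row download; the column names are our own labels):

```python
import io
import pandas as pd

# Illustrative rows in the UCI iris.data format (sepal length, sepal width,
# petal length, petal width, species); the real file has 150 such rows.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.read_csv(sample, header=None, names=columns)

assert iris.shape == (3, 5)
```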
V. Experiment

When performing initial exploratory analysis on our data, we notice the following:
the data exhibits very high variance within and between species
with respect to sepal length and sepal width, but is considerably less variable between
species, and moderately variable within species, when observing petal length and petal
width. This will be a point of interest to keep in mind for later, but for now let us move
on to describing the implementation as performed here. After we load our data into a
variable within Python, we standardize our values (mean = 0, variance = 1), then we calculate
the covariance matrix for X:
$S = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})'$
Generally speaking, we want to standardize values when they are not measured on
the same scale. Although in this experiment all of the variables are measured in
centimeters, it is still advisable to do so. Moving forward, we perform the
Eigen-decomposition and obtain the eigenvalues and eigenvectors. After we sort the
eigenvectors by decreasing eigenvalue, we observe the following:
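The proportion of variance explained by each component is its eigenvalue divided by the sum of all eigenvalues. A sketch (assuming NumPy; the eigenvalues listed are approximate values for the standardized iris covariance matrix):

```python
import numpy as np

def explained_variance_ratio(eigenvalues):
    """Proportion of total variance carried by each principal component."""
    lams = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return lams / lams.sum()

# Approximate eigenvalues of the standardized iris covariance matrix
lams = [2.918, 0.914, 0.147, 0.021]
ratios = explained_variance_ratio(lams)

# The first two components carry roughly 96% of the total variance
assert ratios[0] + ratios[1] > 0.95
```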
As we can see, the first two principal components explain the vast majority of the
variance within the data set. As pointed out earlier, the high variability in sepal
length and sepal width between and within species foreshadowed this result. Finally,
we project the transformed data onto the new feature space:
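The projection itself is a matrix product of the standardized data with the matrix whose columns are the top two eigenvectors. A sketch (assuming NumPy; `X` is a random stand-in for the standardized iris matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))          # stand-in for the standardized iris data
S = np.cov(X, rowvar=False)
lams, vecs = np.linalg.eigh(S)
order = np.argsort(lams)[::-1]
W = vecs[:, order[:2]]                 # 4x2 matrix: top two eigenvectors as columns

Y = X @ W                              # projected data: 150 observations in 2-D
assert Y.shape == (150, 2)
```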
VI. Conclusion and Comments
We observe that instead of a 4-dimensional plot, as we would have originally had,
we are now looking at a very familiar xy-plot. For exploratory analysis purposes, this
brings considerable ease both visually and analytically. It is easy to see that Iris-virginica
and Iris-versicolor show considerable similarities with respect to their sepal length and sepal
width properties. In contrast, Iris-setosa appears to be considerably distinct. As
for further applications of PCA, it is often used in regression analysis to determine which
variables should be included in a model, in neuroscience to identify properties of
stimuli, and in other settings. As shown above, both in theory and application, principal
component analysis provides a robust and excellent method of simplifying very complex
data into simpler forms.
VII. References
1. Wood, F. (2009, September). Principal Component Analysis. Retrieved from http://www.stat.columbia.edu/~fwood/Teaching/w4315/Fall2009/pca.pdf
2. Abdi, H. (2007). The Eigen-Decomposition. Retrieved from https://www.utdallas.edu/~herve/Abdi-EVD2007-pretty.pdf
3. Huijuan, L. (2008, September 28). Lagrange Multipliers and their Applications. Retrieved from http://sces.phys.utk.edu/~moreo/mm08/method_HLi.pdf