principal components analysis - pybay 2016


Dimensionality Reduction using Principal Components Analysis

Rumman Chowdhury, Senior Data Scientist
@ruchowdh · rummanchowdhury.com · thisismetis.com

Who is Rumman? What's a Metis?

Me: Political Science PhD, Data Scientist, Teacher, Do-Gooder. Check me out on Twitter: @ruchowdh, or on my website: rummanchowdhury.com (psst, I post cool jobs there).

What's Metis? Metis accelerates the careers of data scientists by providing full-time immersive bootcamps, evening part-time professional development courses, online training, and corporate programs.

What is PCA?

Why do we need dimensionality reduction?

Intuition behind Principal Components Analysis

Coding example

What is Principal Components Analysis?

What is PCA?

- A shift in perspective
- A reduction in the number of dimensions
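To make the two bullets concrete, here is a minimal sketch (not the talk's own code; the data is made up) showing that scikit-learn's PCA does both at once: it rotates to new axes and keeps only the first few of them.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 5-feature dataset, reduced to its 2 leading components.
X = np.random.default_rng(0).normal(size=(100, 5))
X_reduced = PCA(n_components=2).fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # (100, 5) -> (100, 2)
```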

Why do we need dimensionality reduction?

Curse of Dimensionality

One dimension: small space; being close is quite probable.

[Figure: points along a single axis, Cigarettes per day]

Curse of Dimensionality

Two dimensions: more space, but still not so much; being close is not improbable.

[Figure: scatter plot, axes Cigarettes per day × Height]

Curse of Dimensionality

Three dimensions: much larger space; being close is less probable.

[Figure: 3D scatter plot, axes Cigarettes per day × Height × Exercise]

Curse of Dimensionality

Four dimensions: omg, so much space; being close is quite improbable.

[Figure: axes Cigarettes per day × Height × Exercise × Age]

Curse of Dimensionality

Thousand dimensions: Helloooo… hellooo… helloo… Can anybody hear meee… mee… mee… mee… So alone…

Curse of Dimensionality

Thousand dimensions: I specified you with such high resolution, with so much detail, that you don’t look like anybody else anymore. You’re unique.

Curse of Dimensionality

Classification, clustering, and other analysis methods become exponentially more difficult with increasing dimensions.

To understand how to divide that huge space, we need a whole lot more data (usually much more than we have or can get).

[Figure: scatter plot, axes Cigarettes per day × Height]

Curse of Dimensionality

Lots of features with lots of data is best. But what if you don't have the luxury of ginormous amounts of data? Not all features provide the same amount of information, so we can reduce the dimensions (compress the data) without necessarily losing too much information.

[Figure: scatter plot, axes Cigarettes per day × Height]
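The "hellooo… so alone" joke has a concrete numerical face. As an illustrative sketch (assuming uniform random data; not from the talk), the average distance between points grows steadily with the number of dimensions, so "being close" really does become improbable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scatter 1,000 points uniformly in the unit hypercube and see how far
# everything drifts from a reference point as dimensions are added.
for d in [1, 2, 3, 4, 1000]:
    points = rng.uniform(size=(1000, d))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    print(f"{d:>4} dims: mean distance to a fixed point = {dists.mean():.2f}")
```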

Dimensionality Reduction

Feature Extraction

Do I have to choose the dimensions from among the existing features? No: we can construct new axes as combinations of the existing features.

[Figure: scatter plot, axes Cigarettes per day × Height, with a new diagonal axis drawn through the data]

Why do we need dimensionality reduction?

- To better perform analyses
- …without sacrificing the information we get from our features
- To better visualize our data

What is the intuition behind PCA?

[Figure: scatter plot of Variable 2 vs. Variable 1 (here, Height vs. Cigarettes per day), with PC 1 drawn along the direction of greatest variance and PC 2 orthogonal to it]

Ducks and Bunnies

[Figure: the same scatter plot with rotated axes PC 1 and PC 2 overlaid on Height and Cigarettes per day]

PC 1 = 0.398 × (Height) + 0.602 × (Cigarettes)
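The 0.398/0.602 weights come from the talk's figure; here is a hedged sketch (synthetic height/cigarettes data, so the fitted weights will differ) showing where such weights live in scikit-learn: each row of `components_` holds the coefficients that blend the original features into one principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic, correlated stand-ins for the talk's two features.
height = rng.normal(170, 10, size=200)
cigarettes = 0.5 * height + rng.normal(0, 5, size=200)
X = np.column_stack([height, cigarettes])

pca = PCA(n_components=2).fit(X)

# Each row of components_ is one principal component's weights,
# i.e. PC_i = w1 * Height + w2 * Cigarettes.
for i, (w1, w2) in enumerate(pca.components_, start=1):
    print(f"PC {i} = {w1:+.3f} × (Height) {w2:+.3f} × (Cigarettes)")
```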

Advantage: you retain more information.
Disadvantage: you lose interpretability.

2D: Healthy_or_not = logit( β1(Height) + β2(Cigarettes per day) )

Feature selection, 1D: Healthy_or_not = logit( β1(Height) )

Feature extraction, 1D: Healthy_or_not = logit( β1(0.4·Height + 0.6·Cigarettes per day) )
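A minimal sketch of the selection-versus-extraction contrast above, on synthetic data (the label rule and all names here are illustrative assumptions, not the talk's dataset): both models are one-dimensional, but the extracted feature blends information from both original columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Columns: height, cigarettes per day (synthetic); y: healthy or not.
X = rng.normal(size=(300, 2))
y = (0.4 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

# Feature selection: a 1D logit on a single original column.
selection = LogisticRegression().fit(X[:, [0]], y)

# Feature extraction: a 1D logit on the single leading PCA component,
# which is a weighted blend of *both* original columns.
extraction = make_pipeline(StandardScaler(), PCA(n_components=1),
                           LogisticRegression()).fit(X, y)

print("selection accuracy: ", selection.score(X[:, [0]], y))
print("extraction accuracy:", extraction.score(X, y))
```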

3D → 2D Feature Extraction (PCA)

Find the optimum plane: the 2D plane through the 3D cloud of (Height, Cigarettes, Exercise) points that captures the most variance, then project the points onto it.

[Figure: 3D scatter plot with the optimum plane fitted through it, axes Height × Cigarettes × Exercise]

The two axes of that plane are linear combinations of the original features:

PC 1 = A1 × (Height) + B1 × (Cigarettes) + C1 × (Exercise)
PC 2 = A2 × (Height) + B2 × (Cigarettes) + C2 × (Exercise)
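A short sketch of the 3D-to-2D projection (synthetic data standing in for height/cigarettes/exercise): `fit_transform` finds the optimum plane and returns each point's coordinates on it, and `components_` holds the A/B/C weights of the two new axes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic 3D cloud standing in for (Height, Cigarettes, Exercise).
mixing = np.array([[3.0, 0.5, 0.2],
                   [0.5, 2.0, 0.1],
                   [0.2, 0.1, 1.0]])
X = rng.normal(size=(500, 3)) @ mixing

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)          # coordinates on the optimum plane

print(X.shape, "->", X_2d.shape)     # (500, 3) -> (500, 2)
print(pca.components_)               # rows: the A/B/C weights per new axis
```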

Singular Value Decomposition

The eigenvectors and eigenvalues of the covariance (or correlation) matrix represent the "core" of a PCA:

The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude.

In other words, the eigenvalues explain the variance of the data along the new feature axes.
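This correspondence is easy to check directly. A hedged sketch on random synthetic data: eigendecomposing the covariance matrix with NumPy reproduces, up to the sign of the vectors, what scikit-learn's PCA reports as explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
X[:, 1] += 2 * X[:, 0]                 # inject some correlation

# The "core" of PCA: eigenvectors/eigenvalues of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: symmetric input
order = np.argsort(eigenvalues)[::-1]             # largest variance first

print("eigenvalues: ", eigenvalues[order])
print("sklearn says:", PCA().fit(X).explained_variance_)
```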

PCA Math

Matrix Selection: Correlation or Covariance?

Use the correlation matrix to calculate the principal components if the variables are measured on different scales and you want to standardize them, or if the variances differ widely between variables. In all other situations you can use either the covariance or the correlation matrix.
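The practical version of that rule, as an assumed illustration (fake data on deliberately mismatched scales): standardizing the features first is equivalent to running PCA on the correlation matrix, and it stops the large-variance feature from swallowing PC 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic features on wildly different scales: height (cm), income ($).
X = np.column_stack([rng.normal(170, 10, 100),
                     rng.normal(50_000, 15_000, 100)])

# Covariance-matrix PCA: income's huge variance dominates PC 1.
print(PCA().fit(X).explained_variance_ratio_)

# Standardize first ≈ correlation-matrix PCA: features get equal footing.
X_std = StandardScaler().fit_transform(X)
print(PCA().fit(X_std).explained_variance_ratio_)
```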

How do I know how many dimensions to reduce by?

Kaiser Method: retain any components with eigenvalues greater than 1.

Scree Test: plot the variance explained by each component; ideally you will see a clear drop-off (elbow), and you keep the components before it.

Percent Variance Explained: sum the variance explained by each component and stop once the cumulative total reaches a chosen threshold (e.g., 90%).
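All three rules of thumb are a few lines each in scikit-learn. A sketch on synthetic standardized data (the 90% threshold is an assumed example, not the talk's recommendation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = StandardScaler().fit_transform(rng.normal(size=(300, 10)))

pca = PCA().fit(X)

# Kaiser method: keep components whose eigenvalue exceeds 1.
print("Kaiser keeps:", int(np.sum(pca.explained_variance_ > 1)), "components")

# Scree test: eyeball these values for a clear drop-off (elbow).
print("scree values:", np.round(pca.explained_variance_ratio_, 3))

# Percent variance explained: stop once the running total reaches 90%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components needed for 90%:", int(np.searchsorted(cumulative, 0.90)) + 1)
```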

What is the intuition behind PCA?

- We are attempting to resolve the curse of dimensionality
- by shifting our perspective
- and keeping the eigenvectors that explain the highest amount of variance.
- We select those components based on our end goal, or by particular methods (Kaiser, Scree, % Variance).
