Principal Components Analysis - PyBay 2016
![Page 1: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/1.jpg)
Dimensionality Reduction using Principal Components Analysis
Rumman Chowdhury, Senior Data Scientist
@ruchowdh rummanchowdhury.com thisismetis.com
![Page 2: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/2.jpg)
Me: Political Science PhD, Data Scientist, Teacher, Do-Gooder. Check me out on twitter: @ruchowdh, or on my website: rummanchowdhury.com (psst, I post cool jobs there)
What’s Metis? Metis accelerates the careers of data scientists by providing full-time immersive bootcamps, evening part-time professional development courses, online training, and corporate programs.
Who is Rumman? What’s a Metis?
![Page 3: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/3.jpg)
What is PCA?
Why do we need dimensionality reduction?
Intuition behind Principal Components Analysis
Coding example
![Page 4: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/4.jpg)
What is Principal Components Analysis?
![Page 5: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/5.jpg)
![Page 6: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/6.jpg)
![Page 7: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/7.jpg)
What is PCA?
- A shift in perspective
- A reduction in the number of dimensions
![Page 8: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/8.jpg)
Why do we need dimensionality reduction?
![Page 9: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/9.jpg)
Curse of Dimensionality
![Page 10: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/10.jpg)
One dimension: Small space. Being close quite probable.
Cigarettes per day
Curse of Dimensionality
![Page 11: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/11.jpg)
Two dimensions
Height
Cigarettes per day
Curse of Dimensionality
![Page 12: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/12.jpg)
Height
Two dimensions: More space but still not so much. Being close not improbable.
Cigarettes per day
Curse of Dimensionality
![Page 13: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/13.jpg)
Three dimensions
Height
Cigarettes per day
Exercise
Curse of Dimensionality
![Page 14: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/14.jpg)
Three dimensions: Much larger space. Being close less probable.
Height
Cigarettes per day
Exercise
Curse of Dimensionality
![Page 15: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/15.jpg)
Four dimensions
Height
Age
Cigarettes per day
Exercise
Curse of Dimensionality
![Page 16: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/16.jpg)
Four dimensions: Omg so much space. Being close quite improbable.
Age
Height
Cigarettes per day
Exercise
Curse of Dimensionality
![Page 17: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/17.jpg)
Thousand dimensions: Helloooo… hellooo.. helloo… Can anybody hear meee.. mee.. mee.. mee.. So alone….
Curse of Dimensionality
![Page 18: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/18.jpg)
Thousand dimensions: I specified you with such high resolution, with so much detail, that you don’t look like anybody else anymore. You’re unique.
Curse of Dimensionality
![Page 19: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/19.jpg)
Height
Classification, clustering and other analysis methods become exponentially difficult with increasing dimensions.
Cigarettes per day
Curse of Dimensionality
![Page 20: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/20.jpg)
Height
Classification, clustering and other analysis methods become exponentially difficult with increasing dimensions.
To understand how to divide that huge space, we need a whole lot more data (usually much more than we do or can have).
Cigarettes per day
Curse of Dimensionality
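The "more space, being close less probable" idea can be checked numerically. The following is a minimal sketch, not part of the original talk: it assumes points drawn uniformly at random from a unit hypercube, and the helper name `mean_nearest_neighbor_distance` is hypothetical. It keeps the number of points fixed and measures how far each point is from its nearest neighbor as dimensions are added.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average distance from each point to its closest other point.

    Points are sampled uniformly from the unit hypercube [0, 1]^n_dims
    (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force search over all other points.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


# Same number of points, increasing number of dimensions.
distances = {d: mean_nearest_neighbor_distance(200, d) for d in (1, 2, 3, 4, 1000)}
for d, dist in distances.items():
    print(f"{d:>5} dimensions: mean nearest-neighbor distance = {dist:.3f}")
```

With the sample size held constant, the nearest neighbor drifts farther away as dimensions are added, which is why methods that rely on "nearby" points need far more data in high dimensions.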
![Page 21: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/21.jpg)
Height
Lots of features, lots of data is best. But what if you don’t have the luxury of ginormous amounts of data? Not all features provide the same amount of information. We can reduce the dimensions (compress the data) without necessarily losing too much information.
Cigarettes per day
Dimensionality Reduction
![Page 22: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/22.jpg)
Feature Extraction: Do I have to choose the dimensions among existing features?
Height
Cigarettes per day
![Page 23: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/23.jpg)
Feature Extraction: Do I have to choose the dimensions among existing features?
Height
Cigarettes per day
![Page 24: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/24.jpg)
Why do we need dimensionality reduction?
- To better perform analyses
- …without sacrificing the information we get from our features
- To better visualize our data
![Page 25: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/25.jpg)
What is the intuition behind PCA?
![Page 26: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/26.jpg)
Variable 1
Variable 2
![Page 27: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/27.jpg)
![Page 28: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/28.jpg)
Height
Cigarettes per day
PC 1
PC 2
![Page 29: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/29.jpg)
Ducks and Bunnies
PC 1
PC 2
![Page 30: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/30.jpg)
![Page 31: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/31.jpg)
Height
Cigarettes per day
0.398 (Height) + 0.602 (Cigarettes)
![Page 32: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/32.jpg)
Height
Cigarettes
0.398 (Height) + 0.602 (Cigarettes)
![Page 33: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/33.jpg)
Advantage: You retain more information.
Disadvantage: You lose interpretability.
2D Healthy_or_not = logit( β1(Height) + β2(Cigarettes per day) )
Feature selection 1D Healthy_or_not = logit( β1(Height) )
Feature extraction 1D Healthy_or_not = logit( β1(0.4*Height + 0.6*Cigarettes per day) )
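The trade-off between feature selection and feature extraction can be sketched with NumPy. This is an illustration under stated assumptions, not the speaker's code: the data are synthetic, the two columns are standardized, and the variable names are hypothetical. Note that the weights this produces are unit-length eigenvector entries, so they will not match the illustrative 0.4 and 0.6 on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated stand-ins for the two features (assumption).
n = 500
height = rng.normal(size=n)
cigarettes = 0.8 * height + 0.6 * rng.normal(size=n)
X = np.column_stack([height, cigarettes])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column

# Feature selection: keep one original column and drop the other.
selected = X[:, 0]

# Feature extraction: project onto the first principal component.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1_weights = eigenvectors[:, -1]                # direction of largest variance
extracted = X @ pc1_weights

total_variance = X.var(axis=0, ddof=1).sum()
share_selected = selected.var(ddof=1) / total_variance
share_extracted = extracted.var(ddof=1) / total_variance

print("PC1 weights (Height, Cigarettes):", pc1_weights)
print(f"Variance kept by selecting Height only: {share_selected:.2f}")
print(f"Variance kept by extracting PC1:        {share_extracted:.2f}")
```

Selecting a single standardized column keeps exactly half of the total variance, while the extracted component keeps more whenever the two features are correlated. The price is that the new feature is a blend of both originals, which is the loss of interpretability noted above.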
![Page 34: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/34.jpg)
3D → 2D Feature Extraction (PCA)
Height
Cigarettes
Exercise
![Page 35: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/35.jpg)
3D → 2D Feature Extraction (PCA) Optimum plane
Height
Cigarettes
Exercise
![Page 36: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/36.jpg)
Cigarettes
Height
3D → 2D Feature Extraction (PCA) Optimum plane
Exercise
A1 *(Height) + B1 *(Cigarettes) + C1 *(Exercise)
A2 *(Height) + B2 *(Cigarettes) + C2 *(Exercise)
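The two weighted sums above are the axes of the optimum plane. As a rough sketch (synthetic data, hypothetical variable names, and SVD chosen here as one common way to compute PCA), the weights A1, B1, C1 and A2, B2, C2 can be read off as the first two rows of `Vt`:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 3D data that lies close to a 2D plane (assumption for illustration).
n = 300
latent = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 0.5, -0.3],
                   [0.2, -1.0, 0.8]])
X = latent @ mixing + 0.05 * rng.normal(size=(n, 3))  # columns: Height, Cigarettes, Exercise

# PCA via SVD of the centered data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; keep the first two.
components = Vt[:2]                # row 0: (A1, B1, C1), row 1: (A2, B2, C2)
X_2d = X_centered @ components.T   # coordinates of each point on the plane

explained = S**2 / np.sum(S**2)    # share of variance along each direction

print("PC 1 weights:", components[0])
print("PC 2 weights:", components[1])
print("Variance kept by the 2D plane:", explained[:2].sum())
```

Each row of `X_2d` is one observation expressed in the two new coordinates, so a 3-column table becomes a 2-column table.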
![Page 37: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/37.jpg)
Singular Value Decomposition
The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the "core" of a PCA:
The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude.
In other words, the eigenvalues explain the variance of the data along the new feature axes.
PCA Math
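A short numerical check of these statements, offered as a sketch on synthetic data rather than as the talk's own example: the eigenvalues of the covariance matrix equal the variances of the data along the eigenvector directions, and the same values can be recovered from the singular values of the centered data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated data with three features (assumption for illustration).
mix = np.array([[2.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.1, 0.2, 0.3]])
X = rng.normal(size=(200, 3)) @ mix
X_centered = X - X.mean(axis=0)

# Eigen-decomposition of the covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order; sort so the largest eigenvalue comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the new axes and measure the variance along each one.
scores = X_centered @ eigenvectors
variance_along_axes = scores.var(axis=0, ddof=1)

# The same eigenvalues from the singular values of the centered data.
singular_values = np.linalg.svd(X_centered, compute_uv=False)
eigenvalues_from_svd = singular_values**2 / (len(X) - 1)

print("Eigenvalues:         ", eigenvalues)
print("Variance along axes: ", variance_along_axes)
print("From singular values:", eigenvalues_from_svd)
```

The three printed rows agree up to floating-point error, which is the link between the eigenvalue view and the SVD view of PCA.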
![Page 38: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/38.jpg)
Correlation or Covariance Matrix?
Use the correlation matrix to calculate the principal components if variables are measured on different scales and you want to standardize them, or if the variances differ widely between variables. You can use the covariance or correlation matrix in all other situations.
Matrix Selection
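To see why the choice matters, here is a minimal sketch with made-up units (height in centimeters, cigarettes per day, exercise in hours per week; all values synthetic and independent by construction). It shows that the correlation matrix is the covariance matrix of standardized data, and that on raw data the first component mostly follows whichever feature has the largest variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Synthetic features on very different scales (assumption for illustration).
X = np.column_stack([
    rng.normal(170, 10, size=n),  # height in cm, variance around 100
    rng.normal(10, 3, size=n),    # cigarettes per day, variance around 9
    rng.normal(4, 2, size=n),     # exercise hours per week, variance around 4
])

cov = np.cov(X, rowvar=False)
corr = np.corrcoef(X, rowvar=False)

# The correlation matrix equals the covariance matrix of standardized data.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_standardized = np.cov(X_std, rowvar=False)


def first_component(matrix):
    """Eigenvector belonging to the largest eigenvalue of a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    return eigenvectors[:, -1]


pc1_cov = first_component(cov)
pc1_corr = first_component(corr)

print("PC 1 from covariance matrix: ", pc1_cov)
print("PC 1 from correlation matrix:", pc1_corr)
```

With the covariance matrix, the first component points almost entirely along height simply because height has the largest numbers. Standardizing first removes that effect of units.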
![Page 39: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/39.jpg)
Kaiser Method: Retain any components with eigenvalues greater than 1.
Scree Test: Bar plot that shows the variance explained by each component. Ideally you will see a clear drop-off (elbow).
Percent Variance Explained: Calculate the cumulative sum of variance explained by each component, and stop when you reach a threshold you have chosen in advance.
How do I know how many dimensions to reduce by?
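The three rules can be sketched on the eigenvalues of a correlation matrix. The data below are synthetic, and the 90% threshold is an assumed example value, not a recommendation from the talk. The scree "plot" is printed as text to avoid a plotting dependency.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data: six observed features driven by two hidden factors (assumption).
n, p = 500, 6
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, p))
X = latent @ loadings + 0.3 * rng.normal(size=(n, p))

# Eigenvalues of the correlation matrix, largest first.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# Kaiser method: keep components whose eigenvalue exceeds 1.
n_kaiser = int((eigenvalues > 1).sum())

# Scree test: look for the elbow in the variance explained per component.
explained_ratio = eigenvalues / eigenvalues.sum()
for i, ratio in enumerate(explained_ratio, start=1):
    print(f"PC {i}: {ratio:6.1%} " + "#" * int(round(ratio * 50)))

# Percent variance explained: smallest number of components reaching the threshold.
threshold = 0.90  # assumed example value
cumulative = np.cumsum(explained_ratio)
n_threshold = int(np.searchsorted(cumulative, threshold) + 1)

print("Components kept by Kaiser method:", n_kaiser)
print(f"Components needed for {threshold:.0%} of variance:", n_threshold)
```

The methods do not have to agree, which is one reason the end goal of the analysis also matters when picking the number of components.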
![Page 40: Principal Components Analysis - PyBay 2016](https://reader034.vdocuments.site/reader034/viewer/2022051708/589b53a21a28ab4a398b6f17/html5/thumbnails/40.jpg)
What is the intuition behind PCA?
- We are attempting to resolve the curse of dimensionality
- by shifting our perspective
- and keeping the eigenvectors that explain the highest amount of variance.
- We select those components based on our end goal, or by particular methods (Kaiser, Scree, % Variance).