hands-on machine learning with scikit-learn and tensorflow - chapter8
TRANSCRIPT
![Page 1: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/1.jpg)
CHAPTER-08 Dimensionality Reduction
@St_Hakky
Hands-On Machine Learning with Scikit-Learn and TensorFlow
Github : https://github.com/ageron/handson-ml
![Page 3: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/3.jpg)
Why should we think about this topic?
• Many Machine Learning problems involve thousands or even millions of features for each training instance.
• Problems caused by the curse of dimensionality
  • Training becomes extremely slow
  • It becomes much harder to find a good solution
    • For example, we often learn a lot from data visualization, but that is difficult in high dimensions.
  • Much more training data is needed
![Page 4: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/4.jpg)
Reducing the number of features
• Fortunately, it is often possible to reduce the number of features considerably.
• If we can reduce the dimensionality without losing the information needed for the task…
  • Training becomes faster
  • It becomes much easier to find a good solution
  • Less training data is needed to solve the task
![Page 5: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/5.jpg)
MNIST : Example of Reducing Dimension
Pixels on the image borders are almost always white.
We can drop these dimensions without losing information.
For the classification task, many pixels are utterly unimportant.
Moreover, two neighboring pixels are often highly correlated
If we merge them into a single pixel, we will not lose much information.
![Page 6: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/6.jpg)
Reducing Dimension for Visualization
1. Can you understand what’s going on in this data? (42 dimensions)
Dimensionality reduction is also extremely useful for data visualization.
2. Reducing the number of dimensions down to two makes it possible to plot a high-dimensional training set on a graph
![Page 7: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/7.jpg)
Contents
• The Curse of Dimensionality
• Main Approaches for Dimensionality Reduction
• Projection
• Manifold Learning
• PCA
• Preserving the Variance
• Principal Components
• Projecting Down to d Dimensions
• Using Scikit-Learn
• Explained Variance Ratio
• Choosing the Right Number of Dimensions
• PCA for Compression
• Incremental PCA
• Randomized PCA
• Kernel PCA
• Selecting a Kernel and TuningHyperparameters
• LLE
• Other Dimensionality ReductionTechniques
• MDS
• SOM
• Isomap
• t-SNE
![Page 9: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/9.jpg)
The Curse of Dimensionality
Even a basic 4D hypercube is incredibly hard to picture, let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.
We live in three dimensions, so our intuition fails us when we try to imagine a high-dimensional space.
![Page 10: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/10.jpg)
https://youtu.be/BVo2igbFSPE
https://youtu.be/-x60xZe0Si0
![Page 11: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/11.jpg)
Example of our intuition failing
• Let’s think about picking a random point in a unit square (1 × 1 square).
  • It has only about a 0.4% chance of being located less than 0.001 from a border.
• What happens in a 10,000-dimensional unit hypercube?
  • This probability is greater than 99.999999%.
• Most points in a high-dimensional hypercube are very close to the border.
• This is quite counterintuitive.
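Both percentages above follow from one formula. As a minimal sketch (the function name `prob_near_border` is illustrative, not from the slides): a uniformly random point stays clear of both borders along one axis with probability 1 − 2 × 0.001, and the axes are independent.

```python
def prob_near_border(n_dims, margin=0.001):
    """Probability that a uniform random point in the unit hypercube
    lies less than `margin` from at least one border."""
    # Along a single axis the point is "safe" with probability 1 - 2 * margin.
    # The coordinates are independent, so the safe probabilities multiply.
    return 1.0 - (1.0 - 2.0 * margin) ** n_dims


for n_dims in (2, 10_000):
    print(f"{n_dims:>6} dimensions: {prob_near_border(n_dims):.10f}")
```

For the unit square this gives 1 − 0.998² ≈ 0.004, i.e. about 0.4%.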
![Page 12: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/12.jpg)
Example of our intuition failing
• Let’s think about another example here.
• Pick two points randomly in a unit square.
  • The distance between these two points will be, on average, roughly 0.52.
• But what about two points picked randomly in a 1,000,000-dimensional hypercube?
  • The average distance, believe it or not, will be about 408.25!
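These averages can be checked numerically. The sketch below is an illustration, not code from the slides: it estimates the mean distance by sampling random pairs, and it uses the large-dimension approximation √(d/6), which holds because each squared coordinate difference between two uniform points has expected value 1/6. The helper names are hypothetical.

```python
import numpy as np


def mean_pairwise_distance(n_dims, n_pairs=200, seed=42):
    """Monte Carlo estimate of the mean distance between two uniform
    random points in the n_dims-dimensional unit hypercube."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return float(np.linalg.norm(a - b, axis=1).mean())


def approx_distance(n_dims):
    """Large-d approximation: the squared distance has mean n_dims / 6."""
    return float(np.sqrt(n_dims / 6.0))


print(mean_pairwise_distance(2, n_pairs=20_000))   # unit square
print(mean_pairwise_distance(10_000))              # sampled, 10,000 dimensions
print(approx_distance(1_000_000))                  # approximation, 1,000,000 dimensions
```

The approximation is only accurate in high dimensions; for the unit square the sampled estimate is the one to trust.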
![Page 13: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/13.jpg)
Need more data in High Dimension
• These examples mean that a new instance will likely be far away from any training instance.
  • This makes predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations.
• In short, the more dimensions the training set has, the greater the risk of overfitting it.
• So, we need more data.
![Page 14: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/14.jpg)
One solution to the Curse of Dimensionality
• In theory, one solution is to increase the size of the training set to reach a sufficient density of training instances.
• Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
![Page 16: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/16.jpg)
Main Approaches for Dimensionality Reduction
• There are two main approaches to reducing dimensionality:
  • Projection
  • Manifold Learning
![Page 17: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/17.jpg)
Projection
• In most problems, training instances are not spread out uniformly across all dimensions.
  • As discussed earlier for MNIST
You can see a 3D dataset represented by the circles
All training instances actually lie within a much lower-dimensional subspace.
![Page 18: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/18.jpg)
3D→2D
You can see a 3D dataset represented by the circles
The new 2D dataset after projection
Project every training instance
We have just reduced the dataset’s dimensionality from 3D to 2D!!
The axes correspond to new features z1 and z2
![Page 19: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/19.jpg)
Projection is not always the best approach
In many cases the subspace may twist and turn.
Swiss roll
![Page 20: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/20.jpg)
Projection is not always the best approach
Simply projecting onto a plane would squash different layers of the Swiss roll together.
Swiss roll
Dropping x3
![Page 21: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/21.jpg)
Projection is not always the best approach
Simply projecting onto a plane would squash different layers of the Swiss roll together.
Swiss roll
What you really want is this.
Dropping x3
More examples: https://goo.gl/7ILsqR
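The squashing effect can be measured rather than only plotted. The following sketch is an illustration under stated assumptions: it uses Scikit-Learn's `make_swiss_roll` (whose second return value `t` is each point's position along the rolled-up sheet), treats the three columns as x1, x2, x3, and uses a hypothetical helper `mean_roll_gap`. It compares the naive projection (dropping x3) with the unrolled coordinates.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import NearestNeighbors

# X has columns (x1, x2, x3); t is the position of each point along the roll.
X, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

projected = X[:, :2]            # naive projection: keep x1 and x2, drop x3
unrolled = np.c_[t, X[:, 1]]    # unrolled sheet: position along the roll, plus x2


def mean_roll_gap(points_2d):
    """Average difference in roll position t between each point and its
    nearest neighbor in the given 2D representation."""
    nn = NearestNeighbors(n_neighbors=2).fit(points_2d)
    _, idx = nn.kneighbors(points_2d)
    # idx[:, 0] is the point itself, idx[:, 1] is its nearest other point.
    return float(np.mean(np.abs(t - t[idx[:, 1]])))


gap_projected = mean_roll_gap(projected)
gap_unrolled = mean_roll_gap(unrolled)
print(gap_projected, gap_unrolled)
```

A large gap for the projection means that neighbors in the 2D plot often come from different layers of the roll, which is exactly the squashing described above.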
![Page 22: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/22.jpg)
Manifold
• What is a manifold?
  • A d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane.
• A 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space.
d = 2 and n = 3
Example of a 2D manifold : Swiss roll
![Page 23: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/23.jpg)
Manifold Learning
• What is Manifold Learning?
  • Modeling the manifold on which the training instances lie.
• It relies on the manifold assumption (manifold hypothesis)
  • Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
![Page 24: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/24.jpg)
Once again, MNIST example
• Handwritten digit images have some similarities.
  • They are made of connected lines
  • The borders are white
  • They are more or less centered…
• These constraints tend to squeeze the dataset into a lower dimensional manifold.
![Page 25: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/25.jpg)
Manifold assumption
• The manifold assumption is often accompanied by another implicit assumption:
  • The task at hand will be simpler if expressed in the lower-dimensional space of the manifold.
This can be split into two classes
The decision boundary would be fairly complex, but…
The decision boundary is a simple straight line.
![Page 26: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/26.jpg)
This assumption does not always hold
It looks more complex in the unrolled manifold.
x1 = 5
This decision boundary looks very simple in the original 3D space
![Page 27: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/27.jpg)
Only the Data Tells You Whether Dimensionality Reduction Helps
If you reduce the dimensionality of your training set before training a model:
It will definitely speed up training.
But it may not always lead to a better or simpler solution: this all depends on the dataset.
![Page 29: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/29.jpg)
Principal Component Analysis (PCA)
• PCA is the most popular dimensionality reduction algorithm.
• PCA has two steps:
  1. It identifies the hyperplane that lies closest to the data.
  2. It projects the data onto it.
![Page 30: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/30.jpg)
Preserving the Variance
Before projecting the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane.
The figure shows the projection of the dataset onto each of the candidate axes.
Selecting the axis that preserves the maximum amount of variance will most likely lose less information than the other projections.
![Page 31: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/31.jpg)
Another way to choose the axis
• Another way is to choose the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis.
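The two criteria select the same axis. A small numerical sketch of why, assuming a hypothetical 2D dataset and candidate axes through the data's mean: by the Pythagorean theorem, the variance kept along an axis plus the mean squared distance to that axis always equals the total variance, so maximizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2D dataset, stretched so that one direction clearly dominates.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X_centered = X - X.mean(axis=0)
total_variance = np.mean(np.sum(X_centered ** 2, axis=1))


def variance_and_mse(angle):
    """Variance preserved along the axis at `angle`, and the mean squared
    distance between the data and its projection onto that axis."""
    w = np.array([np.cos(angle), np.sin(angle)])   # unit vector of the axis
    z = X_centered @ w                             # coordinates along the axis
    projection = np.outer(z, w)                    # projected points, back in 2D
    variance = np.mean(z ** 2)
    mse = np.mean(np.sum((X_centered - projection) ** 2, axis=1))
    return variance, mse


angles = np.linspace(0.0, np.pi, 180, endpoint=False)
results = np.array([variance_and_mse(a) for a in angles])
best_by_variance = angles[np.argmax(results[:, 0])]
best_by_mse = angles[np.argmin(results[:, 1])]
print(np.degrees(best_by_variance), np.degrees(best_by_mse))
```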
![Page 33: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/33.jpg)
PCA identifies the axis
PCA identifies the axis that accounts for the largest amount of variance.
It also finds a second axis that is orthogonal to the first one and accounts for the largest amount of remaining variance.
![Page 34: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/34.jpg)
Principal Components
1st axis
2nd axis
So how can you find the principal components of a training set?
The unit vector that defines the ith axis is called the ith principal component (PC).
![Page 35: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/35.jpg)
Singular Value Decomposition(SVD)
SVD can decompose the training set matrix $X$ into the dot product of three matrices, $X = U \cdot \Sigma \cdot V^T$, where the rows of $V^T$ (the columns of $V$) are the principal components.
Python Code of SVD
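A minimal sketch of this decomposition with NumPy's `svd()` function. The small 3D dataset is synthetic and hypothetical (it is not the dataset shown in the slides), and the variable names are illustrative. Note that PCA assumes the data is centered around the origin, so the mean is subtracted first.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 3D dataset lying close to a 2D plane.
plane = rng.normal(size=(60, 2)) @ np.array([[2.0, 0.5, 0.3], [0.2, 1.0, 0.4]])
X = plane + 0.05 * rng.normal(size=(60, 3))

# Center the data, then decompose it.
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)

# The principal components are the rows of Vt (the columns of V).
c1 = Vt[0]   # 1st principal component
c2 = Vt[1]   # 2nd principal component
print(c1, c2)
```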
![Page 37: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/37.jpg)
Projecting Down to 𝑑 Dimensions
You can reduce the dimensionality of the dataset down to 𝑑 dimensions by projecting it onto the hyperplane defined by the first 𝑑 principal components.
![Page 38: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/38.jpg)
Projecting Down to 𝑑 Dimensions
$W_d$ is defined as the matrix whose columns are the first $d$ principal components.
To project the training set onto the hyperplane, you can simply compute the dot product of the training set matrix $X$ by $W_d$: $X_{d\text{-proj}} = X \cdot W_d$
The following Python code projects the training set onto the plane defined by the first two principal components:
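A minimal sketch of such a projection, assuming a synthetic 3D dataset (hypothetical, not the one in the slides) and the names `W2` and `X2D` for the component matrix and the projected data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical 3D dataset lying close to a 2D plane.
plane = rng.normal(size=(60, 2)) @ np.array([[2.0, 0.5, 0.3], [0.2, 1.0, 0.4]])
X = plane + 0.05 * rng.normal(size=(60, 3))

X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

W2 = Vt.T[:, :2]         # matrix whose columns are the first two principal components
X2D = X_centered @ W2    # projection onto the plane they define
print(X2D.shape)
```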
![Page 40: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/40.jpg)
Using Scikit-Learn
Scikit-Learn’s PCA class implements PCA using SVD, just like we did before.
※ It automatically takes care of centering the data.
After fitting the PCA, you can access the principal components using the components_ attribute.
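A minimal usage sketch, again on a synthetic 3D dataset that is only an assumption for illustration. Each row of `components_` is one principal component, so the first row is the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical 3D dataset lying close to a 2D plane.
plane = rng.normal(size=(60, 2)) @ np.array([[2.0, 0.5, 0.3], [0.2, 1.0, 0.4]])
X = plane + 0.05 * rng.normal(size=(60, 3))

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)       # no manual centering needed

first_pc = pca.components_[0]    # unit vector of the first principal component
print(first_pc)
```

The sign of a component may differ from the one obtained with a manual SVD; both describe the same axis.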
![Page 42: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/42.jpg)
Explained Variance Ratio
The explained variance ratio is the proportion of the dataset’s variance that lies along the axis of each principal component.
In this example, 84.2% of the dataset’s variance lies along the first axis, and 14.6% lies along the second axis.
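In Scikit-Learn these ratios are available through the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses a hypothetical synthetic dataset, so its ratios will not match the 84.2% and 14.6% quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical 3D dataset lying close to a 2D plane.
plane = rng.normal(size=(60, 2)) @ np.array([[2.0, 0.5, 0.3], [0.2, 1.0, 0.4]])
X = plane + 0.05 * rng.normal(size=(60, 3))

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_
print(ratios)               # one ratio per principal component, in decreasing order
print(1.0 - ratios.sum())   # share of the variance left on the dropped axis
```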
![Page 44: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/44.jpg)
Choosing the Right Number of Dimensions
• Generally, it is preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%).
![Page 45: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/45.jpg)
Sample Code
![Page 46: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/46.jpg)
Sample Code
Computing PCA without reducing dimensionality
![Page 47: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/47.jpg)
Sample Code
Computing PCA without reducing dimensionality
Computing the minimum number of dimensions required to preserve 95% of the training set’s variance
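A sketch of these two steps. As an assumption made to keep the example self-contained, it uses Scikit-Learn's bundled 8 × 8 digits dataset (`load_digits`, 64 features) as a small stand-in for MNIST; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data   # stand-in for MNIST: 1,797 images, 64 features

# Compute PCA without reducing dimensionality.
pca = PCA()
pca.fit(X)

# Find the minimum number of dimensions that preserves 95% of the variance.
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95)) + 1
print(d)
```

`np.argmax` returns the index of the first `True`, and adding 1 turns that index into a number of dimensions.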
![Page 48: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/48.jpg)
Sample Code
There is a much better way:
You can set n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve.
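A sketch of this shortcut, using Scikit-Learn's bundled digits dataset as a stand-in for MNIST (an assumption for the sake of a self-contained example). Passing `svd_solver="full"` explicitly is a precaution: Scikit-Learn documents the float form of `n_components` for the full solver.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data   # stand-in for MNIST

# Keep as many components as needed to preserve 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                      # number of dimensions actually kept
print(pca.explained_variance_ratio_.sum())    # variance actually preserved
```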
![Page 49: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/49.jpg)
Plot the explained variance
Another option is to plot the explained variance as a function of the number of dimensions.
Elbow = the point where the explained variance stops growing fast.
You can think of the elbow point as the intrinsic dimensionality of the dataset.
![Page 51: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/51.jpg)
PCA for Compression
Example: Applying PCA to MNIST while preserving 95% of its variance
Obviously, after dimensionality reduction, the training set takes up much less space.
Each instance will have just over 150 features, instead of the original 784 features.
The dataset is now less than 20% of its original size!
This is a reasonable compression ratio, and it can speed up a classification algorithm tremendously.
![Page 52: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/52.jpg)
Decompress the reduced datasets
You can also decompress the reduced dataset by applying the inverse transformation of the PCA projection.
The equation of the inverse transformation: $X_{\text{recovered}} = X_{d\text{-proj}} \cdot W_d^T$
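A sketch of compression followed by decompression with `inverse_transform()`. It assumes Scikit-Learn's bundled digits dataset as a stand-in for MNIST, and the names `X_recovered_manual` and `reconstruction_error` are illustrative. Because the projection dropped some variance, the recovered data is close to the original but not identical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data   # stand-in for MNIST

pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)                  # compress
X_recovered = pca.inverse_transform(X_reduced)    # decompress

# The same inverse transformation by hand: multiply by the transposed
# component matrix, then add back the mean that PCA removed when centering.
X_recovered_manual = X_reduced @ pca.components_ + pca.mean_

# Mean squared distance between the original and the reconstructed data.
reconstruction_error = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
print(reconstruction_error)
```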
![Page 54: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/54.jpg)
Incremental PCA(IPCA)
• One problem with the preceding implementation of PCA
  • It requires the whole training set to fit in memory for the SVD.
• Incremental PCA (IPCA) algorithms have been developed
  • Split the training set into mini-batches
  • Feed an IPCA algorithm one mini-batch at a time.
• This is useful for large training sets, and also to apply PCA online.
![Page 55: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/55.jpg)
Sample Code
![Page 56: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/56.jpg)
Sample Code
Splitting the MNIST dataset into 100 mini-batches
![Page 57: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/57.jpg)
Sample Code
Feeding them to Scikit-Learn’s IPCA class
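A sketch of both steps with Scikit-Learn's `IncrementalPCA` class. To stay self-contained it assumes the bundled digits dataset as a stand-in for MNIST, and therefore uses 10 mini-batches and 20 components rather than the figures on the slides. Note that `partial_fit()` is called for each mini-batch, not `fit()` on the whole training set.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X = load_digits().data   # stand-in for MNIST

n_batches = 10
inc_pca = IncrementalPCA(n_components=20)
for X_batch in np.array_split(X, n_batches):
    inc_pca.partial_fit(X_batch)      # update the model with one mini-batch

X_reduced = inc_pca.transform(X)
print(X_reduced.shape)
```

Each mini-batch must contain at least as many instances as `n_components`.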
![Page 58: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/58.jpg)
Another Sample Code
NumPy’s memmap class allows you to manipulate a large array stored in a binary file on disk.
The class loads only the data it needs in memory, when it needs it.
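A sketch of the memmap variant. The assumptions are labeled in the comments: the bundled digits dataset stands in for MNIST, the array is first written to a temporary file to simulate a dataset stored on disk, and the file name `digits.dat` is hypothetical. With a memory-mapped array, the usual `fit()` method can be called with a `batch_size`.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X = load_digits().data.astype(np.float64)   # stand-in for MNIST
m, n = X.shape

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "digits.dat")   # hypothetical file name

    # Simulate a large dataset stored in a binary file on disk.
    X_on_disk = np.memmap(filename, dtype="float64", mode="w+", shape=(m, n))
    X_on_disk[:] = X
    X_on_disk.flush()
    del X_on_disk

    # Re-open the file read-only and use it as if it were in memory.
    X_mm = np.memmap(filename, dtype="float64", mode="r", shape=(m, n))
    inc_pca = IncrementalPCA(n_components=20, batch_size=m // 10)
    inc_pca.fit(X_mm)
    X_reduced = np.asarray(inc_pca.transform(X_mm))
    del X_mm   # release the file before the temporary directory is removed

print(X_reduced.shape)
```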
![Page 60: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/60.jpg)
Randomized PCA(RPCA)
This is a stochastic algorithm that quickly finds an approximation of the first $d$ principal components.
Computational complexity ($m$ = number of instances, $n$ = number of original dimensions)
PCA (full SVD): $O(m \times n^2) + O(n^3)$
RPCA: $O(m \times d^2) + O(d^3)$
It is dramatically faster when $d$ is much smaller than $n$.
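In current Scikit-Learn versions, randomized PCA is selected through the `svd_solver="randomized"` option of the `PCA` class. A minimal sketch, assuming the bundled digits dataset as a stand-in for MNIST and 20 components chosen only for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data   # stand-in for MNIST

# Randomized solver: approximates the first 20 principal components.
rnd_pca = PCA(n_components=20, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)

# Full SVD, for comparison of the variance preserved.
full_pca = PCA(n_components=20, svd_solver="full").fit(X)

print(rnd_pca.explained_variance_ratio_.sum())
print(full_pca.explained_variance_ratio_.sum())
```

On a dataset this small the speed difference is negligible; the comparison only shows that the approximation preserves nearly the same variance.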
![Page 62: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/62.jpg)
Kernel Trick
• Kernel Trick
  • A mathematical technique that implicitly maps instances into a very high-dimensional space
• A linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.
![Page 63: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/63.jpg)
Kernel PCA(kPCA)
Applying the kernel trick to PCA makes it possible to perform complex nonlinear projections for dimensionality reduction. This is called kernel PCA (kPCA).
It is often good at preserving clusters of instances after projection.
![Page 64: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/64.jpg)
Sample Code
• Using Scikit-Learn’s KernelPCA class to perform kPCA with an RBF kernel
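A minimal sketch of this usage. The Swiss roll from `make_swiss_roll` and the value `gamma=0.04` are assumptions chosen for illustration, not values confirmed by the slides.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

# Hypothetical dataset: a noisy 3D Swiss roll.
X, t = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
print(X_reduced.shape)
```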
![Page 65: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/65.jpg)
Selecting a Kernel and Tuning Hyperparameters
• kPCA is an unsupervised learning algorithm:
  • There is no obvious performance measure to select the best kernel and hyperparameters.
• However, dimensionality reduction is often a preparation step for a supervised learning task.
![Page 66: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/66.jpg)
Grid Search
You can simply use grid search to select the kernel and hyperparameters.
The best kernel and hyperparameters are then available.
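A sketch of such a grid search, under several assumptions that are not from the slides: the data is a Swiss roll split into two classes by its roll position, the pipeline reduces to two dimensions with kPCA and then applies logistic regression, and the candidate kernels and `gamma` values are arbitrary. The step names `kpca` and `log_reg` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, t = make_swiss_roll(n_samples=400, noise=0.2, random_state=42)
y = t > 6.9   # hypothetical binary labels derived from the roll position

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])

# Search over the kPCA kernel and gamma; scoring uses the classifier's accuracy.
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 5),
    "kpca__kernel": ["rbf", "poly"],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
```

The selected kernel and `gamma` are the ones that give the best cross-validated classification accuracy at the end of the pipeline.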
![Page 67: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/67.jpg)
Selecting a Kernel and Tuning Hyperparameters with the Lowest Reconstruction Error
• Another approach is to select the kernel and hyperparameters that yield the lowest reconstruction error.
  • This time, the approach is entirely unsupervised.
• However, reconstruction is not as easy as with linear PCA.
![Page 68: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/68.jpg)
Example : Reconstruction is not easy
[Figure panels: the original Swiss roll 3D dataset; the resulting 2D dataset after kPCA is applied using an RBF kernel; the mapping of the dataset to an infinite-dimensional space by the kernel trick; the reconstruction pre-image]
![Page 69: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/69.jpg)
Example : Reconstruction is not easy
Reconstruction by kernel PCA: inverting the projection for an instance in the reduced space gives a reconstructed point in the feature space.
The reconstruction error would be calculated from this reconstructed point.
![Page 70: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/70.jpg)
Example : Reconstruction is not easy
Since the feature space is infinite-dimensional, we cannot compute the reconstructed point.
→ We cannot compute the true reconstruction error.
[Figure: same panels as on the previous slides, annotated "Reconstruction by kernel PCA" and "We calculate the reconstruction error by this."]
![Page 71: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/71.jpg)
Example : Reconstruction is not easy
Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image.
![Page 72: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/72.jpg)
Example : Reconstruction is not easy
You can measure the squared distance between the reconstruction pre-image and the original instance.
You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.
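A minimal sketch of this selection, assuming scikit-learn's `KernelPCA` with `fit_inverse_transform=True`, which learns an approximate mapping from the reduced space back to the original space. The candidate `gamma` values are assumptions for illustration, and kernels could be compared in the same loop:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

X, _ = make_swiss_roll(n_samples=500, noise=0.2, random_state=42)

errors = {}
for gamma in (0.01, 0.0433, 0.1):  # candidate values (assumption)
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True, random_state=42)
    X_reduced = kpca.fit_transform(X)               # 3D -> 2D
    X_preimage = kpca.inverse_transform(X_reduced)  # 2D -> approximate 3D pre-image
    # Mean squared distance between each instance and its pre-image.
    errors[gamma] = mean_squared_error(X, X_preimage)

# Keep the hyperparameter value with the lowest pre-image error.
best_gamma = min(errors, key=errors.get)
print(best_gamma, errors[best_gamma])
```

No labels are used here, which is what makes this approach unsupervised.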
![Page 73: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/73.jpg)
![Page 74: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/74.jpg)
Locally Linear Embedding (LLE)
• LLE is a powerful nonlinear dimensionality reduction method.
• It is a Manifold Learning technique that does not rely on projections like the previous algorithms.
• This makes it good at unrolling twisted manifolds.
• It works especially well when there is not too much noise.
Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
![Page 75: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/75.jpg)
Sample Code
Result
The Swiss roll is completely unrolled.
The distances between instances are locally well preserved.
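The sample code on these slides was an image and is not preserved in the transcript. A minimal sketch using scikit-learn's `LocallyLinearEmbedding`; the dataset parameters and `n_neighbors=10` are assumptions for illustration:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 3D Swiss roll; t is the position along the roll, useful for coloring a plot.
X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=41)

# Unroll to 2D, reconstructing each instance from its 10 closest neighbors.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_reduced = lle.fit_transform(X)
print(X_reduced.shape)
```

Plotting `X_reduced` colored by `t` is one way to check visually whether the roll has been unrolled.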
![Page 76: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/76.jpg)
Sample Code
Result
However, distances are not preserved on a larger scale: one part of the unrolled Swiss roll is squeezed, while another part is stretched.
![Page 77: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/77.jpg)
How LLE works
1. First, for each training instance $x^{(i)}$, the algorithm identifies its $k$ closest neighbors.
2. Then, it tries to reconstruct $x^{(i)}$ as a linear function of these neighbors.
• Find the weights $w_{i,j}$ such that the squared distance between $x^{(i)}$ and $\sum_{j=1}^{m} w_{i,j} x^{(j)}$ is as small as possible.
• $w_{i,j} = 0$ if $x^{(j)}$ is not one of the $k$ closest neighbors of $x^{(i)}$.
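The neighbor search and weight fitting described above can be sketched in NumPy. This is an illustrative implementation, not the code from the slides: the function name `lle_weights` is hypothetical, and the small regularization term `reg` is an assumption added for numerical stability when the neighbors are nearly collinear.

```python
import numpy as np

def lle_weights(X, k, reg=1e-3):
    """Return the m x m matrix of LLE reconstruction weights (a sketch)."""
    m = X.shape[0]
    W = np.zeros((m, m))
    for i in range(m):
        # Identify the k closest neighbors of x(i), excluding x(i) itself.
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf
        neighbors = np.argsort(dists)[:k]
        # Minimize ||x(i) - sum_j w_ij x(j)||^2 subject to sum_j w_ij = 1.
        D = X[neighbors] - X[i]              # neighbors centered on x(i)
        C = D @ D.T                          # local Gram matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)   # regularization (assumption)
        w = np.linalg.solve(C, np.ones(k))
        W[i, neighbors] = w / w.sum()        # normalize the weights
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W = lle_weights(X, k=5)
print(np.allclose(W.sum(axis=1), 1.0))
```

Weights for instances outside the neighborhood stay at zero because `W` is initialized with zeros and only the neighbor columns are filled.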
![Page 78: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/78.jpg)
![Page 79: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/79.jpg)
How LLE works
• Details of the first step of LLE:
• The first step of LLE is the constrained optimization problem described in Equation 8-4.
• The second constraint simply normalizes the weights for each training instance $x^{(i)}$.
• $\mathbf{W}$ is the weight matrix containing all the weights $w_{i,j}$.
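The equation itself was an image on the slide and is not preserved in the transcript. Reconstructed from the description of the first step (squared distance between $x^{(i)}$ and the weighted sum of the other instances, zero weights for non-neighbors, normalized weights), it can be written as follows; the symbol $\hat{\mathbf{W}}$ for the minimizer is an assumption:

```latex
\hat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}}
  \sum_{i=1}^{m} \left\| x^{(i)} - \sum_{j=1}^{m} w_{i,j}\, x^{(j)} \right\|^2
\quad \text{subject to} \quad
\begin{cases}
  w_{i,j} = 0 & \text{if } x^{(j)} \text{ is not one of the } k \text{ closest neighbors of } x^{(i)} \\
  \sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \dots, m
\end{cases}
```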
![Page 80: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/80.jpg)
![Page 81: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/81.jpg)
How LLE works
• Details of the second step of LLE:
• The second step is to map the training instances into a $d$-dimensional space (where $d < n$).
• If $z^{(i)}$ is the image of $x^{(i)}$ in this $d$-dimensional space, then we want the squared distance between $z^{(i)}$ and $\sum_{j=1}^{m} w_{i,j} z^{(j)}$ to be as small as possible.
• Note that $\mathbf{Z}$ is the matrix containing all $z^{(i)}$.
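The equation for this step was also an image on the slide. Reconstructed from the description above, with the weights $w_{i,j}$ held fixed at the values found in the first step, it can be written as follows; the symbol $\hat{\mathbf{Z}}$ for the minimizer is an assumption:

```latex
\hat{\mathbf{Z}} = \underset{\mathbf{Z}}{\operatorname{argmin}}
  \sum_{i=1}^{m} \left\| z^{(i)} - \sum_{j=1}^{m} w_{i,j}\, z^{(j)} \right\|^2
```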
![Page 82: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/82.jpg)
How LLE works
The optimization problems of the first and second steps look very similar.
![Page 83: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/83.jpg)
How LLE works
• First step: keep the instances fixed and find the optimal weights.
• Second step: do the reverse, keeping the weights fixed and finding the optimal positions of the instances in the low-dimensional space.
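A sketch of how the positions can be computed once the weights are fixed. The slides do not say how the minimization is solved; the eigenvector approach below follows the standard formulation of Roweis and Saul, and the function name `lle_embed` and the toy weight matrix are assumptions for illustration:

```python
import numpy as np

def lle_embed(W, d):
    """Map m instances to d dimensions given fixed LLE weights W (m x m)."""
    m = W.shape[0]
    # The cost sum_i ||z(i) - sum_j w_ij z(j)||^2 equals trace(Z^T M Z) with:
    I_minus_W = np.eye(m) - W
    M = I_minus_W.T @ I_minus_W
    # M is symmetric, so eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M)
    # The first eigenvector is constant (eigenvalue near 0, since each row
    # of W sums to 1); skip it and keep the next d eigenvectors.
    return eigvecs[:, 1:d + 1]

# Toy weights (assumption): 12 instances on a closed loop, each one
# reconstructed as the average of its two neighbors on the loop.
m = 12
W = np.zeros((m, m))
for i in range(m):
    W[i, (i - 1) % m] = 0.5
    W[i, (i + 1) % m] = 0.5

Z = lle_embed(W, d=2)
radii = np.linalg.norm(Z, axis=1)
# Check whether all embedded points lie at the same distance from the origin.
print(np.allclose(radii, radii[0]))
```

The toy weights only encode "each instance is the midpoint of its two neighbors", and the embedding recovers a closed curve from that local information alone.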
![Page 84: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/84.jpg)
![Page 85: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/85.jpg)
Other Dimensionality Reduction Techniques
• There are many other dimensionality reduction techniques.
• MDS
• Isomap
• t-SNE
![Page 86: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/86.jpg)
Multidimensional Scaling (MDS)
• MDS reduces dimensionality while trying to preserve the distances between the instances.
![Page 87: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/87.jpg)
Isomap
• Isomap first creates a graph by connecting each instance to its nearest neighbors.
• It then reduces dimensionality while trying to preserve the geodesic distances between the instances.
![Page 88: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/88.jpg)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
• t-SNE reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.
• It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space.
Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.
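The three techniques above are all available in `sklearn.manifold`. A minimal sketch that applies each one to a small Swiss roll; the dataset size and `n_neighbors=10` are assumptions chosen to keep the example fast:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap, TSNE

X, _ = make_swiss_roll(n_samples=200, noise=0.2, random_state=42)

# MDS: tries to preserve the distances between instances.
X_mds = MDS(n_components=2, random_state=42).fit_transform(X)

# Isomap: tries to preserve geodesic distances on a nearest-neighbor graph.
X_isomap = Isomap(n_components=2, n_neighbors=10).fit_transform(X)

# t-SNE: keeps similar instances close and dissimilar instances apart.
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

print(X_mds.shape, X_isomap.shape, X_tsne.shape)
```

All three expose the same `fit_transform` interface, so they can be swapped in and compared on the same data.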
![Page 89: Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8](https://reader034.vdocuments.site/reader034/viewer/2022050614/5aabb94f7f8b9aaf528b4857/html5/thumbnails/89.jpg)
Thank you