
Practical Deep Neural Networks
GPU computing perspective

Introduction

Yuhuang Hu Chu Kiong Loo

Advanced Robotic Lab
Department of Artificial Intelligence
Faculty of Computer Science & IT

University of Malaya


Outline

1 Introduction

2 Linear Algebra

3 Dataset Preparation
  – Data Standardization
  – Principal Components Analysis
  – ZCA Whitening


Introduction


Objectives

• A light introduction to numerical computation.

• Fundamentals of Machine Learning.

• Softmax Regression.

• Feed-forward Neural Networks.

• Convolutional Networks.

• Recurrent Neural Networks.


Prerequisites

• Basic training in Calculus

• Basic training in Linear Algebra
  – Matrix operations
  – Matrix properties: transform, rank, norm, determinant, etc.
  – Eigendecomposition, Singular Value Decomposition

• Basic programming skills
  – If-else conditions
  – Loops
  – Functions, classes, libraries
  – Source code control: Git (optional)


References

• Deep Learning: an MIT Press book in preparation. The main reference for this workshop; still in development, with an awesome structure and awesome contents.

• Machine Learning: A Probabilistic Perspective. One of the best machine learning books on the market.

• CS231n Convolutional Neural Networks for Visual Recognition Notes. Nicely structured, well written, loads of pictures.

• CS229 Machine Learning Course Materials. For basic knowledge; well written and easy to read.


Software and Tools

• Ubuntu 14.04

• CUDA Toolkit 7

• Python 2.7.9 (Why not 3.*?)

• Theano

• numpy, scipy, etc.

• Eclipse+PyDev
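As a quick sanity check of this stack, a minimal sketch (assuming Theano and numpy are already installed) that prints the library versions and confirms Theano can compile a trivial symbolic function:

    import numpy
    import theano
    import theano.tensor as T

    print("numpy version: %s" % numpy.__version__)
    print("theano version: %s" % theano.__version__)

    # Compile a trivial symbolic function to confirm Theano works.
    x = T.dscalar('x')
    f = theano.function([x], x ** 2)
    print("f(3.0) = %f" % f(3.0))  # expected: 9.000000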


Reading List

A reading list has been prepared for this workshop; all reading materials can be found at: http://rt.dgyblog.com/ref/ref-learning-deep-learning.html

The list keeps expanding!!


Linear Algebra


Scalar, Vector, Matrix and Tensor

Scalars: A scalar is a single number.

Vectors: A vector is an array of numbers.

Matrices: A matrix is a 2-D array of numbers.

Tensors: A tensor is an array of numbers arranged on a regular grid with a variable number of axes.
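As a minimal illustration in numpy (the array library used throughout this workshop), all four are ndarrays distinguished only by their number of axes:

    import numpy as np

    s = np.float64(3.0)                 # scalar: a single number
    v = np.array([1.0, 2.0, 3.0])       # vector: 1 axis, shape (3,)
    M = np.array([[1.0, 2.0],
                  [3.0, 4.0]])          # matrix: 2 axes, shape (2, 2)
    T3 = np.zeros((2, 3, 4))            # tensor: 3 axes, shape (2, 3, 4)

    print(v.ndim, M.ndim, T3.ndim)      # 1 2 3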


Matrix Operations and Properties

Identity: $I_n \in \mathbb{R}^{n \times n}$, $IA = AI = A$.

Transpose: $(A^\top)_{ij} = A_{ji}$.

Addition: $C = A + B$ where $C_{ij} = A_{ij} + B_{ij}$.

Scalar: $D = a \cdot B + c$ where $D_{ij} = a \cdot B_{ij} + c$.

Multiplication: $C = AB$ where $C_{ij} = \sum_k A_{ik} B_{kj}$.

Element-wise product: $C = A \odot B$ where $C_{ij} = A_{ij} B_{ij}$.

Distributivity: $A(B + C) = AB + AC$.

Associativity: $A(BC) = (AB)C$.

Transpose of a matrix product: $(AB)^\top = B^\top A^\top$.
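A short numpy sketch checking a few of these identities numerically (illustrative only; the names are arbitrary):

    import numpy as np

    rng = np.random.RandomState(0)
    A = rng.rand(3, 3)
    B = rng.rand(3, 3)
    C = rng.rand(3, 3)

    # Distributivity: A(B + C) = AB + AC
    print(np.allclose(np.dot(A, B + C), np.dot(A, B) + np.dot(A, C)))
    # Associativity: A(BC) = (AB)C
    print(np.allclose(np.dot(A, np.dot(B, C)), np.dot(np.dot(A, B), C)))
    # Transpose of a product: (AB)^T = B^T A^T
    print(np.allclose(np.dot(A, B).T, np.dot(B.T, A.T)))
    # Element-wise (Hadamard) product is just `*` on arrays
    print(np.allclose(A * B, B * A))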


Inverse Matrices

The matrix inverse of $A$ is denoted $A^{-1}$, and it is defined as the matrix such that

$$A^{-1}A = AA^{-1} = I_n$$

provided $A$ is not singular. If $A$ is not square, or is square but singular, it is still possible to find its generalized inverse, or pseudo-inverse.
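In numpy this corresponds to np.linalg.inv for square non-singular matrices and np.linalg.pinv for the Moore-Penrose pseudo-inverse; a minimal sketch:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])           # square and non-singular
    A_inv = np.linalg.inv(A)
    print(np.allclose(np.dot(A_inv, A), np.eye(2)))       # True

    B = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])      # non-square: use the pseudo-inverse
    B_pinv = np.linalg.pinv(B)
    print(np.allclose(np.dot(B, np.dot(B_pinv, B)), B))   # B B+ B = B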


Norms

We usually measure the size of vectors using an $L^p$ norm:

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$$

There are a few commonly used norms:

$L^1$ norm: $\|x\|_1 = \sum_i |x_i|$.

$L^\infty$ (max norm): $\|x\|_\infty = \max_i |x_i|$.

Frobenius norm: measures the size of a matrix, analogous to the $L^2$ norm:

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$$
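All of these norms are available through np.linalg.norm; a quick sketch:

    import numpy as np

    x = np.array([3.0, -4.0])
    print(np.linalg.norm(x, 1))        # L1 norm: 7.0
    print(np.linalg.norm(x, 2))        # L2 norm: 5.0
    print(np.linalg.norm(x, np.inf))   # max norm: 4.0

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    print(np.linalg.norm(A, 'fro'))    # Frobenius norm: sqrt(30)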


Unit vector, Orthogonal, Orthonormal, Orthogonal Matrix

A unit vector is a vector with unit norm:

$$\|x\|_2 = 1$$

Vectors $x$ and $y$ are orthogonal to each other if $x^\top y = 0$. In $\mathbb{R}^n$, at most $n$ vectors with nonzero norm may be mutually orthogonal. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

$$A^\top A = AA^\top = I$$

This implies that $A^{-1} = A^\top$.
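One easy way to obtain an orthogonal matrix numerically is the Q factor of a QR decomposition; a minimal sketch verifying the properties above:

    import numpy as np

    rng = np.random.RandomState(0)
    Q, _ = np.linalg.qr(rng.rand(4, 4))   # Q is orthogonal

    print(np.allclose(np.dot(Q.T, Q), np.eye(4)))   # Q^T Q = I
    print(np.allclose(np.linalg.inv(Q), Q.T))       # Q^{-1} = Q^T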


Eigendecomposition

An eigenvector of a square matrix $A$ is a non-zero vector $v$ such that

$$Av = \lambda v$$

where the scalar $\lambda$ is known as the eigenvalue corresponding to that eigenvector. If $v$ is an eigenvector of $A$, so is any rescaled vector $sv$ for $s \in \mathbb{R}$, $s \neq 0$. Therefore, we usually look only for unit eigenvectors.

We can represent the matrix $A$ using an eigendecomposition:

$$A = V \operatorname{diag}(\lambda) V^{-1}$$

where $V = [v^{(1)}, v^{(2)}, \ldots, v^{(n)}]$ (one column per eigenvector) and $\lambda = [\lambda_1, \ldots, \lambda_n]$.

Every real symmetric matrix can be decomposed into

$$A = Q \Lambda Q^\top$$

where $Q$ is an orthogonal matrix composed of eigenvectors of $A$, and $\Lambda$ is a diagonal matrix with $\Lambda_{ii}$ being the eigenvalues.
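In numpy, np.linalg.eig handles general square matrices and np.linalg.eigh the symmetric case; a small sketch of the symmetric decomposition:

    import numpy as np

    rng = np.random.RandomState(0)
    M = rng.rand(3, 3)
    A = (M + M.T) / 2.0                  # make a real symmetric matrix

    lam, Q = np.linalg.eigh(A)           # eigenvalues (ascending) and eigenvectors
    # A = Q diag(lam) Q^T
    print(np.allclose(np.dot(Q, np.dot(np.diag(lam), Q.T)), A))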


Singular Value Decomposition (SVD)

SVD is a general-purpose matrix factorization method that decomposes an $n \times m$ matrix $X$ into

$$X = U \Sigma W^\top$$

where $U$ is an $n \times n$ matrix whose columns are mutually orthonormal, $\Sigma$ is a rectangular diagonal matrix, and $W$ is an $m \times m$ matrix whose columns are mutually orthonormal. The elements along the main diagonal of $\Sigma$ are referred to as the singular values. The columns of $U$ and $W$ are referred to as the left-singular vectors and right-singular vectors of $X$, respectively.

• The left-singular vectors of $X$ are the eigenvectors of $XX^\top$.

• The right-singular vectors of $X$ are the eigenvectors of $X^\top X$.

• The non-zero singular values of $X$ are the square roots of the non-zero eigenvalues of both $XX^\top$ and $X^\top X$.
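A minimal numpy sketch relating np.linalg.svd to the eigendecompositions above:

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.rand(4, 3)

    U, s, Wt = np.linalg.svd(X)               # X = U S W^T
    S = np.zeros((4, 3))
    S[:3, :3] = np.diag(s)
    print(np.allclose(np.dot(U, np.dot(S, Wt)), X))   # reconstruction holds

    lam = np.linalg.eigvalsh(np.dot(X.T, X))  # eigenvalues of X^T X (ascending)
    print(np.allclose(np.sort(s ** 2), lam))  # squared singular values = eigenvalues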


Trace, Determinant

The trace operator is defined as

$$\operatorname{Tr}(A) = \sum_i A_{ii}$$

The Frobenius norm can be written with the trace operator:

$$\|A\|_F = \sqrt{\operatorname{Tr}(A^\top A)}$$

The determinant of a square matrix, denoted $\det(A)$, is a function mapping matrices to real scalars. The determinant is equal to the product of the matrix's eigenvalues.
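Checking both identities numerically with np.trace and np.linalg.det (a small sketch):

    import numpy as np

    rng = np.random.RandomState(0)
    A = rng.rand(3, 3)

    fro = np.sqrt(np.trace(np.dot(A.T, A)))
    print(np.allclose(fro, np.linalg.norm(A, 'fro')))       # Frobenius via trace

    eigvals = np.linalg.eigvals(A)
    print(np.allclose(np.prod(eigvals), np.linalg.det(A)))  # det = product of eigenvalues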


Dataset Preparation


MNIST

The MNIST dataset contains 60,000 training samples and 10,000 testing samples (LeCun, Bottou, Bengio, & Haffner, 1998). Each sample is a handwritten digit from 0-9.


CIFAR-10

The CIFAR-10 dataset consists of 60,000 images in 10 classes (Krizhevsky, 2009). There are 50,000 training images and 10,000 testing images.


Dataset: How to split the data?

• If the dataset has already been split, keep it that way.

• If the dataset comes as a whole, a common choice is 60% as training data, 20% as cross-validation data, and 20% as testing data (see the sketch after this list).

• Another alternative is 70% as training data and 30% as testing data.
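A minimal numpy sketch of the 60/20/20 split; it assumes the data arrive as arrays X and y, and all names are arbitrary:

    import numpy as np

    def split_dataset(X, y, seed=42):
        """Shuffle, then split into 60% train, 20% validation, 20% test."""
        rng = np.random.RandomState(seed)
        idx = rng.permutation(len(X))
        n_train = int(0.6 * len(X))
        n_valid = int(0.8 * len(X))
        train, valid, test = idx[:n_train], idx[n_train:n_valid], idx[n_valid:]
        return (X[train], y[train]), (X[valid], y[valid]), (X[test], y[test])

    # Example with toy data:
    X = np.arange(100).reshape(100, 1).astype('float64')
    y = np.arange(100)
    (Xtr, ytr), (Xva, yva), (Xte, yte) = split_dataset(X, y)
    print(Xtr.shape, Xva.shape, Xte.shape)   # (60, 1) (20, 1) (20, 1)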


Dataset: Cross validation

• Cross-validation is used to track the performance of the learning algorithm (overfitting? underfitting?).

• Usually a small portion of the training dataset, randomly selected.

• Serves as testing data when the real testing data are not visible.

• Helps to choose and adjust the parameters of the learning model.


Data Standardization

Mean subtraction, Unit Variance

Let $\mu$ be the mean image:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$

Then subtract the mean image from every image:

$$x^{(i)} := x^{(i)} - \mu$$

Let

$$\sigma_j^2 = \frac{1}{m} \sum_i \left( x_j^{(i)} \right)^2$$

and then

$$x_j^{(i)} := \frac{x_j^{(i)}}{\sigma_j}$$

This is also called PCA whitening if it is applied to the rotated data; the variances here can then be viewed as eigenvalues.
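A minimal numpy sketch of mean subtraction and unit variance, assuming a data matrix X of shape (m, n) with one example per row:

    import numpy as np

    def standardize(X, eps=1e-8):
        """Zero-mean, unit-variance standardization, feature-wise."""
        X = X - X.mean(axis=0)                  # subtract the mean image
        sigma = np.sqrt((X ** 2).mean(axis=0))  # per-feature standard deviation
        return X / (sigma + eps)                # eps guards against division by zero

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 64) * 255.0              # e.g. 1000 tiny 8x8 images
    Xs = standardize(X)
    print(Xs.mean(axis=0).max(), Xs.std(axis=0).mean())  # ~0 and ~1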


Principal Components Analysis

PCA

Principal Components Analysis (PCA) is a dimension reduction algorithm that can be used to significantly speed up unsupervised feature learning algorithms. PCA finds a lower-dimensional subspace onto which to project our data.

Given a set of data $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$, the covariance is computed by

$$\Sigma = \frac{1}{m} X X^\top$$

Using eigendecomposition, we can obtain the eigenvectors $U$:

$$U = \begin{bmatrix} | & | & \cdots & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & \cdots & | \end{bmatrix}$$

and the corresponding eigenvalues $\Lambda = \{\lambda_1, \ldots, \lambda_n\}$. Note that $\Lambda$ is usually sorted in descending order.


PCA

The columns of $U$ are the principal bases, so we can rotate our data onto the new bases:

$$X_{\text{rot}} = U^\top X$$

Note that $X = U X_{\text{rot}}$.

For a single data point $x$, we can reduce the dimension by keeping the first $k$ rotated coordinates and zeroing the rest:

$$\tilde{x} = \begin{bmatrix} x_{\text{rot},1} \\ \vdots \\ x_{\text{rot},k} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \approx \begin{bmatrix} x_{\text{rot},1} \\ \vdots \\ x_{\text{rot},k} \\ x_{\text{rot},k+1} \\ \vdots \\ x_{\text{rot},n} \end{bmatrix} = x_{\text{rot}}$$

The resulting $\tilde{X}$ approximates the distribution of $X_{\text{rot}}$, since only the small components are dropped, and we have reduced the original data by $n - k$ dimensions.


PCA

We can recover an approximation $\hat{x}$ of the original signal from $\tilde{x}$:

$$\hat{x} = U \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_k \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
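Putting the last three slides together, a minimal numpy sketch of PCA rotation, dimension reduction, and recovery, with data stored one example per column to match the formulas above:

    import numpy as np

    rng = np.random.RandomState(0)
    n, m, k = 10, 500, 3
    X = np.dot(rng.rand(n, k), rng.rand(k, m))   # toy data with low-rank structure
    X = X - X.mean(axis=1, keepdims=True)        # mean-subtract first

    Sigma = np.dot(X, X.T) / m                   # covariance, n x n
    lam, U = np.linalg.eigh(Sigma)               # eigh returns ascending order...
    lam, U = lam[::-1], U[:, ::-1]               # ...so flip to descending

    X_rot = np.dot(U.T, X)                       # rotate onto the principal bases
    X_tilde = X_rot.copy()
    X_tilde[k:, :] = 0.0                         # keep the first k components only
    X_hat = np.dot(U, X_tilde)                   # recover the approximation

    print(np.abs(X - X_hat).max())               # tiny: X is (nearly) rank k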


PCA: Number of components to retain

To determine $k$, we usually look at the percentage of variance retained:

$$p = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j} \qquad (1)$$

One common heuristic is to choose $k$ so as to retain 99% of the variance ($p \geq 0.99$).
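With the eigenvalues sorted in descending order, the smallest such k can be found with a cumulative sum; a sketch with a toy eigenvalue array:

    import numpy as np

    lam = np.array([6.0, 3.0, 0.5, 0.45, 0.05])  # toy eigenvalues, descending
    p = np.cumsum(lam) / np.sum(lam)             # variance retained for each k
    k = int(np.searchsorted(p, 0.99) + 1)        # smallest k with p >= 0.99
    print(k)                                     # 4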


ZCA Whitening

ZCA whitening is just a recovered (rotated-back) version of PCA whitening:

$$x_{\text{ZCA}} = U\, x_{\text{PCA}} = U \frac{x_{\text{rot}}}{\sqrt{\Lambda + \epsilon}}$$

where $\epsilon$ is a regularization term and the division is applied component-wise. When $x$ takes values around $[-1, 1]$, a value of $\epsilon \approx 10^{-5}$ might be typical.
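A minimal numpy sketch of ZCA whitening on mean-subtracted data, following the same one-example-per-column layout as the PCA sketch above:

    import numpy as np

    def zca_whiten(X, eps=1e-5):
        """ZCA-whiten mean-subtracted data X (one example per column)."""
        Sigma = np.dot(X, X.T) / X.shape[1]
        lam, U = np.linalg.eigh(Sigma)
        X_rot = np.dot(U.T, X)                              # rotate
        X_pca = X_rot / np.sqrt(lam + eps)[:, np.newaxis]   # PCA-whiten
        return np.dot(U, X_pca)                             # rotate back: ZCA

    rng = np.random.RandomState(0)
    X = rng.rand(10, 500)
    X = X - X.mean(axis=1, keepdims=True)
    Xz = zca_whiten(X)
    # Covariance of the whitened data is close to the identity:
    print(np.allclose(np.dot(Xz, Xz.T) / X.shape[1], np.eye(10), atol=1e-2))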


Q&A



References

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images (Master's thesis). University of Toronto.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
