Machine Learning
Andrew Rosenberg
Lecture 3: Math Primer II
Today
• Wrap-up of probability
• Vectors, matrices
• Calculus
• Differentiation with respect to a vector
Properties of probability density functions
Sum Rule
Product Rule
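The rule equations on this slide were images in the original deck; the standard forms are:

Sum rule: p(X) = \sum_Y p(X, Y)
Product rule: p(X, Y) = p(Y \mid X)\, p(X)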
Expected Values
• Given a random variable, with a distribution p(X), what is the expected value of X?
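The defining equation was an image; the standard definition is:

E[X] = \sum_x x\, p(x) (discrete), or E[X] = \int x\, p(x)\, dx (continuous)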
Multinomial Distribution
• If a variable, x, can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution.
• The probability of x being in state k is μk
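The distribution itself was an image; in the usual 1-of-K notation (x_k = 1 for the active state, 0 otherwise):

p(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k}, with \mu_k \ge 0 and \sum_k \mu_k = 1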
Expected Value of a Multinomial
• The expected value is the vector of means.
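In symbols:

E[x \mid \mu] = (\mu_1, \ldots, \mu_K)^T = \mu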
Gaussian Distribution
• One Dimension
• D-Dimensions
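The density formulas were images; the standard forms are:

One dimension: N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
D dimensions: N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)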
Gaussians
How machine learning uses statistical modeling
• Expectation
– The expected value of a function is the hypothesis.
• Variance
– The variance is the confidence in that hypothesis.
Variance
• The variance of a random variable describes how much variability there is around the expected value.
• It is calculated as the expected squared error.
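In symbols:

Var[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2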
Covariance
• The covariance of two random variables expresses how they vary together.
• If two variables are independent, their covariance equals zero.
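In symbols:

Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]

Note that the converse does not hold: zero covariance does not imply independence.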
Linear Algebra
• Vectors
– A one-dimensional array.
– If not specified, assume x is a column vector.
• Matrices
– A higher-dimensional array: n rows by m columns.
– Typically denoted with capital letters.
Transposition
• Transposing a matrix swaps columns and rows.
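In index notation: (A^T)_{ij} = A_{ji}; an n-by-m matrix becomes m-by-n.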
Addition
• Matrices can be added together iff they have the same dimensions.
– A and B are both n-by-m matrices.
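Elementwise: (A + B)_{ij} = A_{ij} + B_{ij}.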
Multiplication
• To multiply two matrices, the inner dimensions must be the same.
– An n-by-m matrix can be multiplied by an m-by-k matrix.
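Each entry of the product is a dot product of a row of A with a column of B, and the result is n-by-k:

(AB)_{ij} = \sum_{l=1}^{m} A_{il} B_{lj}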
Inversion
• The inverse of a square (n-by-n) matrix A is denoted A^{-1}, and has the following property:

A A^{-1} = A^{-1} A = I

• Here I, the identity matrix, is the n-by-n matrix with ones along the diagonal.
– I_{ij} = 1 if i = j, and 0 otherwise
Identity Matrix
• Matrices are invariant under multiplication by the identity matrix.
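That is, AI = IA = A.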
Helpful matrix inversion properties
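The identities listed on this slide were images in the original; the standard ones are:

(A^{-1})^{-1} = A
(AB)^{-1} = B^{-1} A^{-1}
(A^T)^{-1} = (A^{-1})^T

A minimal NumPy sketch (not part of the original slides) to check the last two numerically; random Gaussian matrices are invertible with probability 1:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))  # random square matrix
B = rng.standard_normal((3, 3))  # random square matrix

# (AB)^{-1} equals B^{-1} A^{-1}: note the reversed order
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# (A^T)^{-1} equals (A^{-1})^T: inverse and transpose commute
assert np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T)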
Norm
• The norm of a vector, x, represents the Euclidean length of the vector.
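In symbols: \|x\| = \sqrt{x^T x} = \sqrt{\sum_i x_i^2}.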
Positive Definiteness
• Quadratic form
– Scalar
– Vector
• Positive Definite matrix M
• Positive Semi-definite
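The defining equations were images in the original; the standard definitions: a scalar quadratic form is f(x) = a x^2, and its vector analogue is f(x) = x^T M x. A matrix M is positive definite if x^T M x > 0 for all x \ne 0, and positive semi-definite if x^T M x \ge 0 for all x.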
Calculus
• Derivatives and integrals
• Optimization
Derivatives
• The derivative of a function gives the slope of the function at a point x.
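The usual limit definition:

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}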
Derivative Example
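The worked example was an image; a representative one: for f(x) = x^2, f'(x) = 2x, so the slope at x = 3 is 6.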
Integrals
• Integration is the inverse operation of differentiation (up to a constant)
• Graphically, an integral can be considered the area under the curve defined by f(x)
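In symbols: if F'(x) = f(x), then \int f(x)\, dx = F(x) + C, and \int_a^b f(x)\, dx is the area under f between a and b.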
Integration Example
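The worked example was an image; a representative one: \int x^2\, dx = x^3/3 + C, so \int_0^2 x^2\, dx = 8/3.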
Vector Calculus
• Differentiation with respect to a matrix or vector
• Gradient
• Change of variables with a vector
Derivative w.r.t. a vector
• Given a vector x, and a function f(x), how can we find f’(x)?
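Differentiating with respect to each component gives the vector of partial derivatives:

\frac{\partial f}{\partial x} = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T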
Example Derivation
Also referred to as the gradient of a function.
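The derivation itself was an image; a representative example: for f(x) = x^T x, the gradient is \nabla f(x) = 2x.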
Useful Vector Calculus identities
• Scalar Multiplication
• Product Rule
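The identities were images in the original; standard forms matching these names (using the denominator-layout convention) are:

Scalar multiplication: \frac{\partial}{\partial x}\left(c\, f(x)\right) = c\, \frac{\partial f}{\partial x}
Product rule: \frac{\partial}{\partial x}\left(u(x)^T v(x)\right) = \frac{\partial u}{\partial x} v + \frac{\partial v}{\partial x} u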
Useful Vector Calculus identities
• Derivative of an inverse
• Change of Variable
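Standard forms for these (the slide's equations were images):

Derivative of an inverse: \frac{\partial A^{-1}}{\partial x} = -A^{-1} \frac{\partial A}{\partial x} A^{-1}
Change of variable: for an invertible mapping y = g(x), p_y(y) = p_x(g^{-1}(y)) \left| \det \frac{\partial g^{-1}(y)}{\partial y} \right|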
Optimization
• We have an objective function, f(x), that we'd like to maximize or minimize.
• Set the first derivative to zero.
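A quick illustrative example (not from the original slide): to minimize f(x) = (x - 3)^2, set f'(x) = 2(x - 3) = 0, which gives x = 3.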
Optimization with constraints
• What if we want to constrain the parameters of the model?
– For example: the mean is less than 10.
• Find the best likelihood, subject to a constraint.
• Two functions:
– An objective function to maximize
– An inequality that must be satisfied
Lagrange Multipliers
• Find maxima of f(x,y) subject to a constraint.
General form
• Maximizing:
• Subject to:
• Introduce a new variable, and find a maximum.
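The equations were images in the original; the standard setup: maximize f(x) subject to g(x) = 0, introduce the multiplier \lambda, and define the Lagrangian

\Lambda(x, \lambda) = f(x) + \lambda\, g(x)

Setting \nabla \Lambda = 0 recovers both the constraint g(x) = 0 and the optimality condition \nabla f = -\lambda \nabla g.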
Example
• Maximizing:
• Subject to:
• Introduce a new variable, and find a maximum.
Example
We now have three equations in three unknowns.
Example
• Eliminate λ, then substitute and solve.
Why does Machine Learning need these tools?
• Calculus
– We need to identify the maximum likelihood or minimum risk: optimization.
– Integration allows the marginalization of continuous probability density functions.
• Linear Algebra
– Many features lead to high-dimensional spaces.
– Vectors and matrices allow us to compactly describe and manipulate high-dimensional feature spaces.
Why does Machine Learning need these tools?
• Vector Calculus
– All of the optimization needs to be performed in high-dimensional spaces.
– Optimization of multiple variables simultaneously: gradient descent.
– We want to take marginals over high-dimensional distributions like Gaussians.
Next Time
• Linear Regression
– Then regularization