Notes from Andrew Ng’s∗ Machine Learning Course

Travis Johnson†

2012-06-14

1 Introduction

1.1 Welcome

Machine learning:

1. Grew out of work in AI
2. Is a new capability for computers

Examples of Machine Learning

1. Database mining, since we have large datasets from growth of automation/the web (web click data, medical records, biology, engineering)

2. Applications that can't be programmed manually: autonomous helicopters, handwriting recognition, most of natural language processing, computer vision

3. Self-customizing programs (Amazon, Netflix, etc.)
4. Understanding human learning: the brain, real AI.

1.2 What is machine learning?

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.

More formally (from Tom Mitchell), a well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam. What is the task T in this setting? Classifying emails as spam or not spam is T. Watching you label emails as spam or not spam is E, and the fraction of correctly labeled emails is P.

The main classes of machine learning algorithms are supervised learning and unsupervised learning, plus reinforcement learning and recommender systems. We'll also talk about practical usage of these algorithms.

1.3 Supervised Learning

Let's say you want to predict housing prices. Given the plot (6/14/2012, plot 1743), we could put a straight line through it, or a quadratic, or something. This is supervised learning since we gave the algorithm the "right answers", which in this case were the house prices. This is a regression problem, since we predict a continuous-valued output, the price.

Another example: breast cancer. Maybe we want to predict malignant vs. benign for various tumor sizes (plot 1746). This is a CLASSIFICATION problem, since we try to predict zero or one. We are not limited to 2 classes; we could have multiclass as well.

We can also extend this to multiple variables: maybe age and tumor size. We might even have tons more! Maybe even infinitely many variables?

1.4 Unsupervised Learning

Here, our datasets do not include the labels! So, we just look for structure in the data. We might just try to cluster the data: this is how Google News works, by clustering news articles together.

∗All notes from https://www.coursera.org/course/ml
†Department of Engineering Science and Applied Mathematics, Northwestern University, Evanston, IL, USA.

Some other applications:

1. Organizing computing clusters
2. Social network analysis
3. Market segmentation
4. Astronomical data analysis

Cocktail party problem: put two microphones in a room with two speakers talking at the same time. A cocktail party algorithm will split the voices apart. It can be done with one line of code (Listing 1).

Listing 1: Cocktail Party Algorithm

[W, s, v] = svd((repmat(sum(x .* x, 1), size(x, 1), 1) .* x) * x');

We'll use Octave, which is basically the same as MATLAB. Typically we prototype in Octave and then implement things properly in another language.

2 Linear Regression in One Variable

2.1 Model Representation

We'll use the Portland, OR training set: price (in 1000s of dollars) vs. size (in square feet). This is a supervised learning example because we have the "right answer", and also an example of regression.

More formally, we have a dataset:

1. m - number of training examples (47 in the Portland, OR example)
2. x's - "input" variables/features
3. y's - "output" variable/target variable.

We will use (x, y) to denote one training example, and (x^(i), y^(i)) to denote the ith training example.

The hypothesis is a function h that maps from x's to y's. The first thing we'll need to decide: how do we represent h? For this problem,

h_θ(x) = θ_0 + θ_1x  (1)

or h(x) in shorthand. This function fits a straight line to the data points, and the model is known as linear regression with one variable, or univariate linear regression.

2.2 Cost Function

We call the θ_i the parameters. We'll need to determine θ_0 and θ_1. We'd like to come up with values θ_i that correspond to a good fit, so we'll pick θ_i such that h_θ(x) is "close" to y for our training examples (x, y). We can formulate this as a minimization problem:

min_{θ_0,θ_1} (1/2m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2  (2)

where m is the number of training examples and 2m is just a normalization factor to make the math easier.


Typically we package this as the cost function J(θ_0, θ_1), defined as

J(θ_0, θ_1) = (1/2m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2  (3)

and then solve the minimization problem

min_{θ_0,θ_1} J(θ_0, θ_1).  (4)

2.3 Cost function Intuition

Consider a simplified hypothesis h_θ(x) = θ_1x (i.e., θ_0 = 0). For fixed θ_1, h_θ(x) is a function of x, while J(θ_1) is a function of the parameter θ_1.

We can plot J(θ_1), which gives (plot 1829), derived by plotting the error J(θ_1) for a bunch of different values of θ_1. We want to use the θ_1 that corresponds to the minimum error, so in this case we pick θ_1 = 1.

2.4 Cost function Intuition II

If we instead consider the 2d version, we need to consider both parameters. We can do this with a contour plot (see plot 1833). This isn't very important, though: we can't really use plots to find the optimum in higher dimensions; we need an algorithm.

2.5 Gradient Descent

Gradient descent is a super sweet algorithm for finding the minimum of our cost function J. Our problem is

min_{θ_0,θ_1} J(θ_0, θ_1)  (5)

where we can initialize θ_0 = 0 and θ_1 = 0 and keep changing the θ_i to make J lower.

The intuition is that we look locally around the point we're at and take a tiny step in a descent direction. One important consequence is that it is possible to find local minima that are not global minima, depending on which point we start at!

Formalizing this a bit: we repeat until convergence

θ_j := θ_j − α (∂/∂θ_j) J(θ_0, θ_1)  (for j = 0 and j = 1).  (6)

The parameter α is called the learning rate; we'll say more about it later. It is important that this update be simultaneous for all θ_j.

2.6 Gradient Descent Intuition

We call

(∂/∂θ_j) J(θ_0, θ_1)  (7)

the derivative term. Suppose we just wanted to solve

min_{θ_1} J(θ_1)  (8)

Then gradient descent will give

θ_1 := θ_1 − α (d/dθ_1) J(θ_1)  (9)

The derivative is the slope of the tangent to the function, which means that if the cost function J(θ_1) has a positive (negative) slope at the point we're evaluating, then the update will decrease (increase) θ_1, as we'd expect: in both cases, we move toward the minimum.

Now consider α: if α is too small, then gradient descent will be too slow. On the other hand, if α is too large, then gradient descent can overshoot the minimum, and may even fail to converge, or diverge.

What happens if we initialize gradient descent AT a local minimum? Well, the derivative is zero there, so the update won't move the iterate! Relatedly, gradient descent automatically takes smaller steps as the slope flattens near a minimum, so we have no need to decrease α over time.

2.7 Gradient Descent for Linear Regression

Now we need to apply gradient descent to linear regression, where we had h_θ(x) = θ_0 + θ_1x and the associated cost function J(θ_0, θ_1) = (1/2m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2. We'd like to solve the problem

min_{θ_0,θ_1} J(θ_0, θ_1)  (10)

via the gradient descent algorithm, where we repeat until convergence

θ_j := θ_j − α (∂/∂θ_j) J(θ_0, θ_1)  for j = 0, j = 1.  (11)

The key to applying gradient descent is calculating the derivative term:

(∂/∂θ_j) J(θ_0, θ_1) = (∂/∂θ_j) (1/2m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2  (12)
                     = (∂/∂θ_j) (1/2m) ∑_{i=1}^m (θ_0 + θ_1x^(i) − y^(i))^2  (13)
                     = (1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) · {1 if j = 0; x^(i) if j = 1}.  (14)

Then the gradient descent algorithm is to repeat until convergence

(θ_0, θ_1) := (θ_0, θ_1) − α (1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) · (1, x^(i)).  (15)

It turns out that the cost function used with linear regression is always convex, so gradient descent will always converge to the global optimum.

This is also known as batch gradient descent; the name comes from the fact that each step of gradient descent uses all the training examples.

This will scale better than the linear algebraic answers.
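As a concrete illustration (not from the lecture), here is a minimal Octave sketch of batch gradient descent for univariate linear regression; the data vectors x and y and the settings alpha and numIters are assumed given:

theta = zeros(2, 1);                               % [theta0; theta1]
m = length(y);
for iter = 1:numIters
  pred = theta(1) + theta(2) * x;                  % h_theta(x) on all m examples at once
  temp0 = theta(1) - alpha * (1/m) * sum(pred - y);
  temp1 = theta(2) - alpha * (1/m) * sum((pred - y) .* x);
  theta = [temp0; temp1];                          % simultaneous update, per eq. (15)
end

The temporaries temp0 and temp1 make the update simultaneous: both partial derivatives are evaluated at the old θ before either parameter changes.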

2.8 What’s next?

Two extensions:

1. We can solve for θ_0 and θ_1 directly, without needing an iterative algorithm.
2. We might also use a bunch more features! In this case, we introduce x_1, x_2, x_3, and x_4, with y still being the "target". We won't be able to plot these, though. We use feature matrices X, where each row is a training data point and each column is a feature, and y is the target vector.


3 Linear Algebra Review

3.1 Matrices and Vectors

A matrix is a rectangular array of numbers:

[1402  191
 1371  821
  949 1437
  147 1448]

or

[1 2 3
 4 5 6]  (16)

The dimension of a matrix is the number of rows × the number of columns. The first matrix above is in R^{4×2}, for example. We write A_{ij} for the i, j entry of the matrix, which refers to the ith row and the jth column.

A vector is a special matrix: an n×1 matrix. We might say a vector is an element of R^n, and y_i is the ith row of a vector. Careful here! Some people use 0-indexing; we (and Octave/MATLAB) use 1-indexing.

Usually, people use upper case to refer to matrices and lower case to refer to vectors and scalars.

3.2 Addition and Scalar Multiplication

Matrices are added entrywise: (A + B)_{ij} = A_{ij} + B_{ij}. We can only add matrices of identical sizes. Scalar multiplication is defined as (αA)_{ij} = α·A_{ij}. We can take combinations of these operations, following the typical order of operations.

3.3 Matrix-Vector Multiplication

To multiply an m×n matrix A by an n×1 vector x (resulting in an m×1 vector y), we have

y_i = ∑_{j=1}^n A_{ij} x_j.  (17)

We can evaluate a hypothesis (say h(x) = −40 + 0.25x_1) on a whole data vector x = (2104, 1416, 1534, 852)^T at once with a matrix-vector multiplication, by introducing the matrix of the data with a leading column of ones:

[1 2104
 1 1416
 1 1534
 1  852]  ×  [−40; 0.25]  (18)

which says that the predictions are just the data matrix times the parameters θ! Neat!
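In Octave this one-liner might look like the following sketch, with the sizes and θ from the example above:

sizes = [2104; 1416; 1534; 852];
X = [ones(4, 1), sizes];     % prepend the column of ones
theta = [-40; 0.25];
predictions = X * theta      % one matrix-vector product gives all four predictions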

3.4 Matrix-Matrix Multiplication

Similarly, we can define the product of an m×n matrix A and an n×o matrix B as the m×o matrix C = A×B with

C_{ik} = (A×B)_{ik} = ∑_j A_{ij} B_{jk}.  (19)

We can apply multiple competing hypotheses at once by matrix-matrix multiplication, where the data matrix is as in the last section and each column of the hypothesis matrix holds one set of parameters.

USE LINEAR ALGEBRA LIBRARIES for this!

3.5 Matrix Multiplication Properties

We should be careful with matrix multiplication, though! There are several sticking points:

1. If A, B are matrices, then in general A×B ≠ B×A. (Matrix multiplication is not commutative!)
2. Matrix multiplication is associative: A×B×C = (A×B)×C = A×(B×C). But be careful about the order of the factors!
3. Special matrix: I is the identity matrix, and for any matrix A, A·I = I·A = A. (This only makes sense when I is the appropriately-sized square matrix.)

3.6 Inverse and Transpose

Some matrices have inverses! If A is an n×n matrix and it has an inverse, then A(A^{−1}) = A^{−1}A = I. (Only square matrices can have inverses.) Matrices without inverses are called "singular" or "degenerate".

Listing 2: Matrix Inverses in Octave

A = [3, 4; 2, 16];
inverseOfA = pinv(A);
A * inverseOfA   % = I, up to round-off
inverseOfA * A   % = I, up to round-off

We can also define the transpose of a matrix, where (A^T)_{ij} = A_{ji}. Intuitively, it "flips" the elements of the matrix.

(Many numerical examples were excluded from these notes.)

4 Linear Regression with Multiple Variables

4.1 Multiple Features

Suppose that in addition to size, we also had the number of bedrooms, number of floors, and age of the home, and still want to predict the price. Then introduce the notation

1. m - number of training examples
2. n - number of features
3. x^(i) - input (features) of the ith training example
4. x^(i)_j - value of feature j in the ith training example

Previously, we had h_θ(x) = θ_0 + θ_1x. We'll instead need to make this

h_θ(x) = θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + θ_4x_4.  (20)

For convenience, we define x_0 = 1, and then take

h_θ(x) = θ^T x,  (21)

where x, θ ∈ R^{n+1}. This is called multivariate linear regression.


4.2 Gradient Descent for Multiple Variables

We'll start thinking of θ as an (n+1)-dimensional vector, so that we can write

J(θ) = (1/2m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2  (22)
     = (1/2m) ∑_{i=1}^m (θ^T x^(i) − y^(i))^2  (23)
     = (1/2m) ∑_{i=1}^m (∑_{j=0}^n θ_j x^(i)_j − y^(i))^2  (24)

Again, gradient descent has the form: repeat

θ_j := θ_j − α (∂/∂θ_j) J(θ) = θ_j − α (1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)_j  (27)

where we do simultaneous updates of θ_j for j = 0, ..., n.
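A vectorized Octave sketch of this update (assuming a design matrix X that already includes the x_0 = 1 column, a target vector y, a current parameter vector theta, and a step size alpha):

m = size(X, 1);
grad = (1/m) * X' * (X * theta - y);   % all n+1 partial derivatives at once
theta = theta - alpha * grad;          % simultaneous update of every theta_j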

4.3 Gradient Descent in Practice I - FeatureScaling

Gradient descent will converge more quickly if we scale variables so that everything is on a similar scale. For example, if x_1 is in the range 0-2000 and x_2 is in the range 1-5, we will have poor performance due to oscillation. So we scale so that the steps can be much larger. For example, we might define:

1. x_1 = size (feet^2) / 2000
2. x_2 = (number of bedrooms) / 5.

More generally, we hope to get every feature into approximately −1 ≤ x_i ≤ 1. (We only care about scaling to about order 1.) We might also do mean normalization: there we replace x_i by x_i − μ_i, or by (x_i − μ_i)/s_i.

This will make gradient descent run much faster in practice.
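A small Octave sketch of mean normalization (X is an m×n matrix of raw features; mu and s come out as row vectors):

mu = mean(X);                                   % per-feature mean
s = std(X);                                     % per-feature spread (std; the range also works)
Xnorm = (X - repmat(mu, size(X, 1), 1)) ./ repmat(s, size(X, 1), 1);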

4.4 Gradient Descent in Practice II - LearningRate

We will need to make sure that gradient descent is working properly, at least in terms of the α value.

One key idea: plot J(θ) vs. the number of iterations. This should very clearly be a decreasing function; in fact, it should decrease after every iteration. We can also tell from this figure when the iteration has converged.

We can also use this idea for an automatic convergence test: we might say that if J(θ) decreases by less than 10^{-3} in an iteration, then we are probably done.

If we see an increasing or even nondecreasing function, we might need to pick a smaller α. It is possible to prove that for sufficiently small α, J(θ) will decrease on every iteration; but we also want to be careful that the algorithm is not converging too slowly.

Practically, just run several values of α, like 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ..., and make sure we find one that is too slow and one that is too large. Then we know we have found a good learning rate in between.
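To make that plot, one might record J(θ) at every iteration, as in this sketch (computeCost is a hypothetical helper that evaluates eq. (22)):

J_history = zeros(numIters, 1);
for iter = 1:numIters
  theta = theta - (alpha/m) * X' * (X * theta - y);
  J_history(iter) = computeCost(X, y, theta);   % should decrease every iteration
end
plot(1:numIters, J_history);
xlabel('iteration'); ylabel('J(theta)');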

4.5 Features and Polynomial Regression

We might want to model housing prices via frontage and depth. With linear regression, we can define new variables: one example is to take x as the product of frontage and depth. So by defining new features, we might get a better fit for our data. For example, we might pick a quadratic or cubic model. We can actually do this incredibly easily within the framework we've created, just by changing our h_θ(x)! One easy way to do this for a cubic model is to create new features where x_1 is the size of the house, x_2 is the size squared, and x_3 is the size cubed. But then scaling gets REALLY important!

We can also try totally crazy things, like h_θ(x) = θ_0 + θ_1(size) + θ_2·√(size).
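For instance, a cubic model only requires building the new feature columns before running the same linear regression code (houseSize is assumed to be an m×1 vector of sizes; scale the columns afterwards as in Section 4.3):

X = [ones(m, 1), houseSize, houseSize.^2, houseSize.^3];   % x1, x2 = x1^2, x3 = x1^3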

4.6 Normal Equation

Gradient descent might be subpar here, since we can use the normal equation method to solve for θ analytically.

The intuition: if we had something like J(θ) = aθ^2 + bθ + c and we want to minimize it, then we should take the derivative, set it equal to zero, and solve for θ:

(d/dθ) J(θ) = ... = 0  (28)

In the case of multiple dimensions, we get many equations. Suppose that θ ∈ R^{n+1}, and that

J(θ_0, θ_1, ..., θ_n) = (1/2m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2.  (29)

We can still take the derivative (except that it's now the gradient) and set it equal to zero:

(∂/∂θ_j) J(θ) = ... = 0  for every j  (30)

and now solve for θ_0, θ_1, ..., θ_n. To start, we create the data matrix:

X = [1 2104 5 1 45
     1 1416 3 2 40
     1 1534 3 2 30
     1  852 2 1 36]      y = [460; 232; 315; 178]  (31)

Then we can find that the analytic solution ends up being

θ = (X^T X)^{−1} X^T y.  (32)

In the general case, consider that we have m examples (x^(i), y^(i)), each of which has n features. Since

x^(i) = (x^(i)_0, x^(i)_1, ..., x^(i)_n)^T  (33)

we create a design matrix X whose rows are the transposed examples:

X = [(x^(1))^T
     (x^(2))^T
     ...
     (x^(m))^T]  (34)


Gradient descent                    Normal equations
need to choose α                    no need to choose α
needs many iterations!              no need to iterate!
works well even when n is large     must compute (X^T X)^{−1}: slow if n is large!

Table 1: Tradeoffs between gradient descent and normal equations.

which is an m × (n+1) matrix. As always, y is the vector of "target" values (sale prices in our example). Then:

θ = (X^T X)^{−1} X^T y  (35)

We could implement this (crudely!) in Octave:

Listing 3: Octave Normal Equations

pinv(X' * X) * X' * y

We talked earlier about feature scaling: it is not necessary for the normal equation method!

A quick, hard, 2012 rule of thumb: we would probably use normal equations up to 1000-10000 or so variables, and then switch to gradient descent.

4.7 Normal Equation Noninvertibility

One important note: the matrix in the normal equation can be noninvertible! What if X^T X is non-invertible (singular/degenerate)? When will this happen?

1. Redundant features (linearly dependent features)!
2. Too many features (i.e., m ≤ n: trying to fit 100 parameters with 10 data points...). Here we delete some features or use regularization.

5 Octave Tutorial

5.1 Basic Operations

Can use basic operations: +, −, ∗, /, and ^. Comparisons are typically ==, ~=, &&, || or xor. Can change the prompt with PS1('>> '). Semicolons suppress output. Can use the disp command to display things, or use sprintf. Can also use 1:0.1:2 range notation, or the commands ones, zeros, rand (uniform from 0 to 1), randn (normally distributed), or eye (identity matrix). All of these can make an arbitrary size.

5.2 Moving Data Around

How do you save and load data in Octave? Use the commands save (maybe with '-ascii'), load, size, length, pwd, who, whos, and clear.

Might use subset notation: A(3, 2) returns the 3, 2-element of the matrix, A(2, :) returns every element along the second row, and A([1 3], :) returns every column of the first and third rows. You can use this notation for assignments or accesses.

Can also put all entries into a single vector with A(:), or concatenate matrices by C = [A B] (horizontal) or C = [A; B] (vertical).

5.3 Computing on Data

Some important computational operations, as shown in Listing (4): initializing matrices, multiplication, elementwise operations, reductions, and more.

Listing 4: Computing on Data

A = [1 2; 3 4; 5 6];
B = [11 12; 13 14; 15 16];
C = [1 1; 2 2];
A * C
A .* B
A .^ 2
v = [1; 2; 3];
1 ./ v
1 ./ A
log(v)
exp(v)
abs(v)
-v              % same as -1*v
v + ones(length(v), 1)
v + 1           % simpler way
A'              % transpose
(A')'
a = [1 14 2 0.5]
val = max(a)
[val, ind] = max(a)
a < 3           % boolean vec
find(a < 3)     % indices
A = magic(3)
[r, c] = find(A >= 7)
help find
sum(a)
prod(a)
floor(a)
ceil(a)
rand(3)
max(rand(3), rand(3))
max(A, [], 1)   % along first dimension
max(A, [], 2)   % along second dimension
max(max(A))
max(A(:))
A = magic(9)
sum(A, 1)
sum(A, 2)
sum(sum(A .* eye(9)))
sum(sum(A .* flipud(eye(9))))
A = magic(3)
Ainv = pinv(A)
A * Ainv        % identity

5.4 Plotting Data

Plotting the cost function can help ensure that the learning algorithm is converging.

5.5 Control Statements: for, while, and if statements

5.6 Vectorization

Most languages have numerical linear algebra libraries, which are much more efficient than writing your own routine for (e.g.) matrix multiplication.


Listing 5: Plotting

t = [0:0.01:0.98];
y1 = sin(2*pi*4*t);
y2 = cos(2*pi*4*t);
plot(t, y1);
plot(t, y2);
plot(t, y1); hold on
plot(t, y2);
xlabel('time');
ylabel('value');
legend('sin', 'cos');
title('my plot');
print -dpng 'myPlot.png';
help plot
close
figure(1); plot(t, y1);
figure(2); plot(t, y2);
subplot(1, 2, 1);   % divides plot into 1x2 grid, access first element
plot(t, y1);
subplot(1, 2, 2);
plot(t, y2);
axis([0.5 1 -1 1]);
clf
A = magic(5);
imagesc(A);
imagesc(A), colorbar, colormap gray

Listing 6: for loops

v = zeros(10, 1);
for i = 1:10
  v(i) = 2^i;
end

Listing 7: while loops

i = 1;
while i <= 5,
  v(i) = 100;
  i = i + 1;
end;

Consider

h_θ(x) = ∑_{j=0}^n θ_j x_j = θ^T x  (36)

We can implement this in two ways, shown in Listing (13). What would this look like in C++? See Listing (14). Let's consider a more sophisticated example. If we have

θ_j := θ_j − α (1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)_j  (37)

we can implement it as the single vectorized update θ := θ − αδ, with δ the vector of partial derivatives.

5.7 Working on and Submitting Programming Exercises

The homework submission system here is pretty slick. Justrun the submit script.

Listing 8: while with break

i = 1;
while true,
  v(i) = 999;
  i = i + 1;
  if i == 6,
    break;
  end;
end;

Listing 9: if-elseif-else

if v(1) == 1,
  disp('the value is one');
elseif v(1) == 2,
  disp('the value is two');
else
  disp('the value is not one or two');
end

Listing 10: defining a function

function y = squareThisNumber(x)
  y = x^2;

Listing 11: multiple return values

function [y1, y2] = squareAndCubeThisNumber(x)
  y1 = x^2;
  y2 = x^3;

Listing 12: linear regression cost function

function J = costFunctionJ(X, y, theta)
  m = size(X, 1);
  predictions = X * theta;            % m x 1 vector of predictions
  sqrErrors = (predictions - y) .^ 2;
  J = 1/(2*m) * sum(sqrErrors);

Listing 13: vectorized

%% unvectorized
prediction = 0.0;
for j = 1:n+1,
  prediction = prediction + theta(j) * x(j);
end
%% vectorized
prediction = theta' * x;

Listing 14: the same in C++

// unvectorized
double prediction = 0.0;
for (int j = 0; j <= n; j++)
  prediction += theta[j] * x[j];
// vectorized
double prediction = theta.transpose() * x;


6 Logistic Regression

6.1 Classification

We'll develop logistic regression. Some examples of classification:

1. Email: spam/not spam
2. Online transactions: fraudulent (yes/no)
3. Tumor: malignant/benign.

We say here that

y ∈ {0, 1}  (38)

where 0 is the negative class and 1 is the positive class (e.g., benign and malignant, respectively).

Can also consider multi-class algorithms; those come later for us. Consider the malignant example. One idea: threshold the classifier output h_θ(x) at 0.5, so if h_θ(x) ≥ 0.5, predict y = 1, and otherwise predict y = 0. Linear regression is kinda silly here, because it generates pretty bad predictions for fairly simple cases. Applying linear regression to classification is a bad idea in general.

Here's another funny thing: the classification wants labels y = 0 and y = 1, but h_θ(x) can output arbitrarily positive/negative things. So it seems like linear regression is a really bad idea.

So we'll develop logistic regression, which satisfies 0 ≤ h_θ(x) ≤ 1. Don't be confused by the "regression" title: it's just a name, and it is really classification!

6.2 Hypothesis Representation

Earlier we said that we wanted 0 ≤ h_θ(x) ≤ 1. Now we define

h_θ(x) = g(θ^T x),  where  g(z) = 1 / (1 + e^{−z})  (39)

which is the logistic function (a sigmoid function). See plot (2118).

We'll take h_θ(x) to be the estimated probability that y = 1 on input x. That is, h_θ(x) = p(y = 1|x; θ): "the probability that y = 1, given x, parameterized by θ". By basic probability laws, P(y = 0|x; θ) + P(y = 1|x; θ) = 1, so we can just subtract from 1 to get the other probability.
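A one-line Octave sketch of g, written elementwise so it also works on vectors and matrices:

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % elementwise: z may be a scalar, vector, or matrix
end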

6.3 Decision boundary

We decided that h_θ(x) = p(y = 1|x; θ), so we can kind of imagine predicting

1. y = 1 if h_θ(x) ≥ 0.5
2. y = 0 if h_θ(x) < 0.5

Clearly g(z) ≥ 0.5 when z ≥ 0, so h_θ(x) = g(θ^T x) ≥ 0.5 whenever θ^T x ≥ 0. By a similar argument, we predict y = 0 when θ^T x < 0.

(Plot 2123.) If we have a nice enough classification problem, we might be able to easily come up with θ. Suppose it ends up being θ_0 = −3, θ_1 = 1, θ_2 = 1 in

h_θ(x) = g(θ_0 + θ_1x_1 + θ_2x_2),  (40)

so we predict y = 1 if −3 + x_1 + x_2 ≥ 0, or equivalently x_1 + x_2 ≥ 3, and predict y = 0 if x_1 + x_2 < 3. The line where x_1 + x_2 = 3 is therefore the decision boundary for this problem.

It's clear that the decision boundary is determined by the hypothesis and NOT by the training set.

We can also have nonlinear decision boundaries! (Plot 2127.) We can introduce

h_θ(x) = g(θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_1^2 + θ_4x_2^2)  (41)

With θ = (−1, 0, 0, 1, 1)^T, we would predict y = 1 if −1 + x_1^2 + x_2^2 ≥ 0, and therefore we predict y = 1 if x_1^2 + x_2^2 ≥ 1, so the circle x_1^2 + x_2^2 = 1 is the decision boundary.

Once again, the decision boundary is a property of the hypothesis, NOT of the dataset! The dataset is used to find the hypothesis, but the decision boundary itself depends only on the hypothesis.

We can come up with crazy nonlinear decision boundaries to fit pretty much anything.

6.4 Cost Function

Okay: given a training set (x^(i), y^(i)), i ∈ {1, 2, ..., m}, of m examples with x ∈ R^{n+1}, x_0 = 1, and y ∈ {0, 1}, and the hypothesis function

h_θ(x) = 1 / (1 + exp(−θ^T x)),  (42)

how would we choose the parameters θ? Now introduce the cost function on one example,

Cost(h_θ(x), y) = (1/2)(h_θ(x) − y)^2  (43)

and then

J(θ) = (1/m) ∑_{i=1}^m Cost(h_θ(x^(i)), y^(i)).  (44)

But this is then a nonconvex function! The sigmoid ends up messing us up. So we'll instead use the cost function

Cost(h_θ(x), y) = { −log(h_θ(x))       if y = 1
                    −log(1 − h_θ(x))   if y = 0.  (45)

Then we have a convex function again! Hooray! This has some intuitive happiness: the cost is zero if y = 1 and h_θ(x) = 1, but as h_θ(x) → 0, Cost → ∞. This captures the intuition that when y = 1 the hypothesis should pay a huge cost for confidently being wrong, which is important in certain applications!

Now plot the function for y = 0: it is −log(1 − h_θ(x)). This cost function blows up as h_θ(x) → 1.

6.5 Simplified Cost Function and Gradient Descent

There is a simpler way of writing the cost function; by the end, we'll have a full gradient descent implementation. Because either y = 0 or y = 1, we can remove the cases by writing it as one equation. In particular,

Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))  (46)

This reduces to the previous cases when y = 0 or y = 1, so our equation is now much simpler: no cases!

Then we can write the logistic regression cost function:

J(θ) = (1/m) ∑_{i=1}^m Cost(h_θ(x^(i)), y^(i))  (47)
     = −(1/m) ∑_{i=1}^m [y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i)))]  (48)

(This can be derived from the principle of maximum likelihood estimation, and it is convex!)


Then to fit the parameters θ, we just solve

min_θ J(θ)  (49)

and to make a prediction given a new x, we output

h_θ(x) = 1 / (1 + exp(−θ^T x)).  (50)

We can think of this output as the probability p(y = 1|x; θ).

Gradient descent keeps working as usual: we repeat

θ_j := θ_j − α (∂/∂θ_j) J(θ)  (51)

and we find again that

θ_j := θ_j − α (1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)_j  (52)

where we simultaneously update all θ_j. This algorithm looks identical to linear regression! What's changed? Well, h_θ(x) is different.

We want to apply the same convergence-checking rule for logistic regression. Ideally we want a vectorized implementation. We would probably use feature scaling again.

6.6 Advanced Optimization

Here's an alternative view: given θ, we have code that can compute J(θ) and (∂/∂θ_j) J(θ) for j = 0, 1, ..., n.

Other algorithms that consume exactly this: gradient descent, conjugate gradient, BFGS, and L-BFGS. Some advantages of the latter three:

1. No need to manually pick α.
2. Often faster than gradient descent.

One huge disadvantage: they are more complex. Ng's advice: it is possible to use these algorithms without really understanding them; we can implement them using library calls.

One example: suppose that θ = (θ_1, θ_2)^T with J(θ) = (θ_1 − 5)^2 + (θ_2 − 5)^2, and then

(∂/∂θ_1) J(θ) = 2(θ_1 − 5),   (∂/∂θ_2) J(θ) = 2(θ_2 − 5).  (53)

Then we can write the code as shown in Listings (15) and (16).

Listing 15: Cost Function

function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + ...
         (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);

We'll need to change this code for it to work with logistic regression (or even just linear regression!), but that will just be a similar costFunction; see the sketch after Listing 16.

Listing 16: Minimization

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, funcVal, eFlag] ...
    = fminunc(@costFunction, initialTheta, options)
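For logistic regression, that costFunction might look like this sketch (unregularized, vectorized, using the sigmoid function from Section 6.2; the training data X and y are baked in via an anonymous function):

function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);                                   % m x 1 vector of h_theta(x^(i))
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));    % eq. (48)
  gradient = (1/m) * X' * (h - y);                          % all partials of eq. (52) at once
end

% usage: fminunc(@(t) costFunction(t, X, y), initialTheta, options)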

6.7 Multiclass Classification: One-vs-all

Some multiclass classification examples:

1. Email foldering/tagging: work, friends, family, hobby
2. Medical diagnosis: not ill, cold, flu
3. Weather: sunny, cloudy, rain, snow.

For binary classification, we can hopefully separate the classes with a hyperplane. Multiclass classification is not really separable by one hyperplane, so we use an idea called one-vs-all classification. Here we form two new classes: one is a single original class, and the other is all the other datapoints. We do this for each class. For three classes, this gives us three new hypothesis functions:

1. h^(1)_θ(x), which tries to estimate P(y = 1|x; θ), the probability of being in class 1.
2. h^(2)_θ(x), which tries to estimate P(y = 2|x; θ), the probability of being in class 2.
3. h^(3)_θ(x), which tries to estimate P(y = 3|x; θ), the probability of being in class 3.

So then we just train all three classifiers, and on a new input we evaluate each of them and pick the class whose classifier returns the highest probability, since it is the most likely class.
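A sketch of that final prediction step, assuming each row of allTheta holds the learned parameters of one of the classifiers:

probs = sigmoid(X * allTheta');                  % m x K matrix: probs(i, k) = h^(k)_theta(x^(i))
[maxProb, predictedClass] = max(probs, [], 2);   % most probable class for each example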


7 Regularization

7.1 The Problem of Overfitting

1. Underfit - "high bias" - a strong preconception that housing prices will be linear (warning signs: high training error, high generalization error)
2. Just right
3. Overfit - "high variance" - not enough data (warning signs: low training error, high generalization error)

Overfitting fits the training set very well but fails to generalize to other examples.

If overfitting:

1. Reduce the number of features, either by manually selecting which features to keep or by model selection algorithms
2. Regularization: keep all the features, but reduce the magnitudes/values of the parameters θ_j; works well when we have lots of features.

7.2 Cost Function

Intuition: suppose we penalized θ_3 and θ_4 to be really small. This gives us (again) a (basically) quadratic fit. Small values of the θ_i amount to a "simpler hypothesis" which is less prone to overfitting.

In regularization, we add a penalty term:

J(θ) = (1/2m) [∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2 + λ ∑_{j=1}^n θ_j^2]  (54)

The regularization parameter λ balances fitting the training data against keeping the parameters small. Picking λ too large will essentially give us a constant hypothesis: a massive underfit, or too high a bias! So some care should be given to picking λ in a good way.

7.3 Regularized Linear Regression

For regularized linear regression, gradient descent becomes:

θ_0 := θ_0 − α (1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)_0  (55)
θ_j := θ_j − α [(1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)_j + (λ/m) θ_j]  (56)

For the normal equation approach, we create the design matrix X and target vector y as before, and then find that the minimum of J(θ) occurs at

θ = (X^T X + λ·diag(0, 1, 1, ..., 1))^{−1} X^T y  (57)

where the (n+1)×(n+1) diagonal matrix has a 0 in the θ_0 slot. Earlier we talked about non-invertibility: if m ≤ n, then X^T X is non-invertible (singular), and pinv will probably not give a super sweet solution. But it turns out that X^T X + λ·diag(0, 1, ..., 1) is never singular for λ > 0.
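A sketch of eq. (57) in Octave, building the diag(0, 1, ..., 1) matrix explicitly (n + 1 is the number of parameters):

L = eye(n + 1);
L(1, 1) = 0;                               % don't regularize theta_0
theta = pinv(X' * X + lambda * L) * X' * y;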

7.4 Regularized Logistic Regression

Last time we saw that logistic regression can be prone to overfitting, so we modify it to include regularization:

J(θ) = −(1/m) ∑_{i=1}^m [y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i)))] + (λ/2m) ∑_{j=1}^n θ_j^2  (58)

Now we do the same trick for the gradient:

θ_0 := θ_0 − α (1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)_0  (61)
θ_j := θ_j − α [(1/m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i)) x^(i)_j + (λ/m) θ_j]  (62)

8 Neural Networks: Representation

8.1 Non-linear hypothesis

Currently neural nets are the state of the art. We will use neural networks for nonlinear classification tasks, particularly with a large number of features.

Including all quadratic terms is not a viable idea: the number of features would scale like O(n^2), which would be prone to overfitting. Including all cubic terms would of course be even worse!

Computer vision is one very hard case. Suppose we have 50x50-pixel images, so we have n = 2500 features in grayscale, or 7500 for RGB. Quadratic features would then be on the order of 3 million. Going to 100x100 pixels with quadratic features gives 50 million features!

8.2 Neurons and the Brain

Neural networks are a pretty old algorithm. The origins were in algorithms that tried to mimic the brain; use surged in the 80s-90s, but popularity diminished in the late 90s. We're experiencing a resurgence, though!

We take the "one learning algorithm" hypothesis: the brain can rewire itself to process sights or sounds or touch. So instead of trying to come up with separate algorithms for every task we would want to do, we want to come up with the learning algorithm! Some evidence for this:

1. seeing with your tongue
2. human echolocation
3. haptic belts for a direction sense
4. implanting a third eye

8.3 Model Representation I

Basically, neurons take input through their dendrites and output via their axons. They communicate with spikes of electricity, passing these spikes along.

We simulate this with a much simpler model: a logistic unit which takes a bunch of inputs (figure 1245). Sometimes we include a "bias" unit where x_0 = 1. We call the sigmoid/logistic function an "activation" function. We also might call the parameters "weights". Neural networks just string together bunches of these units.

Assembled as follows: Layer 1 is the input layer, Layer 2 a hidden layer, Layer 3 the output layer. We denote:

1. a^(j)_i - activation of unit i in layer j.
2. Θ^(j) - matrix of weights controlling the function mapping from layer j to layer j + 1.

For 3 input nodes, 3 hidden nodes, and one output, we have the equations:

a^(2)_1 = g(Θ^(1)_{10} x_0 + Θ^(1)_{11} x_1 + Θ^(1)_{12} x_2 + Θ^(1)_{13} x_3)  (63)
a^(2)_2 = g(Θ^(1)_{20} x_0 + Θ^(1)_{21} x_1 + Θ^(1)_{22} x_2 + Θ^(1)_{23} x_3)  (64)
a^(2)_3 = g(Θ^(1)_{30} x_0 + Θ^(1)_{31} x_1 + Θ^(1)_{32} x_2 + Θ^(1)_{33} x_3)  (65)
a^(3)_1 = g(Θ^(2)_{10} a^(2)_0 + Θ^(2)_{11} a^(2)_1 + Θ^(2)_{12} a^(2)_2 + Θ^(2)_{13} a^(2)_3)  (66)

where we note that h_Θ(x) = a^(3)_1. Then Θ^(1) ∈ R^{3×4}. In general, if the network has s_j units in layer j and s_{j+1} units in layer j + 1, then Θ^(j) will be of dimensions s_{j+1} × (s_j + 1).

8.4 Model Representation II

We need a vectorized implementation; as we've written it, it's a little unwieldy.

We'll introduce some notation:

z^(2)_1 = Θ^(1)_{10} x_0 + Θ^(1)_{11} x_1 + Θ^(1)_{12} x_2 + Θ^(1)_{13} x_3  (67)
z^(2)_2 = Θ^(1)_{20} x_0 + Θ^(1)_{21} x_1 + Θ^(1)_{22} x_2 + Θ^(1)_{23} x_3  (68)
z^(2)_3 = Θ^(1)_{30} x_0 + Θ^(1)_{31} x_1 + Θ^(1)_{32} x_2 + Θ^(1)_{33} x_3  (69)

and also

a^(2)_1 = g(z^(2)_1),  a^(2)_2 = g(z^(2)_2),  a^(2)_3 = g(z^(2)_3).  (70)


x1  x2  hΘ(x)
0   0   g(−30) ≈ 0
0   1   g(−10) ≈ 0
1   0   g(−10) ≈ 0
1   1   g(10) ≈ 1

Table 2: Evaluation of a neural net for logical AND.

x1  x2  hΘ(x)
0   0   g(−10) ≈ 0
0   1   g(+10) ≈ 1
1   0   g(+10) ≈ 1
1   1   g(+30) ≈ 1

Table 3: Evaluation of a mystery neural net. Here the weights are Θ_{10} = −10, Θ_{11} = +20, and Θ_{12} = +20, so h_Θ(x) = g(−10 + 20x_1 + 20x_2). From the table, it is clear that this is x_1 OR x_2.

Now notice that the z_i equations look like a matrix multiplication! Define

x = (x_0, x_1, x_2, x_3)^T,   z^(2) = (z^(2)_1, z^(2)_2, z^(2)_3)^T  (73)

and then note that

z^(2) = Θ^(1) x,   a^(2) = g(z^(2)).  (74)

To add a bit of consistency, we define the input as x = a^(1), so that

z^(2) = Θ^(1) a^(1),   a^(2) = g(z^(2)).  (75)

We also take a^(2)_0 = 1. Then we have

z^(3) = Θ^(2) a^(2),   h_Θ(x) = a^(3) = g(z^(3)).  (76)

This is called forward propagation!

Intuitively, this looks a LOT like standard logistic regression, except that we sort of learn our own features, determined by the weight matrix, so we're adding a lot of flexibility.

We can have other network architectures. Anything that is not an input layer or output layer is called a "hidden layer".
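A vectorized Octave sketch of forward propagation for this 3-layer network (Theta1 and Theta2 are the weight matrices, sigmoid is as in Section 6.2, and x is a single input column vector):

a1 = [1; x];                 % add the bias unit x_0 = 1
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];       % add the bias unit a^(2)_0 = 1
z3 = Theta2 * a2;
h = sigmoid(z3);             % = a^(3) = h_Theta(x)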

8.5 Examples and Intuitions

If we want to learn AND, we could take the bias weight to be −30 and each weight on x_1, x_2 to be +20. Then h_Θ(x) = g(−30 + 20x_1 + 20x_2). Now consider Table (2).

8.6 Examples and Intuitions II

It is quite simple to find a logical NOT network: just take a +10 bias and a −20 weight, so that h_Θ(x) = g(10 − 20x_1).

To find a network that represents the boolean function (NOT x_1) AND (NOT x_2), consider the possibilities in Table (4). Clearly this boolean function is represented best by h1_θ(x). Finally, consider the XOR/XNOR problem (figure 1307). We can express x_1 XNOR x_2 as the OR of x_1 AND x_2 with (NOT x_1) AND (NOT x_2), so putting together the pieces we've already built makes this possible.

One fun example: Yann LeCun's handwriting recognition code for reading zipcodes.

x1  x2  h1_θ(x)      h2_θ(x)      h3_θ(x)      h4_θ(x)
0   0   g(10) ≈ 1    g(30) ≈ 1    g(−10) ≈ 0   g(20) ≈ 1
0   1   g(−10) ≈ 0   g(10) ≈ 1    g(10) ≈ 1    g(−10) ≈ 0
1   0   g(−10) ≈ 0   g(10) ≈ 1    g(10) ≈ 1    g(0) ≈ 0.5
1   1   g(−30) ≈ 0   g(−10) ≈ 0   g(30) ≈ 1    g(10) ≈ 1

Table 4: Here we have: h1_θ(x) = g(10 − 20x_1 − 20x_2), h2_θ(x) = g(30 − 20x_1 − 20x_2), h3_θ(x) = g(−10 + 20x_1 + 20x_2), and h4_θ(x) = g(20 + 20x_1 − 30x_2).

8.7 Multiclass Classification

For multiclass classification with neural networks, we basically build on the one-vs-all example by having an output unit for each possible class; that is, [h_Θ(x)]_i = 1 if and only if x is from class i. E.g., picture classification might try to distinguish pedestrian vs. car vs. truck vs. motorcycle.

9 Neural Networks: Learning

9.1 Cost Function

Now we'd like to find out how to automatically learn the parameters of the network instead of building them ourselves. To do this, we'll need a cost function, as always!

Our given data is of the form {x^(i), y^(i)} for i = 1, ..., m. Let L be the total number of layers in the network, and s_l the number of units (not counting the bias unit) in layer l.

9.1.1 Binary Classification

Here it is easy: our classes are y = 0 or y = 1, and we have 1 output unit. Then clearly h_Θ(x) ∈ R, s_L = 1, and K = 1.

9.1.2 Multi-class classification

Suppose K classes. Then y ∈ R^K. For example,

(1,0,0,0)^T,  (0,1,0,0)^T,  (0,0,1,0)^T,  (0,0,0,1)^T  (77)

might represent pedestrian, car, motorcycle, and truck, respectively. This requires K output units! Also h_Θ(x) ∈ R^K and s_L = K, for K ≥ 3.

The cost function for neural nets is a generalization of the logistic one, so we have

J(Θ) = −(1/m) [∑_{i=1}^m ∑_{k=1}^K y^(i)_k log(h_Θ(x^(i)))_k + (1 − y^(i)_k) log(1 − (h_Θ(x^(i)))_k)]
       + (λ/2m) ∑_{l=1}^{L−1} ∑_{i=1}^{s_l} ∑_{j=1}^{s_{l+1}} (Θ^(l)_{ji})^2  (78)

Here h_Θ(x) ∈ R^K and (h_Θ(x))_i is the ith output.

9.2 Backpropagation Algorithm

Since we'd like to compute min_Θ J(Θ) using an advanced optimization method, we will need to compute:

1. J(Θ)
2. (∂/∂Θ^(l)_{ij}) J(Θ).


Quick recap: given one training example (x, y), we first apply forward propagation:

a^(1) = x  (81)
z^(2) = Θ^(1) a^(1),   a^(2) = g(z^(2))  (82)
z^(3) = Θ^(2) a^(2),   a^(3) = g(z^(3))  (83)
z^(4) = Θ^(3) a^(3),   a^(4) = g(z^(4))  (84)

This allows us to compute all the activations. We then compute δ^(l)_j, the "error" of node j in layer l:

δ^(4) = a^(4) − y  (89)
δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))  (90)
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))  (91)

where g'(z^(l)) = a^(l) .* (1 − a^(l)). It is possible to prove that

(∂/∂Θ^(l)_{ij}) J(Θ) = a^(l)_j δ^(l+1)_i.  (93)

So here is the backpropagation algorithm, given a training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}. Set Δ^(l)_{ij} = 0 for all l, i, j. Then for i = 1 to m (for each training point):

1. Set a^(1) = x^(i)
2. Perform forward propagation to compute a^(l) for l = 2, 3, ..., L
3. Using y^(i), compute δ^(L) = a^(L) − y^(i)
4. Compute δ^(L−1), δ^(L−2), ..., δ^(2)
5. Let Δ^(l)_{ij} := Δ^(l)_{ij} + a^(l)_j δ^(l+1)_i. (Vectorization is possible: Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T.)

Then finally, we calculate

D^(l)_{ij} := (1/m) Δ^(l)_{ij} + λ Θ^(l)_{ij}   if j ≠ 0
D^(l)_{ij} := (1/m) Δ^(l)_{ij}                  if j = 0  (94)

and it is possible to prove that (∂/∂Θ^(l)_{ij}) J(Θ) = D^(l)_{ij}.
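A sketch of the accumulation loop for the 3-layer network of Section 8 (Theta1, Theta2, and sigmoid as before; X has one example per row, Y one 0/1 target row per example):

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  a1 = [1; X(i, :)'];                          % forward propagation
  z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;  a3 = sigmoid(z3);
  d3 = a3 - Y(i, :)';                          % delta for the output layer
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));     % includes a bias entry...
  d2 = d2(2:end);                              % ...which we drop
  Delta2 = Delta2 + d3 * a2';
  Delta1 = Delta1 + d2 * a1';
end
D1 = Delta1 / m;                               % unregularized partial derivatives
D2 = Delta2 / m;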

9.3 Backpropagation Intuition

Backprop is a bit harder to really get the intuition for; we'll consider the mechanical steps. Consider a 4-layer network with 2 inputs and one output. Backpropagation is just the chain rule applied to forward propagation. Formally, for one example,

δ^(l)_j = (∂/∂z^(l)_j) (y^(i) log h_Θ(x^(i)) + (1 − y^(i)) log(1 − h_Θ(x^(i)))).  (95)

Listing 17: Reshaping

Theta1 = ones(10, 11);
Theta2 = ones(10, 11);
Theta3 = ones(1, 11);
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
reshape(thetaVec(1:110), 10, 11)
reshape(thetaVec(111:220), 10, 11)
reshape(thetaVec(221:231), 1, 11)

9.4 Implementation Note: Unrolling Parameters

For the advanced minimization algorithms, we will need to unroll the parameter matrices into a single vector. We can do this with the Theta1(:) idea, and recover the matrices with reshape.

In practice, we have initial matrices Θ^(1), etc. We unroll them to get initialTheta to pass to fminunc. Then in our cost function, we get Θ^(1), etc. back from thetaVec, use forward prop/backprop to compute the D^(l) and J(Θ), and then unroll the D^(l) to get gradientVec!

9.5 Gradient Checking

Backprop implementations can harbor subtle errors, so we want to use gradient checking to make sure that our backprop is properly implemented.

Gradient checking works by finite differences. By basic calculus, we know that (for θ ∈ R)

dJ/dθ ≈ (J(θ + ε) − J(θ − ε)) / (2ε)  (96)

This is the centered difference approximation. We can apply this idea to a parameter vector θ as well, one component at a time, which gives us the full gradient, approximately! We can implement it as in Listing (18). We use the numerical approximation to ensure that DVec and gradApprox are similar. BUT never leave gradient checking on in real training runs: it is far too slow!

Listing 18: finite difference gradient check

for i = 1:n
  thetaP = theta;
  thetaP(i) = thetaP(i) + EPSILON;
  thetaM = theta;
  thetaM(i) = thetaM(i) - EPSILON;    % note the minus sign
  gradApprox(i) = (J(thetaP) - J(thetaM)) / (2 * EPSILON);
end

9.6 Random Initialization

One last point: we need random initialization of Θ for neural networks, since with a uniform (e.g., all-zero) initialization the parameters feeding each hidden unit will all stay identical. We use symmetry breaking, where we initialize each entry in [−ε, ε]. We might use something like Listing (19).

Listing 19: Symmetry Breaking

Theta1 = (2*rand(10, 11) - 1) * INIT_EPSILON;
Theta2 = (2*rand(1, 11) - 1) * INIT_EPSILON;


9.7 Putting it Together

1. First pick a network architecture, i.e., the connectivity between neurons. The number of input units is the dimension of the features x^(i), and the number of output units is the number of classes. A reasonable default is 1 hidden layer; if using more than one hidden layer, have the same number of hidden units in each layer (usually the more, the better). Usually the number of hidden units in each layer is comparable to the number of input units.

2. Next, we need to implement:

(a) Randomly initialize weights
(b) Implement forward prop to get h_Θ(x^(i)) for any x^(i)
(c) Implement code for the cost function J(Θ)
(d) Implement backprop to compute the partial derivatives (∂/∂Θ^(l)_{jk}) J(Θ):
    i. Perform forward and back-propagation using (x^(i), y^(i))
    ii. Get the activations a^(l) and delta terms δ^(l) for l = 2, ..., L
    iii. Accumulate Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T
    and now we have (∂/∂Θ^(l)_{jk}) J(Θ) as desired.
(e) Use gradient checking to compare the derivatives computed via backprop against the numerical estimate. Then disable gradient checking!
(f) Use gradient descent or an advanced optimization method with backprop to minimize J(Θ). This is non-convex, so we might land in local optima!

9.8 Autonomous Driving

We can use neural networks for autonomous driving. The system basically uses neural nets to learn the steering from what the camera sees vs. what the human driver is doing, with just three hidden layers, updating 12 times per second. They use multiple networks for multiple situations.

10 Advice for Applying Machine Learning

10.1 Deciding what to Try Next

Debugging a learning algorithm: suppose we use regularized linear regression and find unacceptably large errors in its predictions. Some things to try:

1. Get more training examples (but be careful here!)
2. Try a smaller set of features, to reduce overfitting
3. Try getting additional features (but this can be expensive)
4. Add polynomial features (x_1^2, x_2^2, x_1x_2, ...)
5. Decrease λ
6. Increase λ.

These should be picked systematically, not at random! Diagnostic: a test you can run to gain insight into what is or isn't working with a learning algorithm, and to gain guidance as to how best to improve performance.

10.2 Evaluating a Hypothesis

Really low error on the training set might not be a good measure of performance. So we split the data into a training set and a test set, typically 70% and 30% respectively. As always we have (x^(i), y^(i)), i = 1, ..., m, but we also introduce m_test and (x^(i)_test, y^(i)_test), i = 1, ..., m_test, to evaluate the hypothesis.

Then we can diagnose overfitting by looking for cases where J(θ) is low but J_test(θ) is high.

We would want to pick a random 30% as the test set. Here's how it works in practice:

1. Learn the parameters θ from the training data, minimizing the training error J(θ)
2. Compute the test set error J_test(θ)

We can also use the misclassification error (0/1 misclassification error):

err(h_θ(x), y) = { 1 if h_θ(x) ≥ 0.5 and y = 0, or h_θ(x) < 0.5 and y = 1
                   0 otherwise  (97)

Then the test error is

(1/m_test) ∑_{i=1}^{m_test} err(h_θ(x^(i)_test), y^(i)_test).  (98)
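A vectorized Octave sketch of eq. (98), assuming Xtest has a leading column of ones, ytest holds 0/1 labels, and theta has been learned:

predictions = sigmoid(Xtest * theta) >= 0.5;       % 0/1 vector of predicted labels
testError = mean(double(predictions ~= ytest));    % fraction misclassified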

10.3 Model Selection and Train/Validation/TestSets

We also might sometimes want to pick which λ, what order of polynomial, etc. is used. This is the model selection problem. One example, where d is the degree of the polynomial:

1. d = 1: h_θ(x) = θ_0 + θ_1x
2. d = 2: h_θ(x) = θ_0 + θ_1x + θ_2x^2
3. d = 3: h_θ(x) = θ_0 + θ_1x + ... + θ_3x^3
4. etc.
5. d = 10: h_θ(x) = θ_0 + θ_1x + ... + θ_10x^10

Denote the solution you get from degree d as θ^(d). Then we could compute J_test(θ^(d)) and pick the d which gives the best value; suppose d = 5 gives the best fit. How well does the model generalize? Report the test set error J_test(θ^(5)).

But there's a problem: J_test(θ^(5)) is likely to be an overly optimistic estimate of the generalization error, because our extra parameter d was fit to the test set.

How to get around it? We split the data into three pieces:

1. Training set (typically 60%), denoted (x^(i), y^(i)), i = 1, ..., m
2. Cross-validation set (typically 20%), denoted (x^(i)_cv, y^(i)_cv), i = 1, ..., m_cv
3. Test set (typically 20%), denoted (x^(i)_test, y^(i)_test), i = 1, ..., m_test

Then we also have three different error functions:

J_train(θ) = (1/2m) ∑_{i=1}^m (h_θ(x^(i)) − y^(i))^2  (Training)
J_cv(θ) = (1/2m_cv) ∑_{i=1}^{m_cv} (h_θ(x^(i)_cv) − y^(i)_cv)^2  (Cross-Validation)
J_test(θ) = (1/2m_test) ∑_{i=1}^{m_test} (h_θ(x^(i)_test) − y^(i)_test)^2  (Test)

Then we do the following:

1. For each potential model d, pick θ^(d) by min_θ J_train(θ), and evaluate J_cv(θ^(d))
2. Pick the best model d
3. Estimate the generalization error by J_test(θ^(d)).

10.4 Diagnosing Bias vs Variance

If a learning algorithm doesn't work well, it's almost always either high bias or high variance. With training, cross-validation, and test data, we're much better prepared to deal with this. In particular, J_cv(θ) is usually quadratic-shaped (in d), and J_train(θ) is usually 1/d-shaped (in d). Note here that d is a proxy for the complexity of the model, but specifically stands for the degree of the polynomial. So if d = 1 or so, then high error indicates high bias. If d is high and the J_cv error is high, then this is high variance.

Summarizing:

1. High bias is indicated by J_train high and J_cv ≈ J_train.
2. High variance is indicated by J_train low and J_cv ≫ J_train.

10.5 Regularization and Bias/Variance

Regularization can help prevent overfitting, as we've seen. But we need to come up with a way to choose the regularization parameter λ. We might try 12 different models, from λ = 0 up to λ = 10.24. Then we get θ^(i) for i = 1, ..., 12. We evaluate each on the cross-validation set via J_cv(θ^(i)) and pick the best one, i*. Then we report the test error J_test(θ^(i*)). When we do the plots, λ comes out opposite to d: we get the high-bias regime on the right and high variance on the left.

10.6 Learning Curves

Learning curves work as a sanity check and a diagnostic. To make one, we plot J_train and J_cv as a function of m (figure from 1350).

High bias case: J_cv and J_train converge to each other at a high level of error.

High variance case: J_train looks about the same, and J_cv is much higher. The gap is a primary indication of high variance. In this case, adding much more data should bring the gap tighter, so we can say that getting more data is likely to help.

10.7 Deciding what to do next Revisited

1. Get more training examples: fixes high variance
2. Try a smaller set of features: fixes high variance
3. Try getting additional features: fixes high bias
4. Adding polynomial features: fixes high bias
5. Decreasing λ: fixes high bias
6. Increasing λ: fixes high variance.

Neural networks: small neural networks have fewer parameters and are more prone to underfitting, but are computationally cheaper. Large neural networks have more parameters and are more prone to overfitting, while being computationally more expensive; larger neural networks usually use regularization to address the overfitting. You can also try cross-validating across varying numbers of layers in the neural network.

11 Machine Learning System Design

We’ll consider Spam classification.

11.1 Prioritizing What to Work On

Suppose we have two classes:

1. Spam (y = 1)
2. Not spam (y = 0)

Suppose we set up supervised learning where x is a bunch of features of the email and y is spam or not spam. One approach: choose 100 words indicative of spam/not spam (e.g.: deal, buy, discount, andrew, now, ...). That is, x_j is associated with the jth word in the list and

x_j = { 1 if word j appears in the email
        0 otherwise  (99)

Note: in practice we would just take the n most frequently occurring words (10k-50k) in the training set, rather than 100 hand-picked words.

What are some potential ways to get good data?

1. Collect lots of data (e.g., from a "honeypot" project)
2. Develop sophisticated features based on email routing information (from the email headers)
3. Develop sophisticated features from the message body, like: should "discount" and "discounts" be the same word? Deal vs. Dealer?
4. Develop sophisticated algorithms to detect misspellings.

11.2 Error Analysis

Start with a simple algorithm that you can implement quickly. Implement it and test it on the cross-validation data. Then we can plot learning curves to decide if more data, more features, etc. are likely to help.

Error analysis: manually examine the examples in the cross-validation set that the algorithm made errors on. See if you can spot any systematic trend in what type of examples it is making errors on. This inspires new features.

E.g., in the misclassified set, categorize the examples based on:

1. what type of email it is (e.g., pharma, replica, password stealing, etc.)
2. what cues you think would have helped the algorithm classify them correctly.

Then we pick the biggest categories or the most common flags.

This is why we want a quick and dirty algorithm: to give us the hard examples.

Another idea for the spam classifier: the Porter stemmer. But stemming can also be problematic (e.g., universe vs. university?). The only solution: try it and see if it works!

Note here that we use the cross-validation set for error analysis, NOT the test data; otherwise J_test is not a good estimate of the generalization error. E.g., if stemming gives 3% error and not stemming gives 5% error, use stemming. It is important to have a numerical test, a single real-number metric of performance.

Don't worry about something being too quick and dirty; that's almost impossible. Just get a prototype!

11.3 Error Metrics for Skewed Classes

So, single error metrics are extremely important. But they are also extremely tricky when we have very skewed classes.

For example: train a cancer classifier, and suppose we got 1% error on the test set, while 0.5% of patients actually have cancer. One clear problem with that error metric is shown in Listing (20), which actually has less error than our learning algorithm!

Listing 20: Skewed Predictor

function y = predictCancer(x)
  y = 0;    % ignore x entirely: always predict "no cancer"
return

So we need something better in the case of skewed classes. For very skewed classes, we want to use precision/recall. We want a confusion table like the one below. We define two things:

13

Page 14: Mlclass Notes

             actual 1         actual 0
predicted 1  true positive    false positive
predicted 0  false negative   true negative

1. Precision: Of all patients where we predicted y = 1,what fraction actually has cancer?

true pos

# predicted pos=

true pos

true pos + false pos(PRECISION)

2. Recall: Of all patients that actually have cancer, whatfraction did we correctly detect as having cancer?

true pos

# actual pos=

true pos

true pos + false neg(RECALL)

There is no way for an algorithm to "cheat" on precision and recall at the same time, so high values of both are a good indicator of a good algorithm.

Typically we label the rarer class with y = 1.
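A quick Octave sketch of these definitions (my own illustration; ypred and yval are assumed to be 0/1 vectors of predictions and cross-validation labels):

% Precision, recall, and F1 from 0/1 prediction vectors.
tp = sum((ypred == 1) & (yval == 1));   % true positives
fp = sum((ypred == 1) & (yval == 0));   % false positives
fn = sum((ypred == 0) & (yval == 1));   % false negatives

precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1 = 2 * precision * recall / (precision + recall);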

11.4 Trading off Precision and Recall

We can trade off between precision and recall. Suppose we have a logistic regression classifier with 0 ≤ h_θ(x) ≤ 1, and we predict 1 if h_θ(x) ≥ β and 0 if h_θ(x) < β, with β = 0.5 (the typical case!).

We can come up with a higher-precision algorithm by increasing β; this corresponds to saying we predict cancer only if we are 100β% sure that the patient has cancer. But this lowers recall! We miss some cancer patients by only predicting when we are more confident.

We can also come up with a higher-recall algorithm by decreasing β; this corresponds to predicting cancer even if we are less sure that they have it. This helps us avoid false negatives, and thereby increases recall. Unfortunately, it lowers precision.

Generally, you want to predict y = 1 if h_θ(x) ≥ β. You can even make a precision-recall curve: make a scatterplot of recall vs precision for a bunch of β values and see what shape it follows.

Now, what if we have a bunch of algorithms? Can we compare precision and recall automatically? Some ideas:

1. Average: (P + R)/2. Not very good.
2. F1 score: 2PR/(P + R). This is more typically used.

In summary: measure precision and recall on the cross-validation set and choose the value of the threshold which maximizes 2PR/(P + R).
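A hedged Octave sketch of that summary: scan a range of thresholds and keep the one maximizing F1 on the cross-validation set. hcv (the hypothesis evaluated on the CV examples) and yval are assumed names:

% Scan thresholds beta and keep the one maximizing F1 on the CV set.
best_F1 = 0; best_beta = 0.5;
for beta = 0.01:0.01:0.99
  ypred = (hcv >= beta);
  tp = sum((ypred == 1) & (yval == 1));
  fp = sum((ypred == 1) & (yval == 0));
  fn = sum((ypred == 0) & (yval == 1));
  F1 = 2*tp / (2*tp + fp + fn);          % algebraically equal to 2PR/(P+R)
  if F1 > best_F1
    best_F1 = F1; best_beta = beta;
  end
end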

11.5 Data for Machine Learning

How much data should we train on? Well, under certain conditions, we want as much as possible.

Banko and Brill, 2001. Eg: classify between confusable words.

"It's not who has the best algorithm that wins. It's who has the most data." (Sometimes.)

The large data rationale: assume the features x ∈ R^{n+1} have sufficient information to predict y accurately.

1. Example: "For breakfast I ate (blank) eggs". A human English expert could fill this in.
2. Counterexample: predict housing price from only the size and no other features.

Useful test: given input x, can a human expert confidently predict y?

The algorithm should be low bias (that is, J_train(θ) is small). If we use a very large training set, we are unlikely to overfit, so J_train(θ) ≈ J_test(θ). Then J_test(θ) should be small. This is a case where we suspect the large data rationale to hold, and we should throw as much data as possible at it.

Conversely, a large training set is unlikely to help when:

1. The features x do not contain enough information to predict y accurately, and we are using a simple learning algorithm such as logistic regression.
2. The features x do not contain enough information to predict y accurately, even if we are using a neural network with a large number of hidden units.

12 Support Vector Machines

12.1 Optimization Objective

How about an alternative view of logistic regression? Suppose we have

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}   (100)

so if y = 1 we want h_θ(x) ≈ 1, which means we want θ^T x ≫ 0; and if y = 0, we want h_θ(x) ≈ 0, so we want θ^T x ≪ 0.

Consider the cost function of one example (and introduce z = θ^T x):

-y \log h_\theta(x) - (1 - y) \log(1 - h_\theta(x))   (101)

So if y = 1, then only the first term matters: -\log \frac{1}{1 + e^{-z}}. Since we're minimizing the cost function, the cost shrinks toward its minimum as z gets larger and larger. The main idea of support vector machines is to approximate -\log \frac{1}{1 + e^{-z}} by

\mathrm{cost}_1(z) = \begin{cases} 0 & z > 1 \\ \frac{1}{2} - z & z < 1 \end{cases}   (102)

(see plot 1757). We can do an equivalent thing for the y = 0 case. We call these two cost functions cost_1(z) and cost_0(z). So now we have a support vector machine optimization problem which is similar to logistic regression:

\min_\theta \sum_{i=1}^m \left[ y^{(i)} \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{\lambda}{2} \sum_{j=0}^n \theta_j^2   (103)

[nb: I think Ng made a mistake here; the second summation should run from j = 1 to n.] We have made a couple of changes:

1. We multiply through by m. This changes the minimum, but we only care about the minimizer.
2. Instead of solving min A + λB we solve min CA + B, which again only changes the minimum, not the minimizer. (Here, C = 1/λ.)

One important point: the hypothesis function DOES NOT RETURN A PROBABILITY! We have the hypothesis:

h_\theta(x) = \begin{cases} 1 & \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}   (104)



12.2 Large Margin Intuition

Support vector machines are also known as Large Margin Classifiers. Quick recap:

1. If y = 1, we want θ^T x ≥ 1 (not just ≥ 0)
2. If y = 0, we want θ^T x ≤ −1 (not just < 0)

So this is saying we don't just want the logistic function requirement; we want something even stronger!

Suppose we take C huge. Then we really want each cost term equal to zero. Then we're very motivated to have θ^T x^{(i)} ≥ 1 whenever y^{(i)} = 1 and θ^T x^{(i)} ≤ −1 whenever y^{(i)} = 0. Then in the linearly separable case, we have something like (ss 1810). This is basically trying to separate the positive and negative examples with as large a margin as possible.

What happens with outliers? If C is very large, then we are extremely sensitive to outliers, while if C is more moderate then we accept a few errors to have a clearer and simpler boundary.

12.3 Mathematics Behind Large Margin Classification

First off, u^T v is the inner product. We have \|u\| being the length of the vector u, which is \|u\| = \sqrt{u_1^2 + u_2^2} \in \mathbb{R}. We can also define p to be the length of the projection of v onto u; then u^T v = p \cdot \|u\|. But it's usually formulated as

u^T v = u_1 v_1 + u_2 v_2   (105)

Furthermore, p is a signed number: it can be negative. If we simplify by setting θ_0 = 0, then we can write the SVM problem as

\min_\theta \frac{1}{2} \sum_{j=1}^n \theta_j^2
\text{s.t. } \theta^T x^{(i)} \ge 1 \text{ if } y^{(i)} = 1
\theta^T x^{(i)} \le -1 \text{ if } y^{(i)} = 0   (106)

Then clearly the objective \frac{1}{2}(\theta_1^2 + \theta_2^2) = \frac{1}{2}(\sqrt{\theta_1^2 + \theta_2^2})^2 = \frac{1}{2}\|\theta\|^2. Then we are just minimizing the squared norm. Next, note that \theta^T x^{(i)} = p^{(i)}\|\theta\| = \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)}. That means we can replace the constraints and have a problem like

\min_\theta \frac{1}{2}\|\theta\|^2
\text{s.t. } p^{(i)}\|\theta\| \ge 1 \text{ if } y^{(i)} = 1
p^{(i)}\|\theta\| \le -1 \text{ if } y^{(i)} = 0   (107)

It is possible to show that θ is perpendicular to the separating boundary. Then we can (straightforwardly) argue that \|\theta\| can be made much smaller if the separating boundary makes p^{(i)} as large as possible. Since we are trying to minimize \|\theta\| and still satisfy the constraints, it is clear that the largest-margin decision boundary is picked by the SVM problem.

The same large-margin argument works when θ_0 ≠ 0.

12.4 Kernels I

To make nonlinear classifiers, we can use complex polynomial features. In particular, we can define a bunch of features f_i from the original ones x_j. Is there a different/better choice of features than the quadratic terms?

One idea: given x, compute new features depending on proximity to landmarks l^{(1)}, l^{(2)}, and l^{(3)}. Eg: given x,

f_i = \mathrm{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)   (108)

where these are called Gaussian kernels. We can denote them as k(x, l^{(i)}). Consider what happens when:

1. if x ≈ l^{(1)}: f_1 ≈ \exp(-\frac{0^2}{2\sigma^2}) ≈ 1
2. if x is far from l^{(1)}: f_1 ≈ \exp(-\frac{(\text{large})^2}{2\sigma^2}) ≈ 0

It is clear that (relative to σ² = 1), σ² = 0.5 falls to zero much more quickly, and σ² = 3 falls away much more slowly.
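A small Octave sketch of equation (108), computing the whole feature vector f for one example given a matrix of landmarks; the function name and calling convention are my own, not from the course:

% Gaussian-kernel features for one example x (n x 1 column vector).
% L is a (#landmarks x n) matrix whose rows are landmarks; sigma2 = sigma^2.
function f = gaussianFeatures(x, L, sigma2)
  f = zeros(size(L, 1), 1);
  for i = 1:size(L, 1)
    f(i) = exp(-norm(x - L(i, :)')^2 / (2 * sigma2));   % equation (108)
  end
end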

12.5 Kernels II

Last time we talked about the similarity function. But where would we get landmarks? One way: pick a landmark at each training point. Then we end up with m landmarks. This is nice: we have a method for talking about how close a test point is to the training data. What does this gain us? Well, now instead of x^{(i)} ∈ R^{n+1}, we have f^{(i)} ∈ R^{m+1}, which can be a substantial change in the dimension of the problem! To make a prediction on a datapoint x, you compute the feature vector f ∈ R^{m+1} and predict y = 1 if θ^T f ≥ 0.

For training, we just compute f^{(i)} for each x^{(i)}, and then just minimize

J(\theta) = C \sum_{i=1}^m \left[ y^{(i)} \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^m \theta_j^2   (109)

One last bit of implementation detail: since \sum_j \theta_j^2 = \theta^T \theta, we replace it with a rescaled version \theta^T M \theta, where M depends on the kernel. This gives rise to extremely fast algorithms for optimizing the cost function J(θ).

A few words about bias and variance (note that C = 1/λ):

1. Large C: lower bias, higher variance
2. Small C: higher bias, lower variance

The other parameter we need to choose is σ²:

1. Large σ² gives features that vary more smoothly, so higher bias, lower variance.
2. Small σ² gives features that vary less smoothly, so lower bias, higher variance.

12.6 Using an SVM

Some SVM software is very good. Use liblinear, libsvm, etc to solve for θ. You still need to specify:

1. Choice of parameter C
2. Choice of kernel

(a) No kernel/linear kernel (use if n is large and m is small)
(b) Gaussian kernel: need to choose σ². Use if n is small and/or m is large. Key point: DO perform feature scaling before using the Gaussian kernel, because otherwise large-scale variables will swamp out everything else. Requires writing a kernel function such as listing (21).
(c) Other kernels: these need to satisfy "Mercer's Theorem" so SVM packages run correctly and do not diverge. These are much less common:

i. Polynomial kernel: (x^T l + c)^d, which has parameters c and d.
ii. String kernel
iii. Chi-square kernel
iv. Histogram intersection kernel


Listing 21: Gaussian Kernel

function f = kernel(x1, x2)
  f = exp(-norm(x1 - x2)^2 / (2*sigma^2));
end


Just to reiterate: choose whatever performs best on the cross-validation data!

Multiclass classification: most SVM packages already build this in. Otherwise, use one-vs-all classification.

When would you use logistic regression vs an SVM? Take n as the number of features, and m as the number of training examples.

1. If n is large (relative to m): logistic regression or SVM without a kernel. (n ≥ m, n = 10,000, m = 10-1000)
2. If n is small and m is intermediate: SVM with a Gaussian kernel. (n = 1-1000, m = 10-10,000)
3. If n is small but m is large: create/add more features, then use logistic regression or SVM without a kernel. (eg n = 1-1000, m = 50,000+)
4. A neural network is likely to work well in most of these settings, but may be slower to train. (And you need to worry about nonconvexity.)

13 Clustering

13.1 Unsupervised Learning: Introduction

Here, we are given x^{(i)} without the y^{(i)} (still with i = 1...m). We ask an algorithm to just find some structure in the given data. The first type of algorithm we'll look at is the clustering algorithm. This is good for:

1. Market segmentation
2. Social network analysis
3. Organizing computer clusters
4. Astronomical data analysis

13.2 K-Means Algorithm

We are given a set x^{(i)} and ask the algorithm to find clusters. The basic idea: initialize random cluster centroids, then alternate two steps:

1. Cluster assignment step: assign each datapoint to the nearest cluster centroid.
2. Move centroid step: move each centroid to the center of mass of the points assigned to it.

More formally: K-means takes as input the number of clusters K and the training set {x^{(1)}, ..., x^{(m)}}. Then:

1. Randomly initialize K cluster centroids µ_1, ..., µ_K ∈ R^n.
2. Repeat:

(a) For i = 1, ..., m: let c^{(i)} be the index (from 1 to K) of the cluster centroid closest to x^{(i)} (this is a minimization problem).
(b) For k = 1, ..., K: let µ_k be the average (mean) of the points assigned to cluster k.

Some edge cases: if no points are assigned to a centroid, we might delete it or randomly re-initialize it.
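Here is a minimal Octave sketch of the loop above, assuming X is the m x n data matrix and K is given; it is an illustration, not the course's reference implementation:

% Minimal K-means sketch: X is m x n, K is the cluster count.
m = size(X, 1);
c = zeros(m, 1);
mu = X(randperm(m, K), :);          % initialize centroids at K random examples
for iter = 1:100
  % cluster assignment step: nearest centroid for each point
  for i = 1:m
    [~, c(i)] = min(sum((mu - X(i, :)).^2, 2));
  end
  % move centroid step: mean of the points assigned to each centroid
  for k = 1:K
    if any(c == k)                  % skip empty clusters (edge case above)
      mu(k, :) = mean(X(c == k, :), 1);
    end
  end
end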

Next, K-means for non-separated clusters. Sometimes datasets aren't easily separated. Eg: T-shirt sizing with height and weight. This is in fact a market segmentation problem.

13.3 Optimization Objective

Recall that c^{(i)} is the index of the cluster to which x^{(i)} is currently assigned, and µ_k is cluster centroid k.

Introduce some new notation: µ_{c^{(i)}} is the cluster centroid of the cluster to which example x^{(i)} has been assigned.

Then the optimization problem is in terms of the function J:

J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K) = \frac{1}{m} \sum_{i=1}^m \|x^{(i)} - \mu_{c^{(i)}}\|^2   (110)

and then the minimization is

\min_{c^{(1)},...,c^{(m)},\ \mu_1,...,\mu_K} J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K)   (111)

J is also called the distortion. In the above framework, the cluster assignment step minimizes J with respect to the c^{(i)} holding the µ_j fixed. The move centroid step is equivalent to minimizing J with respect to the µ_j holding the c^{(i)} fixed.

13.4 Random Initialization

Now we need to discuss how to avoid local optima. We should have K < m; otherwise things get kinda weird. We randomly pick K training examples and set µ_1, ..., µ_K equal to these K examples. This is the truly recommended way to initialize.

K-means can end up at a local optimum. This kinda sucks. To get around it, we might try running K-means many times, eg with 50-1000 different random initializations, and keep the clustering with the lowest distortion. Then we can be more assured of the goodness of our optimum. This matters most when you have small numbers of clusters.

13.5 Choosing the Number of Clusters

What is the right value of K? There probably isn't a totally correct/perfect method. The most common thing is picking the number of clusters by hand.

One heuristic: the elbow method. Run K-means for varying numbers of clusters and plot the number of clusters vs the cost function. The point of diminishing returns is called an elbow. But it's not a great method, because frequently there is no clear elbow.

Another method: frequently, a downstream purpose will imply a number of clusters.

14 Dimensionality Reduction

14.1 Motivation I: Data Compression

One example: we might want to reduce data from 2d to 1d, for example: centimeters and meters, with roundoff errors. Another way to look at it is that we might have underlying variables, like pilot skill and pilot enjoyment, determining aptitude.

More formally, we might be able to represent x^{(i)} ∈ R^2 by projecting it to z^{(i)} ∈ R.

One really important application is making machine learning algorithms run faster.

One (contrived) example: if all data in some subset of R^3 lies in a plane, then we can just use the plane representation instead. In fact, it's not so contrived, since we can project from 3d into some 2d plane.



14.2 Motivation II: Visualization

Suppose we have something like 50 features, and we'd like to visualize them to try and get some insight. One option is to use a different feature representation to get it down to some (z_1, z_2); then we can plot it easily!

Country example: might have z_1 corresponding to country size/GDP and z_2 corresponding to per-person GDP. We'd typically do data reduction to get this down to 2- or 3-dimensional data.

14.3 Principal Component Analysis Problem Formulation

What exactly is PCA? PCA tries to find a lower-dimensional surface such that the projection error from the dataset onto the lower-dimensional surface is relatively small.

One important detail: you should do feature scaling/mean normalization!

There are definitely terrible PCA surfaces.

More formally: if we want to reduce from 2 dimensions to 1, we find a direction (a vector u^{(1)} ∈ R^n) onto which to project the data so as to minimize the projection error.

More generally: reduce from n dimensions to k dimensions by finding k vectors u^{(1)}, u^{(2)}, ..., u^{(k)} onto which to project the data so as to minimize the projection error. (We'll construct a subspace onto which to project.)

How does PCA relate to linear regression? It doesn't. They're totally different algorithms: linear regression minimizes the vertical (y-direction) errors to the fit, while PCA minimizes the orthogonal projection error and treats no variable as special.

14.4 Principal Component Analysis Algorithm

Before doing principal component analysis, we really need to do data preprocessing. So we have the training set x^{(i)}, and we do mean normalization/variance scaling:

x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}   (112)

The procedure is pretty simple. Here we go:

1. First compute the covariance matrix \Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T.
2. Compute the singular value decomposition of Σ (listing (22)).

Listing 22: PCA Algorithm

[U, S, V] = svd(Sigma);

(You can also use eig for this type of matrix (positive semi-definite), but svd is a bit more stable.)

It turns out that the U returned from the SVD gives us n column vectors u^{(i)} ∈ R^n; we can use the first k of them to get the k directions onto which we project our data.

Next, we need a way to get z ∈ R^k; to get there we first introduce the n × k matrix

U_{\text{reduce}} = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(k)} \\ | & | & & | \end{bmatrix}   (113)

and then we can find

z = U_{\text{reduce}}^T x   (114)

We can compute the covariance matrix in vectorized form (listing (23)).

Listing 23: Vectorized Covariance Matrix

Sigma = (1/m) * X' * X;

The proof that all of this works is beyond the scope of this class.
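Putting listings (22) and (23) together, a minimal end-to-end Octave sketch (assuming X is m x n and already mean-normalized; the choice k = 2 is arbitrary):

% PCA sketch: X is m x n, mean-normalized.
[m, n] = size(X);
Sigma = (1/m) * X' * X;             % covariance matrix, listing (23)
[U, S, V] = svd(Sigma);             % listing (22)
k = 2;                              % example choice of dimension
Ureduce = U(:, 1:k);                % first k columns of U, equation (113)
Z = X * Ureduce;                    % each row is z = Ureduce' * x, equation (114)
Xapprox = Z * Ureduce';             % reconstruction, see equation (120) below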

14.5 Choosing the Number of Principal Components

We call k the number of principal components. The average squared projection error is

\frac{1}{m} \sum_{i=1}^m \|x^{(i)} - x_{\text{approx}}^{(i)}\|^2   (115)

and the total variation in the data is

\frac{1}{m} \sum_{i=1}^m \|x^{(i)}\|^2   (116)

Typically we choose k to be the smallest value so that

\frac{\frac{1}{m} \sum_{i=1}^m \|x^{(i)} - x_{\text{approx}}^{(i)}\|^2}{\frac{1}{m} \sum_{i=1}^m \|x^{(i)}\|^2} \le 0.01   (117)

which is saying "99% of the variance is retained" (or the equivalent statement for a threshold between 85% and 99%).

To choose k, we can use the algorithm:

1. Try PCA with k = 1.
2. Compute U_reduce, z^{(1)}, etc.
3. Check if (117) holds; if not, increase k.

The check can be evaluated much more efficiently as

1 - \frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \le 0.01   (118)

So we can just slowly increase k until

\frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \ge 0.99   (119)

Report the variance retained, not the dimensions.

We get away with this basically only because datasets tend to have very highly correlated features.

14.6 Reconstruction from Compressed Representation

If we use PCA as a compression algorithm, then what can we do to get the original data back (approximately)?

We take

x_{\text{approx}} = U_{\text{reduce}} z   (120)

which we hope will give approximately x. This process is called reconstruction.

14.7 Advice for Applying PCA

The most common use of PCA might be supervised learning speedup. Eg: computer vision! If we had a 100x100 image, then we get a 10,000-dimensional feature vector, which is a lot. An alternative: given (x^{(i)}, y^{(i)}),



1. Extract the inputs into an unlabeled dataset x^{(i)} ∈ R^{10000} and (by PCA) get z^{(i)} ∈ R^{1000}.
2. We now have a new dataset: (z^{(i)}, y^{(i)}), i = 1, ..., m.
3. Now do logistic regression or neural nets or whatever.

Critical point: the mapping from x^{(i)} to z^{(i)} should only be fit by running PCA on the training set, then reusing that mapping on the cross-validation and test sets, since the matrix U_reduce is in some sense a new parameter of the supervised learning setup!

Recap: two applications:

1. Compression
   (a) Reduce the memory/disk needed to store data
   (b) Speed up a learning algorithm
2. Visualization (mapping into k = 2 or k = 3)

One terrible application: to prevent overfitting. The idea is to use z^{(i)} instead of x^{(i)} to reduce the number of features to k < n (eg, 1000 < 10000); thus, fewer features, less likely to overfit. BAD IDEA: it might work okay, but it doesn't actually address overfitting like regularization would. (It's bad because it throws away the label data.)

PCA is sometimes used where it should not be. One clear case is this plan for building an ML system:

1. Get the training set (x^{(i)}, y^{(i)}).
2. Run PCA to reduce x^{(i)} in dimension to get z^{(i)}.
3. Train logistic regression on (z^{(i)}, y^{(i)}).
4. Test on the test set: map x_test^{(i)} to z_test^{(i)} and run h_θ(z) on (z_test^{(i)}, y_test^{(i)}).

The problem here is that PCA shouldn't be done until you've tried the ML system without PCA (on the raw data) and that didn't do what you wanted.

15 Anomaly Detection

15.1 Problem Motivation

Imagine that you design and manufacture aircraft engines. Then you have a feature set x_i which represents things like heat generated, vibration intensity, etc. Then the problem is to determine whether a new engine x_test is anomalous or not.

More formally: we estimate a density from the dataset x^{(i)} and want to see if x_test is anomalous. We hope to build a model p(x) where p(x_test) < ε is flagged as an anomaly, and otherwise x_test is a typical point.

The most typical use might be fraud detection. For example:

1. x^{(i)} is the feature vector of user i's activity:
   (a) how often they log in
   (b) number of web pages visited
   (c) number of posts on the forum
   (d) typing speed
2. Model p(x) from this data.
3. Identify unusual users by checking which have p(x) < ε.

Unfortunately, this only finds "strange people", not necessarily fraudulent people.

We can also use this to monitor computers in a data center:

1. x^{(i)} is the feature vector of machine i:
   (a) memory use
   (b) number of disk accesses/sec
   (c) CPU load
   (d) ratio of CPU load to network traffic

15.2 Gaussian Distribution

Say x ∈ R. If x is distributed Gaussian with mean µ and variance σ², then we write x ∼ N(µ, σ²). This is the standard bell shape centered at µ with approximate width σ.

For completeness:

p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)   (121)

Sometimes it is easier to think in terms of the variance σ², sometimes in terms of the standard deviation.

Parameter estimation problem: suppose we have a dataset {x^{(1)}, ..., x^{(m)}} with x^{(i)} ∈ R. We'd like to find µ and σ, supposing that x^{(i)} ∼ N(µ, σ²).

The mean is easy:

\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}   (122)

and

\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2   (123)

These are the maximum likelihood estimates. Sometimes \frac{1}{m-1} is used instead of \frac{1}{m}, but in machine learning it doesn't matter too much.

15.3 Algorithm

Consider an unlabeled training set x^{(i)} for i = 1, ..., m, where each example is x ∈ R^n. We model

p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)   (124)

This corresponds to an independence assumption! More compactly:

p(x) = \prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2)   (125)

Algorithm:

1. Choose features x_j that you think might be indicative of anomalous examples.
2. Fit the parameters µ_1, ..., µ_n, σ_1², ..., σ_n² by the standard statistics formulas:

\mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)}   (126)

\sigma_j^2 = \frac{1}{m} \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2   (127)

3. Given a new example x, compute p(x):

p(x) = \prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)   (128)

and mark x as an anomaly if p(x) < ε.
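A compact Octave sketch of steps 2-3 (my own vectorized illustration; X is the m x n training matrix, x a new example as a 1 x n row vector, and epsilon is chosen on the CV set):

% Fit per-feature Gaussians (equations 126-127) and score a new example.
mu = mean(X);                        % 1 x n vector of mu_j
sigma2 = mean((X - mu).^2);          % 1 x n vector of sigma_j^2 (1/m version)
% probability of the new example x (equation 128):
p = prod(exp(-(x - mu).^2 ./ (2*sigma2)) ./ sqrt(2*pi*sigma2));
is_anomaly = (p < epsilon);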



15.4 Developing and Evaluating an Anomaly Detection System

How do you evaluate an anomaly detection algorithm? Ideally, with a single real-valued measure of efficacy.

Assume that we have some labeled data, of anomalous and non-anomalous examples, labeled y = 1 and y = 0 respectively. As before, we have x^{(i)} for i = 1..m. We also have a cross-validation set (x_cv^{(i)}, y_cv^{(i)}) and a test set (x_test^{(i)}, y_test^{(i)}).

Aircraft engines motivating example: suppose we have 10000 good engines and 20 flawed engines. Split this as:

1. training set: 6000 good engines
2. cross-validation set: 2000 good engines, 10 anomalous
3. test set: 2000 good engines, 10 anomalous

There are alternatives, but they are not recommended. For algorithm evaluation:

1. Fit the model p(x) on the training set x^{(i)}.
2. On cross-validation, predict

y = \begin{cases} 1 & \text{if } p(x) < \varepsilon \text{ (anomaly)} \\ 0 & \text{if } p(x) \ge \varepsilon \text{ (normal)} \end{cases}   (129)

3. Possible evaluation metrics:
   (a) true positives, false positives, false negatives, true negatives
   (b) precision/recall
   (c) F1 score

Can also use a cross-validation set to choose parameter ε.

15.5 Anomaly Detection vs Supervised Learning

Some places to use anomaly detection:

1. Very small number of positive examples (y = 1) (0-20 iscommon).

2. Large number of negative examples3. many different “types” of anomalies. Hard for any al-

gorithm to learn from positive examples what anomalieslook like; future anomalies may look nothing like any ofthe anomalous examples we’ve seen so far

Some places to use supervised learning:

1. Large numbers of positive and negative examples.
2. Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set.

Counterexample: spam, even though there are many types of spam.

Use cases for anomaly detection:

1. Fraud detection
2. Manufacturing
3. Monitoring machines in a data center

Supervised Learning

1. Email spam classification
2. Weather prediction
3. Cancer prediction

15.6 Choosing what Features to Use

Features have a huge effect on anomaly detection. One implicit assumption was that the features are Gaussian-like, at least vaguely. You should try plotting histograms of the data.

If you have trouble, try to transform the data. Eg: if it is exponentially distributed, try taking the log; then it will look much more Gaussian. Transformations:

1. x ← log(x + c)
2. x ← x^c (0 < c < 1)

How do you come up with features? Well, we want p(x) large for normal examples and p(x) small for anomalous examples x. One common problem is that p(x) is comparable for normal and anomalous examples. We hope to come up with features that distinguish the anomalous datapoints by assigning them extremely low probability.

Case study: computers in a data center:

1. memory use of the computer
2. number of disk accesses/sec
3. CPU load
4. network load

We notice that CPU load and network load tend to grow with each other. When would this not happen? When CPU load is high and network load is low. So we introduce a new variable x_5 = (CPU load)/(network traffic), or even x_6 = (CPU load)^2/(network traffic).

15.7 Multivariate Gaussian Distribution

One possible extension: multivariate Gaussians. Consider monitoring machines in a data center, with CPU load and memory use statistics. Suppose in the test set we have a clear outlier that doesn't look too bad in any single dimension. The trouble is that the probability isosurfaces of the per-feature model are axis-aligned (circles here), not tilted ellipses, so the correlation is missed.

So we introduce multivariate Gaussian distributions: suppose x ∈ R^n, and instead of modeling p(x_1), p(x_2), etc separately, we model p(x) all at once, with parameters µ ∈ R^n and Σ ∈ R^{n×n}. So we have

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)   (130)

Suppose we have

\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}; \quad \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}   (131)

This gives a standard distribution with circular isosurfaces. What happens if we take

\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}; \quad \Sigma = \begin{pmatrix} 0.6 & 0 \\ 0 & 0.6 \end{pmatrix}   (132)

This gives narrower Gaussians. We can also make the distribution wider by increasing the diagonal elements of Σ.

We can also get ellipsoidal isosurfaces by taking Σ with diagonal entries different from each other. Finally, you can get skewed Gaussian distributions by changing the off-diagonal elements: these say that, eg, x_1 is correlated with x_2, which means that when one rises, the other does as well (as long as the off-diagonal entries are positive). If the off-diagonal entries are negative, we get inverse correlation.

We can also shift the mean around by changing µ, of course.

15.8 Anomaly Detection Using the Multivariate Gaussian Distribution

Last time we looked at the multivariate Gaussian distribution. We saw

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)   (133)



and we do parameter fitting by

\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}, \quad \Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)(x^{(i)} - \mu)^T   (134)

where our dataset is {x^{(1)}, ..., x^{(m)}}. The general outline is:

1. Fit the model p(x) by setting µ and Σ as above.
2. Given a new example x, compute p(x).
3. Flag an anomaly if p(x) < ε.
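A minimal Octave sketch of this outline (my own; X is the m x n training matrix, x a new example as an n x 1 column vector):

% Multivariate Gaussian fit (equation 134) and density (equation 133).
[m, n] = size(X);
mu = mean(X)';                                   % n x 1 mean vector
Xc = X - mu';                                    % mean-centered data
Sigma = (1/m) * (Xc' * Xc);                      % n x n covariance matrix
p = (2*pi)^(-n/2) * det(Sigma)^(-1/2) * ...
    exp(-0.5 * (x - mu)' * (Sigma \ (x - mu)));
is_anomaly = (p < epsilon);                      % epsilon chosen on the CV set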

It turns out that if the Gaussians are axis-aligned (the off-diagonals of Σ are zero), then the multivariate model and the per-feature model discussed above end up being the same. This is the independence assumption!

When would you use the original model vs the multivariate Gaussian model?

1. Original:
   (a) Manually create features to capture anomalies where x_1 and x_2 take unusual combinations of values.
   (b) Computationally cheaper; scales better to large n (eg n = 10,000 to n = 100,000).
   (c) Works OK even if m is small.
2. Multivariate:
   (a) Automatically captures correlations between features.
   (b) Computationally much more expensive.
   (c) Must have m > n, or else Σ is noninvertible.

Ng would only use the multivariate model when m ≫ n, say m ≥ 10n. If Σ is singular (noninvertible), you probably have:

1. not enough training examples relative to the features, or
2. redundant (linearly dependent) features.

16 Recommender Systems

16.1 Problem Formulation

Many websites try to recommend new products based on other products consumers have consumed. Recommender systems are a relatively small part of academia, but they are extremely common in industry.

Features have a huge effect on performance.

Suppose we have a table with movies down the rows, and different users rate movies 0-5 stars. One interesting thing is that we might be missing data. We introduce some notation:

1. n_u: number of users
2. n_m: number of movies
3. r(i, j) = 1 if user j has rated movie i
4. y^{(i,j)}: the rating user j gave to movie i, defined only if r(i, j) = 1

The job of a recommender system is to predict the values for which r(i, j) = 0.

16.2 Content-Based Recommendations

We could define features like:

1. x_1: romance
2. x_2: action

Then for each user j, we could learn a linear regression parameter vector θ^{(j)} ∈ R^3. Predict user j as rating movie i with (θ^{(j)})^T x^{(i)} stars.

To learn θ^{(j)} we just calculate

\min_{\theta^{(j)}} \frac{1}{2m^{(j)}} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^n (\theta_k^{(j)})^2   (135)

But consider it without the m^{(j)} factor:

\min_{\theta^{(j)}} \frac{1}{2} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^n (\theta_k^{(j)})^2   (136)

Then to learn θ^{(1)}, ..., θ^{(n_u)}, minimize

\min_{\theta^{(1)},...,\theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \sum_{j=1}^{n_u} \frac{\lambda}{2} \sum_{k=1}^n (\theta_k^{(j)})^2   (137)

Then our optimization algorithm could be the typical gradient descent steps (the first for the unregularized k = 0, the second for k ≥ 1):

\theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)}   (138)

\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)   (139)

This is basically just applied linear regression.

One caveat: it might be hard or impossible to obtain the features we described.

16.3 Collaborative Filtering

We had assumed that we had the features. How would we get values for the romance/action features?

One possible way: suppose you had the partial ratings as before, but also had each user's subjective preferences for various kinds of movies. Then we can infer the content feature vectors of the movies. That is, we would like to learn x^{(i)} such that (θ^{(j)})^T x^{(i)} approximately matches user j's rating of movie i, for the cases where they have rated it.

More formally: given users' preferences θ^{(1)}, ..., θ^{(n_u)}, to learn x^{(i)} we want to

\min_{x^{(i)}} \frac{1}{2} \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^n (x_k^{(i)})^2   (140)

Of course, we also want to find the features for every movie 1, ..., n_m, so the actual optimization problem is

\min_{x^{(1)},...,x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j: r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^n (x_k^{(i)})^2   (141)

So collaborative filtering says that if we have the x^{(i)} (i = 1, ..., n_m) and the movie ratings (r(i, j) and y^{(i,j)}), we can estimate θ^{(1)}, ..., θ^{(n_u)}.

Given the parameters θ^{(1)}, ..., θ^{(n_u)}, we can estimate x^{(1)}, ..., x^{(n_m)}.

So we can guess θ, then use it to get x, then iterate this many times until we converge.

Key point here: every user is helping every other user.


16.4 Collaborative Filtering Algorithm

It turns out, though, that this algorithm kinda sucks. We can come up with a better one.

We have our two steps as above:

1. Given the x^{(i)} (i ∈ {1..n_m}), estimate the θ^{(j)} (j ∈ {1..n_u}).
2. Given the θ^{(j)} (j ∈ {1..n_u}), estimate the x^{(i)} (i ∈ {1..n_m}).

(Both use the same objective function, but minimize over different variables and regularization terms.)

One option: go back and forth. A smarter idea: put both in the same objective function, and minimize over both simultaneously:

J(x^{(1)},...,x^{(n_m)}, \theta^{(1)},...,\theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^n (x_k^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)})^2   (142)

and then solve the minimization problem

\min_{x^{(1)},...,x^{(n_m)},\ \theta^{(1)},...,\theta^{(n_u)}} J(x^{(1)}, ..., x^{(n_m)}, \theta^{(1)}, ..., \theta^{(n_u)})   (143)

Previously, we used x_0 = 1 and x ∈ R^{n+1}; here we take x ∈ R^n and θ ∈ R^n.

So the algorithm:

1. Initialize x^{(i)} and θ^{(j)} to small random values.
2. Minimize J(x^{(1)}, ..., x^{(n_m)}, θ^{(1)}, ..., θ^{(n_u)}) using gradient descent or an advanced optimization algorithm.
3. For a user with parameters θ and a movie with (learned) features x, predict a star rating of θ^T x.

We need to initialize with small random values to do symmetry breaking, to ensure that we don't learn the same values multiple times.

16.5 Vectorization: Low Rank Matrix Factorization

We can group all ratings into an n_m × n_u matrix Y. Then the predicted ratings are given by a matrix whose (i, j) entry is (θ^{(j)})^T x^{(i)}. That is, we can define

X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(n_m)})^T \end{bmatrix}, \quad \Theta = \begin{bmatrix} (\theta^{(1)})^T \\ (\theta^{(2)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{bmatrix}   (144)

Then the matrix of predicted ratings can be calculated as XΘ^T! This formulation is called low rank matrix factorization.

To find related movies: for each product i, we learn a feature vector x^{(i)} ∈ R^n. So x could capture romance, action, comedy, etc. The next question: how do we find movies j related to movie i? One measure:

small \|x^{(i)} - x^{(j)}\| \implies movies i and j are similar   (145)
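In Octave, the vectorized prediction and the similarity measure are one-liners (a sketch with assumed names X and Theta as defined in (144)):

% Predicted-ratings matrix: entry (i,j) is (theta^(j))' * x^(i).
% X is nm x n (movie features), Theta is nu x n (user parameters).
Ypred = X * Theta';

% Movie similarity per (145): smaller distance means more similar.
i = 1; j = 2;                        % example pair of movies
d = norm(X(i, :) - X(j, :));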

16.6 Implementation Detail: Mean Normalization

One last implementation detail: we should probably normalize the ratings with the mean (and possibly standard deviation). This is important since the algorithm will predict 0 for a user with no ratings, which is not useful unless 0 corresponds to a perfectly average rating for that movie; mean normalization arranges exactly that.

17 Large-Scale Machine Learning

Much larger datasets are responsible for a huge amount of improvement. We saw earlier that training set size often wins almost regardless of the algorithm.

17.1 Learning with Large Datasets

Suppose the training set size is m = 100,000,000, and we want to do the gradient update

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}   (146)

Before we do this, we should ask: why not just use 1000 examples? One way to check is to plot a learning curve for a range of values of m and verify that the algorithm has high variance when m is small. We know that if we saw a high bias case instead, we could add features or hidden nodes (with neural nets) to hopefully get to a low-bias setting, and then add more data!

17.2 Stochastic Gradient Descent

Suppose you have linear regression with gradient descent. The problem with regular (that is, 'batch') gradient descent is that the sum is very expensive to calculate; it is very slow!

We can reformulate by taking

\mathrm{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2   (147)

and

J_{\mathrm{train}}(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{cost}(\theta, (x^{(i)}, y^{(i)}))   (148)

As noted before, the batch gradient step is (146), but stochastic gradient descent works like this:

1. Randomly shuffle (reorder) the training examples.
2. Repeat:
   (a) for i = 1, ..., m:

\theta_j := \theta_j - \alpha (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \quad \forall j   (149)

Key point: this doesn't actually converge to the minimum, but that's okay, since it wanders around a point near the optimum.

Typically the outer loop will only be run 1-10 times, in contrast to the great many batch gradient descent steps we'd otherwise have to take.
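A minimal Octave sketch of stochastic gradient descent for linear regression, per (149); X (with an x_0 = 1 column), y, and alpha are assumed inputs:

% Stochastic gradient descent for linear regression, equation (149).
theta = zeros(size(X, 2), 1);
for epoch = 1:10                         % outer loop: typically 1-10 passes
  idx = randperm(size(X, 1));            % randomly shuffle the examples
  for i = idx
    h = X(i, :) * theta;                 % h_theta(x^(i))
    theta = theta - alpha * (h - y(i)) * X(i, :)';
  end
end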

17.3 Mini-Batch Gradient Descent

So far:

1. Batch gradient descent: use all m examples in each iteration.
2. Stochastic gradient descent: use 1 example in each iteration.

There's also another option: mini-batch gradient descent, where we use b examples in each iteration. That is, instead of summing over the huge dataset, we only sum over b (say, b = 10). This looks like:

1. Say b = 10, m = 1000. Then:
2. Repeat:
   (a) for i = 1, 11, 21, ..., 991:

\theta_j := \theta_j - \alpha \frac{1}{10} \sum_{k=i}^{i+9} (h_\theta(x^{(k)}) - y^{(k)}) x_j^{(k)} \quad \forall j   (150)

This will have a positive effect if you have a good vectorized implementation.



17.4 Stochastic Gradient Descent Convergence

How do you tune α and ensure convergence? During learning, compute cost(θ, (x^{(i)}, y^{(i)})) before updating θ using (x^{(i)}, y^{(i)}). Every (eg) 1000 iterations or so, plot cost(θ, (x^{(i)}, y^{(i)})) averaged over the last 1000 examples processed by the algorithm.

Three potential issues:

1. Decreasing with oscillation: okay!
2. Steady oscillation with no decrease: decrease the learning rate, or look for a bug.
3. Increasing: decrease the learning rate.

If you want stochastic gradient descent to actually converge, you can slowly decrease the learning rate so that it settles into the proper minimum, eg via

\alpha_k = \frac{c_1}{k + c_2}   (151)

where k is the iteration number and c_1, c_2 are constants to tune.

17.5 Online Learning

Online learning works for applications where we have a "continuous flood" of data, eg from a continuous stream of visitors.

One application: suppose you have a shipping service website where a user comes and specifies an origin and destination, you offer to ship their package for some asking price, and users sometimes choose to use your shipping service (y = 1) and sometimes not (y = 0).

We might also suppose that we have some features x which capture properties of the user, the origin/destination, and the asking price. We want to learn p(y = 1 | x; θ) to optimize the price. Our website will run something like:

Repeat forever:

1. Get (x, y) corresponding to a user.
2. Update θ using (x, y):

\theta_j := \theta_j - \alpha (h_\theta(x) - y) x_j \quad \forall j   (152)

One interesting thing about this is that it can adapt to changing user preferences.

Another one: product search (learning to search). A user searches for "android phone 1080p camera". We have 100 phones in the store, and will return 10 results. In this case:

1. x is the features of the phone: how many words in the user's query match the name of the phone, how many words in the query match the description of the phone, etc.
2. y = 1 if the user clicks on the link, y = 0 otherwise.
3. We learn p(y = 1 | x; θ).
4. We can use this to show the user the 10 phones they're most likely to click on.

Other examples: choosing special offers to show the user, customizing the selection of news articles, product recommendation, etc.

Alternative: you can run the website for a few days, save away the data, and train offline.

17.6 Map Reduce and Data Parallelism

We might have problems too big to fit on a single computer. Here we extend our methods to fit.

Suppose we want to run batch gradient descent with m = 400 (in a real problem, think millions). The map-reduce idea comes from Jeff Dean and Sanjay Ghemawat. For map-reduce, we split the dataset into 4 pieces and send each piece off to a different machine. That is,

1. Machine 1 uses (x^{(i)}, y^{(i)}), i = 1, ..., 100 to calculate

temp_j^{(1)} = \sum_{i=1}^{100} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}

2. ... (machines 2-4 do the same on their quarters of the data)

Then a central server combines them as

\theta_j := \theta_j - \alpha \frac{1}{400} \left( temp_j^{(1)} + temp_j^{(2)} + temp_j^{(3)} + temp_j^{(4)} \right)   (153)

The key question is: can the learning algorithm be expressed as computing sums of functions over the training set? We can, for example, use advanced optimization with logistic regression in a map-reduce framework.

One other application: we can use map-reduce on multi-core machines. In this application, we don't have to care about network latency.

Some numerical linear algebra libraries can apply parallelization automatically.

18 Application Example: Photo OCR

18.1 Problem Description and Pipeline

What is the photo OCR problem? Optical Character Recognition. We'd like to get computers to understand the content of pictures, like printed words.

There are several parts:

1. Where is the text?
2. What does the text represent?

OCR from scanned documents is relatively easy nowadays. Photo OCR is harder.

The pipeline looks like this:

1. Text detection: find where there is text in the image.
2. Character segmentation: try to segment the text region into distinct characters.
3. Character classification: try to turn the segments into categories representing characters.

You can get even more sophisticated, eg with spelling correctors.

18.2 Sliding Windows

Pedestrian detection is easier than text detection, mostly because the aspect ratio is approximately the same for different pedestrians.

We would apply supervised learning for pedestrian detection by collecting many 82x36-pixel images, with positive and negative examples. We might use a neural network or another classifier to decide whether an image patch has a pedestrian in it or not. The big idea is that we take a rectangular patch and run it through the image classifier. Then we slide the rectangle over a bit and run the next patch through the classifier. We do this over and over and over. Here "a bit" might be 1-32 pixels, depending on the setting. So we run all these different patches through the neural net. Next we keep trying larger and larger patches (scaling each one down to 82x36 before classifying). Eventually, this should give a list of pedestrians.

For text detection, we again use supervised learning. We take the positive set to be random image patches with text, and the negative set to be non-text patches.

Once we've done this, we get a heat map of likely regions: white shows text and gray shows the intermediate probability from the text classifier. Next we apply an expansion operator, eg: is this pixel within 5px or 10px of a white pixel? (This is necessary to ensure that our bounding boxes include the spaces between contiguous characters.) The heat map is then white wherever there is text in the original image. Finally, we apply box-finding to the contiguous white regions, the connected regions. Then we select the ones with aspect ratios that look roughly correct; the aspect ratio matters because most text is wider than it is tall.



Back to character segmentation: take the positive examples to be images of splits between two characters, and the negative examples to be whole, distinct characters. Then we train a classifier (neural network, etc). Next, we run the classifier on a sliding window of boxes over our text; the positively-labeled examples will be the splits, and the negatively-labeled examples will be the places where we shouldn't split the text.

To recap, our photo OCR pipeline has three phases:

1. Text detection
2. Character segmentation
3. Character classification: a typical neural network or other classifier.

18.3 Getting Lots of Data and Artificial Data

One of the best ways to get high performance: take a low-bias algorithm and train it on tons and tons of data. So how do we get the huge datasets? Two ideas:

1. Artificial data synthesis
2. Dataset augmentation

Consider photo OCR. We can take huge font libraries and paste characters against random backgrounds. Maybe you also want to apply some scaling/affine/etc transformations to the training data. This gives us a basically unlimited supply of training data.

We can also synthesize data by inducing distortions. For example, each training example might give rise to 16+ different distortions. But it's important to use distortions that would arise in practice. That is, DO NOT add purely random/meaningless noise; that won't help (usually).

Audio/speech recognition example, from original data:

1. audio on a bad cellphone connection
2. noisy background: crowd
3. noisy background: machinery

Don't just throw in exact duplicates; you'll just end up training the same parameters, half as quickly.

A couple of final points:

1. Make sure you have a low-bias classifier! (eg, plot learning curves)
2. Ask: "How much work would it be to get 10x as much data as we currently have?" Usually this will make the machine learning algorithm do much better.
   (a) Artificial data synthesis
   (b) Collect/label it yourself
   (c) Crowd-source, eg with Amazon Mechanical Turk

18.4 Ceiling Analysis: What Part of the Pipeline to Work on Next

Your time is key! (Or your team's time.) Don't work on stuff that won't help! How do you pick which parts of the pipeline to work on?

Suppose the overall system has 72% accuracy (or another metric). Then consider what each later stage would achieve if it received perfect data from the prior stages:

1. Given perfect text detection: 89%
2. Given perfect text detection and character segmentation: 90%
3. Given perfect text detection, character segmentation, and character recognition: 100%

This tells us there is approximately 17%, 1%, and 10% to be gained from working on text detection, character segmentation, and character recognition, respectively.

Another ceiling analysis example. Pipeline:

1. Camera image
2. Preprocess (remove background)
3. Face detection
   (a) eye segmentation
   (b) nose detection
   (c) mouth segmentation
4. Logistic regression

Then we analyze the gains from perfecting each stage in the pipeline.

Don't spend human-years on a component which will not improve performance! Don't trust your gut on which component to work on; do a ceiling analysis every time.

19 Conclusion

19.1 Summary and Thank You

Basic summary:

1. Supervised learning: linear regression, logistic regression, neural networks, SVMs. (Have (x^{(i)}, y^{(i)}).)
2. Unsupervised learning: K-means, PCA, anomaly detection. (Have x^{(i)}.)
3. Special applications/topics: recommender systems, large-scale machine learning.
4. Advice on building a machine learning system: bias/variance, regularization, deciding what to work on next, evaluation of learning algorithms, learning curves, error analysis, ceiling analysis.

You are now an expert in machine learning! Hooray!
