
Lecture 6 - Linear Classification

Fall 2010


Classification

- Today, we are going to talk about classifying, or separating, image patches.

- Classification is the Swiss Army knife of computer vision.

- Many important vision problems can be reduced to “Is this a ___ or not?”


Basic Setup

- We want to separate the X’s and the O’s.

- Today, we will see how to solve this (seemingly) simple task mathematically.

[Figure: scatter plot of X’s and O’s in the two-dimensional feature plane]


Basic Vocabulary

- We’ll call each thing that we are classifying a point.

- Each point is described by a set of numbers that we call features.

- It is your job to decide on the features. You need to pick features that help you do your job.

- For instance, if you are looking for edges, you’ll probably want to look at the response of different derivative filters.

- In the graph below, every point is described by two features.

- The point at index i in the dataset will be described as x_i.

[Figure: the same scatter plot; every point is described by its two feature values]


Separating Points Linearly

- In building mathematical models for classifying, we are going to focus on dividing these points with a straight line.

- This is called linear classification.

[Figure: the scatter plot with a straight line dividing the X’s from the O’s]


Expressing Linear Separation Mathematically

- Given a single point x = (x, y), we can express the classification of the point as

  sign(ax + by + c)

  where a, b, and c are constants that define a line. We’ll have to choose these somehow.

- This function will return +1 if ax + by + c is positive and −1 otherwise.
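To make the decision rule concrete, here is a minimal sketch in Python (the lecture itself shows no code; the function name and the line constants below are made up for illustration):

    def classify(x, y, a, b, c):
        # Which side of the line ax + by + c = 0 does the point fall on?
        return 1 if a * x + b * y + c > 0 else -1

    # Hypothetical line constants a = 1, b = -1, c = 1 (the line x - y + 1 = 0)
    print(classify(3.0, 1.0, 1.0, -1.0, 1.0))   # prints 1
    print(classify(-3.0, 1.0, 1.0, -1.0, 1.0))  # prints -1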


How this Translates Graphically

- Effectively, we are projecting every point onto a line.

- Every point projects to some point on the line. The sign of the location along the line determines the classification of the point.

[Figure: points projected onto a line, labeled with their signed locations: −0.9, 3.5, −2.6, 1.5, −2.3, 2.1]


How can we find the separating line?

- This separating line can be found by looking at the line

  ax + by + c = 0

[Figure: the scatter plot with the separating line drawn through it]


How do we get the parameters of this line?

- We’ll find the parameters using a machine learning approach.

- We’ll form a training set of examples, along with the correct classification.

- We’ll find the line that best separates the training examples.

- Now, we have to form a set of mathematical steps for finding the line that “best separates” the training set.


Optimizing the parameters of the line

- Notice that the points that are the farthest from the red separating line have the largest response.

- In some sense, we can be more confident in a point’s classification as the point gets farther from the separating line.

- So, the bigger the magnitude of a response, the more confident in the classification we can be.

[Figure: the projection plot again; points farther from the separating line have larger-magnitude responses: −0.9, 3.5, −2.6, 1.5, −2.3, 2.1]


Looking at the confidence in classification

- We can translate this response into a probability.

- We will use the logistic function to transform the response into a probability of the correct classification.

- We will assume that each point will be labeled with a label l; l can take values +1 or −1.

  P[l = +1 | x] = 1 / (1 + exp(−(ax + by + c)))
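A minimal sketch of this probability in Python (continuing the illustration above; `prob_positive` is a hypothetical name, not from the slides):

    import math

    def prob_positive(x, y, a, b, c):
        # P[l = +1 | x]: the logistic function applied to the linear response
        return 1.0 / (1.0 + math.exp(-(a * x + b * y + c)))

A large positive response maps to a probability near 1, a large negative response to a probability near 0, and a point exactly on the line to 0.5.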


Finding the line parameters

- Now, for any set of line constants, we can find out the probability assigned to the correct label of each item in the training set:

  ∏_{i=1}^{N} P[l_i | x_i] = ∏_{i=1}^{N} 1 / (1 + exp(−l_i(ax_i + by_i + c)))

- We’ve inserted l_i into the exponent because, if l_i = −1,

  P[l = −1 | x] = 1 / (1 + exp(ax + by + c))

- We multiply the probabilities because we believe the points are drawn independently.

- Note that for each item, this number tells us how much the model defined by that particular line believes that the ground-truth answer is the right answer.
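As a sketch, this product can be computed directly, assuming the training set is stored as a list of (x, y) pairs with ±1 labels (a made-up data layout for illustration):

    import math

    def likelihood(a, b, c, points, labels):
        # Product over the training set of P[l_i | x_i]
        total = 1.0
        for (x, y), l in zip(points, labels):
            total *= 1.0 / (1.0 + math.exp(-l * (a * x + b * y + c)))
        return total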


Finding the line

∏_{i=1}^{N} P[l_i | x_i] = ∏_{i=1}^{N} 1 / (1 + exp(−l_i(ax_i + by_i + c)))

- This is also called the likelihood of the data.

- Ideally, this likelihood should be as close to 1 as possible.

- We can find the parameters of the line by looking for the line parameters that maximize this likelihood.

- In other words, find the set of line parameters that make the right answers have as high a probability as possible.


Finding the line

∏_{i=1}^{N} P[l_i | x_i] = ∏_{i=1}^{N} 1 / (1 + exp(−l_i(ax_i + by_i + c)))

- Before going on, we are going to do a quick math trick.

- Because the log (natural logarithm, or ln) function is monotonically increasing (i.e. it is always increasing), the parameters that maximize the above equation will also maximize

  log(∏_{i=1}^{N} P[l_i | x_i]) = ∑_{i=1}^{N} −log(1 + exp(−l_i(ax_i + by_i + c)))


Finding the line

log(∏_{i=1}^{N} P[l_i | x_i]) = ∑_{i=1}^{N} −log(1 + exp(−l_i(ax_i + by_i + c)))

- Similarly, we can multiply this equation by −1 to change from a maximization problem to a minimization problem,

- leading to a function that we will call a “loss function” or “cost function”, denoted L:

  L = ∑_{i=1}^{N} log(1 + exp(−l_i(ax_i + by_i + c)))

- Our goal is to find the line parameters that minimize L.
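A sketch of L under the same assumed data layout as before (math.log1p(t) computes log(1 + t); using it is an implementation choice, not something from the lecture):

    import math

    def log_loss(a, b, c, points, labels):
        # L = sum_i log(1 + exp(-l_i * (a*x_i + b*y_i + c)))
        return sum(math.log1p(math.exp(-l * (a * x + b * y + c)))
                   for (x, y), l in zip(points, labels))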


Minimizing L

- We can minimize L by using its gradient:

  ∇L = [ ∂L/∂a, ∂L/∂b, ∂L/∂c ]^T

- The gradient can be viewed as a vector that points in the direction of steepest ascent.

- So, the negative gradient points in the direction of steepest descent.
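Differentiating one term of L, with z_i = l_i(ax_i + by_i + c), gives ∂/∂a log(1 + exp(−z_i)) = −l_i x_i / (1 + exp(z_i)), and likewise for b (with y_i) and c (with 1). A sketch of the full gradient under the same assumed data layout:

    import math

    def grad_loss(a, b, c, points, labels):
        # Returns (dL/da, dL/db, dL/dc)
        ga = gb = gc = 0.0
        for (x, y), l in zip(points, labels):
            # This weight is near 1 for badly misclassified points
            # and near 0 for confidently correct ones
            w = 1.0 / (1.0 + math.exp(l * (a * x + b * y + c)))
            ga -= l * x * w
            gb -= l * y * w
            gc -= l * w
        return ga, gb, gc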


An algorithm for minimizing L

- This leads to a simple algorithm for optimizing the line parameters.

- We’ll create a vector

  p = [a, b, c]^T

- The steps are:

  1. Initialize p to some value p0.
  2. Repeat these steps:
     2.1 Calculate ∇L using the current value of p (the value of the gradient depends on p!)
     2.2 p ← p + η(−∇L)

- We go in the direction of the negative gradient because that is the direction of steepest descent.
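Putting the pieces together, a sketch of the descent loop, reusing the hypothetical grad_loss, points, and labels from the earlier sketches; the step size η = 0.1 and the fixed iteration count are assumptions, not values from the lecture:

    # Step 1: initialize p = (a, b, c) to some value p0
    a, b, c = 0.0, 0.0, 1.0
    eta = 0.1  # step size; assumed value

    # Step 2: repeatedly step in the direction of the negative gradient
    for _ in range(1000):
        ga, gb, gc = grad_loss(a, b, c, points, labels)
        a, b, c = a - eta * ga, b - eta * gb, c - eta * gc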

An Example

[Figures: a sequence of slides stepping through gradient descent iterations on an example, the estimate moving downhill one update at a time]

Learning Line parameters

- How does this relate to learning line parameters?

- We differentiate L with respect to the line parameters and optimize.

- The following few slides show how the line changes as the parameters change.

An Example

[Figures: a sequence of slides showing the separating line changing as the parameters are updated]

How does this get used in computer vision

- Step 1 - Define a labeling task

- Step 2 - Get a bunch of training examples, with ground-truth labels

- Step 3 - Train the classifier

- Step 4 - Get a bunch of test examples, also with ground-truth labels

- Step 5 - Test the classifier on the test examples


Let’s talk more about step 3

- Each training example is numerically represented by a set of features.

- These features are stored in a vector. The vector for each training example will be denoted x_i, where the i represents the i-th training example.

- Each training example also has a label, l_i.

- If we denote the weight vector as w, then we can rewrite the log-loss criterion as

  L = ∑_{i=1}^{N} log(1 + exp(−l_i(w^T x_i)))    (1)

- (In the earlier example, w = [a, b, c]^T.)
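In this vector form the loss vectorizes naturally; a sketch using NumPy (an assumed dependency), where X is an N×d matrix with one feature vector per row and labels is a vector of ±1 values:

    import numpy as np

    def log_loss_vec(w, X, labels):
        # L = sum_i log(1 + exp(-l_i * w^T x_i)), computed for all i at once
        margins = labels * (X @ w)
        return np.sum(np.log1p(np.exp(-margins)))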


Is it really the same?

- Compare

  L = ∑_{i=1}^{N} log(1 + exp(−l_i(ax_i + by_i + c)))

- with

  L = ∑_{i=1}^{N} log(1 + exp(−l_i(w^T x_i)))    (2)

- What’s missing?

- In the upper equation, c is a constant that is not multiplied by any feature.

- This is called the bias term.


Function of the bias term

- Going back to this example:

[Figure: the projection plot again, with signed locations −0.9, 3.5, −2.6, 1.5, −2.3, 2.1]


How do we implement it?

- It’s actually quite easy.

- Make one feature be equal to all ones; the weight learned for that feature then plays the role of the bias c (see the sketch below).
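A sketch of the trick with NumPy (assumed): append a column of ones to the feature matrix, so the last entry of w acts as c.

    import numpy as np

    X = np.array([[3.0, 1.0],      # made-up 2-D feature vectors
                  [-2.0, 4.0]])
    # Append a constant-1 feature to every example
    X_bias = np.hstack([X, np.ones((X.shape[0], 1))])
    # X_bias is now [[ 3., 1., 1.], [-2., 4., 1.]];
    # with w = [a, b, c], w^T x_bias = a*x + b*y + c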


How does this get used in computer vision

- Step 1 - Define a labeling task

- Step 2 - Get a bunch of training examples, with ground-truth labels

- Step 3 - Train the classifier

- Step 4 - Get a bunch of test examples, also with ground-truth labels

- Step 5 - Test the classifier on the test examples


Steps 4 and 5 - Testing

- We want to know how well the system will perform on data that it has never seen.

- Overfitting can be a problem, especially if there isn’t much training data.

- In classification, we usually measure the percentage of points correctly labeled (see the sketch below).
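A sketch of that measurement under the same assumed NumPy setup, with hypothetical arrays X_test (test features, ones column included) and l_test (±1 labels):

    import numpy as np

    def accuracy(w, X_test, l_test):
        # Fraction of test points whose predicted sign matches the true label
        predictions = np.sign(X_test @ w)
        return float(np.mean(predictions == l_test))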


From here

- The next problem set will involve implementing a learning-based edge detector.