Cupcakes versus muffins
Support vector machines

Arthur J. Parzygnat
University of Connecticut SIGMA
January 26, 2018




Outline

1 Cupcakes and muffins
   What's the difference?

2 Hyperplanes
   Parametrizing hyperplanes
   Marginal hyperplanes
   The definition of a support vector machine

3 Existence and uniqueness of SVMs
   Enlarging margins
   Existence of SVMs

4 A simple example of an SVM


Cupcakes and muffins What’s the difference?

What’s the difference?


Hyperplanes Parametrizing hyperplanes

Parametrizing hyperplanes

Lemma 1

Let H ⊆ R^n be a hyperplane. Then there exists a vector ~w ∈ R^n \ {~0} and a real number c ∈ R such that H is the solution set of 〈~w, ~x〉 − c = 0.

Proof.

Let ~h ∈ H. Then H − ~h is an (n − 1)-dimensional subspace of R^n. Hence, (H − ~h)⊥ is a one-dimensional subspace spanned by some unit vector û. Because span{û} is perpendicular to H, there exists an a ∈ R such that aû ∈ H. Then H is the solution set of 〈û, ~x〉 − a = 0.

[Figure: the hyperplane H, a point ~h ∈ H, the unit normal û, and the point aû ∈ H.]


Hyperplanes Marginal hyperplanes

Marginal planes

Note that a hat over a vector signifies that it is a unit vector (has magnitude 1). There are many ways to describe hyperplanes, and it may not be obvious why the description by a vector and a number is preferable. Furthermore, notice that this parametrization has redundancies: for example, (λ~w, λc) and (~w, c) describe the same hyperplane for all λ ∈ R \ {0}.

Definition 2

Let ~w ∈ R^n \ {~0} and c ∈ R with associated hyperplane H given by the solution set of 〈~w, ~x〉 − c = 0. The marginal planes H+ and H− associated to (~w, c) are the solution sets of 〈~w, ~x〉 − c = 1 and 〈~w, ~x〉 − c = −1, respectively, i.e.

H± := {~x ∈ R^n : 〈~w, ~x〉 − c = ±1}.


Hyperplanes Marginal hyperplanes

Marginal planes example

Example 3

In R², if ~w = (1/3)[3, 1]ᵀ and c = 4, then the marginal planes look like the figure on the right, since they are described by

H : (1/3)(3x + y) − 4 = 0
H+ : (1/3)(3x + y) − 4 = 1
H− : (1/3)(3x + y) − 4 = −1

[Figure: the parallel lines H−, H, H+ with the normal vector ~w.]

Notice that the vector (c/‖~w‖²)~w = (6/5)[3, 1]ᵀ lies on the plane H.
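These numbers can be checked directly in a few lines; a minimal sketch (the variable and function names are my own):

```python
import math

# Example 3: w = (1/3)*(3, 1) and c = 4, so w = (1, 1/3).
w = (1.0, 1.0 / 3.0)
c = 4.0

def side(w, c, x):
    """Evaluate <w, x> - c at the point x."""
    return w[0] * x[0] + w[1] * x[1] - c

# The claimed point (c / ||w||^2) * w = (6/5)*(3, 1) = (18/5, 6/5):
norm_sq = w[0] ** 2 + w[1] ** 2
p = (c / norm_sq * w[0], c / norm_sq * w[1])
print(p)              # close to (3.6, 1.2)
print(side(w, c, p))  # close to 0, so p lies on H
```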


Hyperplanes Marginal hyperplanes

The distance between marginal hyperplanes

Lemma 4

Let (~w, c) ∈ (R^n \ {~0}) × R describe a hyperplane H. The perpendicular distance between H and H± is 1/‖~w‖.

Proof.

Let ~x+ ∈ H+ and ~x ∈ H. The orthogonal distance between H and H+ is given by

〈~x+ − ~x, ~w/‖~w‖〉 = (1/‖~w‖)(〈~w, ~x+〉 − 〈~w, ~x〉) = (1/‖~w‖)((1 + c) − c) = 1/‖~w‖

by the definition of H and H+ in terms of (~w, c). A similar calculation holds for H−.

[Figure: ~x ∈ H, ~x+ ∈ H+, and the projection of ~x+ − ~x onto ~w/‖~w‖.]
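Lemma 4 can also be verified numerically for a sample hyperplane; a sketch (it uses the explicit points (c/‖~w‖²)~w ∈ H and ((c+1)/‖~w‖²)~w ∈ H+, an easy consequence of the definitions):

```python
import math

# A numeric check of Lemma 4 for one arbitrary hyperplane (w, c).
w = (3.0, 1.0)
c = 4.0
norm = math.hypot(w[0], w[1])

# (c/||w||^2) w lies on H and ((c+1)/||w||^2) w lies on H+.
x  = (c / norm**2 * w[0], c / norm**2 * w[1])
xp = ((c + 1) / norm**2 * w[0], (c + 1) / norm**2 * w[1])

# The orthogonal distance <x+ - x, w/||w||> should equal 1/||w||.
dist = ((xp[0] - x[0]) * w[0] + (xp[1] - x[1]) * w[1]) / norm
print(abs(dist - 1 / norm) < 1e-12)  # True
```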


Hyperplanes Marginal hyperplanes

The margin and margin width

Definition 5

Let (~w, c) ∈ (R^n \ {~0}) × R describe a hyperplane H. The convex region between H+ and H− is called the margin of (~w, c). The orthogonal distance between H+ and H−, which is given by 2/‖~w‖, is called the margin width of (~w, c).

Even though (~w, c) can be scaled to (λ~w, λc) to give the same H for any λ ∈ R \ {0}, notice that the marginal planes are different. This is because the margin width has scaled by a factor of 1/|λ|.


Hyperplanes Marginal hyperplanes

Parametrizing hyperplanes and their margins

Lemma 6

An ordered pair of distinct parallel hyperplanes (H−, H+) in R^n determines a unique (~w, c) ∈ (R^n \ {~0}) × R whose marginal planes agree with H− and H+.

Proof.

Let ~x+ ∈ H+ and pick û ∈ (H+ − ~x+)⊥ such that if λû ∈ H− and μû ∈ H+, then λ < μ (i.e. û is perpendicular to H+ and points from H− to H+). Also, let ~x− ∈ H−. Since the orthogonal separation between the planes H+ and H− is 〈~x+ − ~x−, û〉, set

~w := (2/〈~x+ − ~x−, û〉) û   and   c := 〈û, ~x+ + ~x−〉 / 〈û, ~x+ − ~x−〉.

Then (~w, c) has H+ and H− as its marginal planes.

[Figure: parallel planes H− and H+, points ~x± on them, and the difference ~x+ − ~x−.]
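The construction in this proof can be checked numerically; a sketch in R² with example lines of my own choosing:

```python
import math

# A numeric sketch of Lemma 6's construction:
# H- : x + 2y = 0 and H+ : x + 2y = 5, with unit normal u pointing from H- to H+.
u = (1 / math.sqrt(5), 2 / math.sqrt(5))
x_plus  = (5.0, 0.0)    # a point of H+
x_minus = (0.0, 0.0)    # a point of H-

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

diff = (x_plus[0] - x_minus[0], x_plus[1] - x_minus[1])
gap  = dot(diff, u)                      # orthogonal separation of the planes

w = (2 / gap * u[0], 2 / gap * u[1])
c = dot(u, (x_plus[0] + x_minus[0], x_plus[1] + x_minus[1])) / gap

# (w, c) should have H+ and H- as its marginal planes:
print(dot(w, x_plus) - c, dot(w, x_minus) - c)   # close to 1.0 and -1.0
```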


Hyperplanes Marginal hyperplanes

Parametrizing hyperplanes and their margins

Corollary 7

There is a 1-1 correspondence between the set of (ordered) pairs of parallel hyperplanes (the marginal hyperplanes) and the set (R^n \ {~0}) × R.

Henceforth, we will abuse notation a bit: when we say "hyperplane," we will always mean an element of (R^n \ {~0}) × R unless otherwise stated.


Hyperplanes The definition of a support vector machine

Separating hyperplanes

Definition 8

Let (X, X+, X−) denote a non-empty set X of vectors in R^n that is separated into the two (disjoint) non-empty sets X+ and X−. Such a collection of sets is called a training data set. A hyperplane H ⊆ R^n, described by (~w, c) ∈ (R^n \ {~0}) × R, separates (X, X+, X−) iff

〈~w, ~x+〉 − c > 0  &  〈~w, ~x−〉 − c < 0

for all ~x+ ∈ X+ and for all ~x− ∈ X−. In this case, H is said to be a separating hyperplane for (X, X+, X−). H marginally separates (X, X+, X−) iff

〈~w, ~x+〉 − c ≥ 1  &  〈~w, ~x−〉 − c ≤ −1

for all ~x+ ∈ X+ and for all ~x− ∈ X−.
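The two conditions in Definition 8 translate directly into code; a minimal sketch (the data and function names are my own):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def separates(w, c, X_plus, X_minus):
    """<w, x+> - c > 0 and <w, x-> - c < 0 for all training points."""
    return (all(dot(w, x) - c > 0 for x in X_plus) and
            all(dot(w, x) - c < 0 for x in X_minus))

def marginally_separates(w, c, X_plus, X_minus):
    """<w, x+> - c >= 1 and <w, x-> - c <= -1 for all training points."""
    return (all(dot(w, x) - c >= 1 for x in X_plus) and
            all(dot(w, x) - c <= -1 for x in X_minus))

X_plus, X_minus = [(0.0, 1.0)], [(-1.0, -2.0), (2.0, -1.0)]
print(separates((0.0, 0.5), 0.0, X_plus, X_minus))             # True
print(marginally_separates((0.0, 0.5), 0.0, X_plus, X_minus))  # False
```

The second call is False because the positive point sits strictly between the marginal planes, even though it is on the correct side of H.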


Hyperplanes The definition of a support vector machine

Support vector machine

Definition 9

Let (X, X+, X−) denote a training data set in R^n. Let S_X ⊆ (R^n \ {~0}) × R denote the set of hyperplanes that marginally separate (X, X+, X−). Let f : S_X → R be the function defined by

S_X ∋ (~w, c) ↦ f(~w, c) := 2/‖~w‖,

i.e. the margin width. A support vector machine (SVM) for (X, X+, X−) is a maximum of f, i.e. an SVM is a marginally separating hyperplane (~w, c) ∈ S_X such that 1/‖~w′‖ ≤ 1/‖~w‖ for every other marginally separating hyperplane (~w′, c′) ∈ S_X.


Hyperplanes The definition of a support vector machine

Support vector machine

Example 10

An SVM is a hyperplane that maximizes the margin.

[Figure: a separating hyperplane H with marginal planes H± whose margin is as wide as possible for the plotted + and − points.]


Hyperplanes The definition of a support vector machine

The support vectors of a marginally separating hyperplane

How will we know when we have found an SVM? The key to determining marginally separating hyperplanes will be the possible support vectors. Our goal will then be to maximize the margin over all possible support vectors.

Definition 11

Let (X, X+, X−) be a training data set and let (~w, c) be a marginally separating hyperplane for this set. The elements of X ∩ H± are called support vectors for (~w, c). The set of support vectors is denoted by H^supp_X. The notation H^supp_X± := H^supp_X ∩ H± will also be used to denote the sets of positive and negative support vectors.


Hyperplanes The definition of a support vector machine

Examples of support vectors

Example 12

Two examples of support vectors have been circled for two different marginally separating hyperplanes for the same training data set.

[Figure: two copies of the same + and − points with different hyperplanes H and marginal planes H±; in each, the circled points lying on H± are the support vectors.]


Existence and uniqueness of SVMs Enlarging margins

“Enlarging the margins” Lemma Statement

Lemma 13

Let (X, X+, X−) be a training data set and let (~w, c) be a marginally separating hyperplane for this set. Then there exists a unique marginally separating hyperplane (~v, d) such that

v̂ = ŵ  &  2/‖~v‖ = min_{~x+ ∈ X+, ~x− ∈ X−} 〈~x+ − ~x−, ŵ〉.

In other words, if the marginal planes do not contain any of the training data set, then the separating hyperplane can be translated and the margin width can be enlarged until the margin touches both the positive and negative training data sets. We use the minimum orthogonal distance between the positive and negative training data sets to define the new margin width.


Existence and uniqueness of SVMs Enlarging margins

“Enlarging the margins” Lemma Picture

[Figure: the original marginal planes H± around H, and the enlarged marginal planes K± around K, translated and widened until they touch the + and − points.]


Existence and uniqueness of SVMs Enlarging margins

“Enlarging the margins” Lemma Proof

Proof.

Notice that ((c − 1)/‖~w‖)ŵ ∈ H−, (c/‖~w‖)ŵ ∈ H, and ((c + 1)/‖~w‖)ŵ ∈ H+ provide explicit vectors in each of these three hyperplanes. Set m+ to be the remaining minimum orthogonal distance between H+ and X+, and set m− to be the remaining minimum orthogonal distance between H− and X−, namely

m± := min_{~x± ∈ X±} θ(~x±)(〈~x±, ŵ〉 − (c ± 1)/‖~w‖),

where θ : X → R is the function defined by

X ∋ ~x ↦ θ(~x) := +1 if ~x ∈ X+, −1 if ~x ∈ X−.


Existence and uniqueness of SVMs Enlarging margins

“Enlarging the margins” Lemma Proof

Proof.

Therefore, the planes K± containing the vectors ((c ± 1)/‖~w‖ ± m±)ŵ that are perpendicular to ŵ intersect X± but contain no points of X in the interior of their margin. By Lemma 6, there exists a (~v, d) ∈ (R^n \ {~0}) × R that describes these marginal separating hyperplanes. In fact, they can be given explicitly by

~v := (2 / (2/‖~w‖ + m+ + m−)) ŵ

and

d := 〈~v, ((c + 1)/‖~w‖ + m+)ŵ〉 − 1 = (2c + ‖~w‖(m+ − m−)) / (2 + ‖~w‖(m+ + m−)).
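The explicit formulas for (~v, d) can be checked numerically; a sketch with a small example of my own (chosen so the arithmetic is exact in floating point):

```python
import math

# Data and a starting hyperplane whose margins miss the data (my own example).
X_plus, X_minus = [(0.0, 2.0)], [(0.0, -2.0)]
w, c = (0.0, 2.0), 0.0     # marginally separates, but H± don't touch the points

norm  = math.hypot(w[0], w[1])
w_hat = (w[0] / norm, w[1] / norm)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

# m± as defined in the proof:
m_plus  = min(dot(x, w_hat) - (c + 1) / norm for x in X_plus)
m_minus = min(-(dot(x, w_hat) - (c - 1) / norm) for x in X_minus)

scale = 2 / (2 / norm + m_plus + m_minus)
v = (scale * w_hat[0], scale * w_hat[1])
d = (2 * c + norm * (m_plus - m_minus)) / (2 + norm * (m_plus + m_minus))

# The enlarged marginal planes K± now touch the data:
print(dot(v, X_plus[0]) - d, dot(v, X_minus[0]) - d)   # 1.0 -1.0
```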


Existence and uniqueness of SVMs Existence of SVMs

Existence of SVMs

Theorem 14

Let (X, X+, X−) be a training data set for which there exists a separating hyperplane. Then there exists a unique SVM for (X, X+, X−).

Proof.

By Lemma 13, it suffices to maximize the margin function f on the subset S^supp_X ⊆ S_X consisting of marginally separating hyperplanes that have both positive and negative support vectors, namely on

S^supp_X := {(~w, c) ∈ S_X : H± ∩ X± ≠ ∅}.

The goal is therefore to maximize the margin function, which is a function of (~w, c), subject to the constraints

θ(~x)(〈~w, ~x〉 − c) − 1 = 0  for all ~x ∈ H^supp_X.


Existence and uniqueness of SVMs Existence of SVMs

Proof of existence of SVMs

Proof.

Maximizing the margin function is equivalent to minimizing the function

(R^n \ {~0}) × R ∋ (~w, c) ↦ g(~w, c) := (1/2)‖~w‖² − Σ_{~x ∈ X} α_{~x} (θ(~x)(〈~w, ~x〉 − c) − 1).

Here, α_{~x} = 0 for all ~x ∈ X \ H^supp_X (remember, H^supp_X is the set of support vectors for the given hyperplane) and α_{~x} needs to be determined for all ~x ∈ H^supp_X. This condition guarantees that the function g equals 2/f² when restricted to S^supp_X (but notice that it does not equal 2/f² on the larger domain S_X of all marginally separating hyperplanes). The α_{~x} are called Lagrange multipliers. The extrema of g occur at points (~w, c) for which the derivative of g vanishes with respect to these coordinates:

∂g/∂~w |_{(~w,c)} = 0  &  ∂g/∂c |_{(~w,c)} = 0.


Existence and uniqueness of SVMs Existence of SVMs

Proof of existence of SVMs

Proof.

The first of these two equations gives

~w = Σ_{~x ∈ X} α_{~x} θ(~x) ~x,   (1)

while the second equation gives

Σ_{~x ∈ X} α_{~x} θ(~x) = 0,   (2)

which is a condition that the Lagrange multipliers have to satisfy. Plugging these results back into the function g gives

g(~w, c) = Σ_{~x ∈ X} α_{~x} − (1/2) Σ_{~x,~y ∈ X} α_{~x} α_{~y} θ(~x) θ(~y) 〈~x, ~y〉.
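That substituting (1) and (2) into g really yields this expression can be spot-checked numerically; a sketch with data of my own:

```python
import random

# A numeric sanity check of the dual expression for g.
random.seed(0)
pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(4)]
th  = [1, 1, -1, -1]
a   = [0.25, 0.75, 0.5, 0.5]   # chosen so that sum(a_i * th_i) = 0, i.e. (2) holds
c   = 0.37                      # arbitrary; it drops out because of (2)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# (1): w = sum_x alpha_x theta(x) x
w = tuple(sum(ai * ti * x[k] for ai, ti, x in zip(a, th, pts)) for k in (0, 1))

g_primal = 0.5 * dot(w, w) - sum(ai * (ti * (dot(w, x) - c) - 1)
                                 for ai, ti, x in zip(a, th, pts))
g_dual = sum(a) - 0.5 * sum(ai * aj * ti * tj * dot(x, y)
                            for ai, ti, x in zip(a, th, pts)
                            for aj, tj, y in zip(a, th, pts))
print(abs(g_primal - g_dual) < 1e-9)   # True
```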


Existence and uniqueness of SVMs Existence of SVMs

Proof of existence of SVMs

Proof.

Setting ∂g/∂α_{~x} |_{(~w,c)} = 0 for each ~x ∈ H^supp_X gives additional conditions that the Lagrange multipliers have to satisfy:

θ(~x)(〈~w, ~x〉 − c) = 1

for each ~x ∈ H^supp_X. Plugging the earlier result (1) for ~w into this constraint gives

Σ_{~y ∈ X} α_{~y} θ(~y) 〈~y, ~x〉 − θ(~x) = c   (3)

for each ~x ∈ H^supp_X. However, there is one subtle point: we do not know what H^supp_X is. Nevertheless, there is still an optimization procedure left over, based on the different possible choices of H^supp_X, i.e. on each pair of nonempty subsets of X+ and X−.


Existence and uniqueness of SVMs Existence of SVMs

Proof of existence of SVMs

Proof.

For each choice of H^supp_X, one has the linear system

Σ_{~y ∈ X} θ(~y) α_{~y} = 0
{ Σ_{~y ∈ X} θ(~y) 〈~y, ~x〉 α_{~y} − c = θ(~x) }_{~x ∈ H^supp_X}   (4)

in the variables {α_{~x} : ~x ∈ H^supp_X} ∪ {c}, obtained from equations (2) and (3). This describes a linear system of |H^supp_X| + 1 equations in |H^supp_X| + 1 variables. At least one of these systems is consistent because we have assumed that the training data set can be separated. Hence, a solution to the SVM problem exists, since it is given by maximizing 1/‖~w‖ over the (finite) set of pairs of nonempty subsets of X+ and X−.
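System (4) can be assembled and solved mechanically for any candidate support set; a sketch using exact rational arithmetic (the function name and toy data are my own):

```python
from fractions import Fraction

def solve_system4(support, theta):
    """Solve system (4) for one candidate support set.
    support: list of support vectors; theta: their labels (+1 or -1).
    Returns (alphas, c). A small Gauss-Jordan solver; assumes the
    system is consistent with a unique solution."""
    def dot(a, b):
        return sum(Fraction(ai) * bi for ai, bi in zip(a, b))
    n = len(support)
    # Row for (2): sum_y theta(y) alpha_y = 0.
    rows = [[Fraction(t) for t in theta] + [Fraction(0), Fraction(0)]]
    # One row per support vector x, for (3):
    # sum_y theta(y) <y, x> alpha_y - c = theta(x).
    for i, x in enumerate(support):
        rows.append([theta[j] * dot(support[j], x) for j in range(n)]
                    + [Fraction(-1), Fraction(theta[i])])
    m = n + 1                      # unknowns: alpha_1..alpha_n, then c
    for col in range(m):           # Gauss-Jordan elimination
        piv = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[piv] = rows[piv], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(m):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [u - f * v for u, v in zip(rows[r], rows[col])]
    sol = [rows[r][m] for r in range(m)]
    return sol[:n], sol[n]

# Toy support set: one point on each side of the x-axis.
alphas, c = solve_system4([(0, 1), (0, -1)], [1, -1])
print(alphas[0], alphas[1], c)   # 1/2 1/2 0
```

Here (1) then gives ~w = (1/2)(0, 1) − (1/2)(0, −1) = (0, 1), the expected normal.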


A simple example of an SVM

Example of an SVM

Consider the training data set given by

X+ := {~x+ := [0, 1]ᵀ}  &  X− := {~x¹− := [−1, −2]ᵀ, ~x²− := [2, −1]ᵀ}.

[Figure: the three training points in the plane, one + and two −.]

The inner products are given by

〈~x+, ~x+〉 = 1, 〈~x+, ~x¹−〉 = −2, 〈~x+, ~x²−〉 = −1,
〈~x¹−, ~x¹−〉 = 5, 〈~x¹−, ~x²−〉 = 0, 〈~x²−, ~x²−〉 = 5.


A simple example of an SVM

Example: case i) H^supp_X = {~x+, ~x¹−, ~x²−}

For this case, (4) reads

α_{~x+} − α_{~x¹−} − α_{~x²−} = 0
α_{~x+} + 2α_{~x¹−} + α_{~x²−} − c = 1
−2α_{~x+} − 5α_{~x¹−} − c = −1
−α_{~x+} − 5α_{~x²−} − c = −1

and the solution to this linear system is

α_{~x+} = 5/16,  α_{~x¹−} = 1/8,  α_{~x²−} = 3/16,  c = −1/4.

The associated vector ~w and margin are therefore given by

~w = (5/16)[0, 1]ᵀ − (1/8)[−1, −2]ᵀ − (3/16)[2, −1]ᵀ = (1/4)[−1, 3]ᵀ  ⟹  2/‖~w‖ = 8/√10.
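The stated solution can be verified by direct substitution; a sketch (the fractions are dyadic, so floating point is exact here):

```python
import math

# Direct check of the case i) numbers.
a_p, a_1, a_2, c = 5/16, 1/8, 3/16, -1/4

print(a_p - a_1 - a_2)         # 0.0   (row 1 of the system)
print(a_p + 2*a_1 + a_2 - c)   # 1.0   (row 2)
print(-2*a_p - 5*a_1 - c)      # -1.0  (row 3)
print(-a_p - 5*a_2 - c)        # -1.0  (row 4)

# w = a_p*(0,1) - a_1*(-1,-2) - a_2*(2,-1) and the margin 2/||w||:
w = (a_p*0 - a_1*(-1) - a_2*2, a_p*1 - a_1*(-2) - a_2*(-1))
print(w)                       # (-0.25, 0.75), i.e. (1/4)*(-1, 3)
print(math.isclose(2 / math.hypot(*w), 8 / math.sqrt(10)))   # True
```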


A simple example of an SVM

Example: case i) H^supp_X = {~x+, ~x¹−, ~x²−}

The hyperplanes are then given by the solutions to the three equations

H : (1/4)(−x + 3y) + 1/4 = 0
H+ : (1/4)(−x + 3y) + 1/4 = 1
H− : (1/4)(−x + 3y) + 1/4 = −1

[Figure: H and H± for case i), with all three training points on the marginal planes.]


A simple example of an SVM

Example: case ii) H^supp_X = {~x+, ~x¹−}

For this case, α_{~x²−} = 0 and (4) reads

α_{~x+} − α_{~x¹−} = 0
α_{~x+} + 2α_{~x¹−} − c = 1
−2α_{~x+} − 5α_{~x¹−} − c = −1

and the solution to this linear system is

α_{~x+} = 1/5,  α_{~x¹−} = 1/5,  c = −2/5.

The associated vector ~w and margin are therefore given by

~w = (1/5)[0, 1]ᵀ − (1/5)[−1, −2]ᵀ = (1/5)[1, 3]ᵀ  ⟹  2/‖~w‖ = 10/√10.


A simple example of an SVM

Example: case ii) H^supp_X = {~x+, ~x¹−}

The hyperplanes are then given by the solutions to the three equations

H : (1/5)(x + 3y) + 2/5 = 0
H+ : (1/5)(x + 3y) + 2/5 = 1
H− : (1/5)(x + 3y) + 2/5 = −1

[Figure: H and H± for case ii); the point ~x²− lies inside the margin.]

Even though the margin is larger here, we must reject this solution because it is not a marginally separating hyperplane for the training data set.
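The rejection can be seen in one line: the left-out point ~x²− violates the marginal-separation inequality. A sketch:

```python
# Why case ii) is rejected: the left-out negative point is inside the margin.
w, c = (1/5, 3/5), -2/5          # the case ii) candidate: w = (1/5)*(1, 3)
x2 = (2, -1)                     # the left-out point x^2_-

value = w[0]*x2[0] + w[1]*x2[1] - c
print(value)         # about 0.2, but a negative point needs <w, x> - c <= -1
print(value <= -1)   # False, so (w, c) is not marginally separating
```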


A simple example of an SVM

Example: case iii) H^supp_X = {~x+, ~x²−}

For this case, α_{~x¹−} = 0 and (4) reads

α_{~x+} − α_{~x²−} = 0
α_{~x+} + α_{~x²−} − c = 1
−α_{~x+} − 5α_{~x²−} − c = −1

and the solution to this linear system is

α_{~x+} = 1/4,  α_{~x²−} = 1/4,  c = −1/2.

The associated vector ~w and margin are therefore given by

~w = (1/4)[0, 1]ᵀ − (1/4)[2, −1]ᵀ = (1/2)[−1, 1]ᵀ  ⟹  2/‖~w‖ = 2√2 = 4√5/√10.


A simple example of an SVM

Example: case iii) H^supp_X = {~x+, ~x²−}

The hyperplanes are then given by the solutions to the three equations

H : (1/2)(−x + y) + 1/2 = 0
H+ : (1/2)(−x + y) + 1/2 = 1
H− : (1/2)(−x + y) + 1/2 = −1

[Figure: H and H± for case iii); the point ~x¹− lies inside the margin.]

We must also reject this solution because it is not a marginally separating hyperplane for the training data set. Hence, our first (~w, c) is the SVM.
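As a closing check, one can test all three candidates against the definition of marginal separation; a sketch (candidate values copied from the three cases above):

```python
# Among the three candidate (w, c), only case i) marginally separates the data.
X_plus, X_minus = [(0, 1)], [(-1, -2), (2, -1)]
candidates = {
    "case i":   ((-1/4, 3/4), -1/4),
    "case ii":  ((1/5, 3/5),  -2/5),
    "case iii": ((-1/2, 1/2), -1/2),
}

def marginally_separates(w, c):
    dot = lambda a, b: a[0]*b[0] + a[1]*b[1]
    return (all(dot(w, x) - c >= 1 for x in X_plus) and
            all(dot(w, x) - c <= -1 for x in X_minus))

valid = [name for name, (w, c) in candidates.items() if marginally_separates(w, c)]
print(valid)   # ['case i']
```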


Thank you

Thank you!

Thanks for your attention!

References:

1. Alice Zhao, Support Vector Machines: A Visual Explanation with Sample Python Code (2017). Available at https://youtu.be/N1vOgolbjSc.

2. Patrick Winston, 6.034 Artificial Intelligence, "Lecture 16. Learning: Support Vector Machines." Fall 2010. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA. Available at https://youtu.be/_PwhiWxHK8o.

3. Benjamin Russo. Private communication.

4. Images of muffin from Wikipedia (https://en.wikipedia.org/wiki/Muffin).

5. Images of cupcake from https://littlecupcakes.com.au/product/red-velvet/.
