COMP 406 Lecture 09
Artificial Intelligence
Fiona Yan Liu, Department of Computing
The Hong Kong Polytechnic University
Inductive Learning
Simplest form: learn a function f from example pairs (x, f(x))
Problem: given a training set of examples, find a hypothesis h such that h ≈ f
Nov. 17, 2015 Classification 2
Inductive Learning
How do we know that the hypothesis h is close to the target function f if we don't know what f is? How many examples do we need to get a good h?
What hypothesis space should we use?
How complex should h be?
How do we avoid overfitting?
Univariate Linear Regression
Regression with a univariate linear function is also known as "fitting a straight line". A univariate linear function (a straight line) with input x and output y has the form y = w1x + w0. Loss function: Loss(hw) = Σj (yj − (w1xj + w0))²
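The straight-line fit above has a well-known closed-form least-squares solution; a minimal sketch (the data and helper names are illustrative, not from the lecture):

```python
# Univariate linear regression: fit y = w1*x + w0 by minimizing the
# squared loss Loss(hw) = sum_j (y_j - (w1*x_j + w0))**2.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares: w1 = cov(x, y) / var(x)
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x
    return w1, w0

def loss(w1, w0, xs, ys):
    # The squared loss from the slide
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys))

# Points lying exactly on y = 2x + 1: the fit recovers w1 = 2, w0 = 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w1, w0 = fit_line(xs, ys)
```

On noise-free data the loss at the fitted line is zero; on noisy data the same formulas give the minimum-loss line.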
Linear Classification
Linear classification can be viewed as the task of finding the linear separator that can separate different classes in the feature space:
wTx + b = 0 (on the separator)
wTx + b > 0 (one side of the separator)
wTx + b < 0 (the other side)
f(x) = sign(wTx + b)
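The decision rule f(x) = sign(wTx + b) can be sketched directly (the weights and test points below are illustrative assumptions):

```python
# Linear classifier: label a point by which side of the separator
# wTx + b = 0 it falls on.

def sign(z):
    return 1 if z > 0 else -1

def classify(w, b, x):
    # f(x) = sign(w^T x + b)
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Illustrative separator x1 - x2 + 0.5 = 0
w = [1.0, -1.0]
b = 0.5
```

Points with wTx + b > 0 get label 1, the rest label −1; the separator itself is the zero level set.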
Probably Approximately Correct Learning
Any hypothesis that is seriously wrong will almost certainly be found out with high probability after a small number of examples: it will make an incorrect prediction.
Any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: it must be probably approximately correct.
Probably approximately correct learning algorithm: any learning algorithm that returns hypotheses that are probably approximately correct.
Correctness of Hypothesis
A hypothesis is called approximately correct if error(h) ≤ ɛ, where ɛ is a small constant. error(h): the error rate of a hypothesis.
For a "seriously wrong" hypothesis hb belonging to Hbad, we have error(hb) > ɛ, so the probability that it agrees with the first N examples is
P(hb agrees with N examples) ≤ (1−ɛ)^N
The probability that Hbad contains at least one consistent hypothesis is bounded by
P(Hbad contains a consistent hypothesis) ≤ |Hbad|(1−ɛ)^N ≤ |H|(1−ɛ)^N
We would like the probability of this event to be below some small number δ:
|H|(1−ɛ)^N ≤ δ
Given that 1−ɛ ≤ e^−ɛ, we can achieve this if we allow the algorithm to see
N ≥ ln(|H|/δ)/ɛ
examples. This number N, as a function of ɛ and δ, is called the sample complexity of the hypothesis space.
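Plugging illustrative values into the sample-complexity bound (the choices of |H|, ɛ, and δ below are assumptions, not from the lecture):

```python
import math

# Sample complexity: smallest integer N with N >= ln(|H|/delta) / epsilon.

def sample_complexity(h_size, eps, delta):
    return math.ceil(math.log(h_size / delta) / eps)

# e.g. |H| = 2**10 hypotheses, epsilon = 0.1, delta = 0.05:
# ln(1024 / 0.05) / 0.1 ~= 99.3, so N = 100 examples suffice
n = sample_complexity(2 ** 10, 0.1, 0.05)
```

Note the bound grows only logarithmically in |H| and 1/δ, but linearly in 1/ɛ.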
How to Choose the Hypothesis Space
To obtain real generalization to unseen examples, it seems we need to restrict the hypothesis space.
Several ways to restrict the hypothesis space:
to bring in prior knowledge
to insist that the algorithm return not just any consistent hypothesis, but preferably a simple one
to focus on learnable subsets of the entire hypothesis space
Linear Classification with Hard Threshold
Linear functions can be used to do classification as well as regression:
hw(x) = Threshold(wTx), where Threshold(z) = 1 if z > 0 and 0 otherwise.
Since the loss function is not differentiable, we cannot obtain the solution as in the regression problem; instead, the weights are updated iteratively:
wi ← wi + α(y − hw(x))xi
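A minimal sketch of this update rule (the perceptron learning rule) on a tiny linearly separable set; the toy data, learning rate, and epoch count are illustrative assumptions:

```python
# Hard-threshold linear classifier trained with the update rule
#   wi <- wi + alpha * (y - hw(x)) * xi
# The weights change only when the prediction is wrong.

def threshold(z):
    return 1 if z > 0 else 0

def predict(w, x):
    return threshold(sum(wi * xi for wi, xi in zip(w, x)))

def train(data, alpha=0.1, epochs=100):
    w = [0.0, 0.0, 0.0]  # w[0] acts as a bias: inputs carry a leading 1
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, x)
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w

# OR-like toy data; the first feature is the constant 1 for the bias
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train(data)
```

Because the data is linearly separable, the rule converges to weights that classify every training example correctly.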
Linear Classification with Logistic Regression
The problems of linear classification with a hard threshold:
hw(x) is not differentiable
The linear classifier always announces a completely confident prediction of 1 or 0, even for examples close to the boundary
hw(x) = Logistic(wTx) = 1/(1 + e^(−wTx))
wi ← wi + α(y − hw(x))(1 − hw(x))hw(x)xi
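The same toy problem can be trained with the logistic update above; a minimal sketch, where the data, learning rate, and epoch count are illustrative assumptions:

```python
import math

# Logistic regression trained with the slide's gradient update
#   wi <- wi + alpha * (y - hw(x)) * (1 - hw(x)) * hw(x) * xi

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def hw(w, x):
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))

def train(data, alpha=0.5, epochs=5000):
    w = [0.0, 0.0, 0.0]  # leading weight acts as the bias
    for _ in range(epochs):
        for x, y in data:
            p = hw(w, x)
            step = alpha * (y - p) * (1 - p) * p
            w = [wi + step * xi for wi, xi in zip(w, x)]
    return w

# OR-like toy data with a constant 1 feature for the bias
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train(data)
```

Unlike the hard threshold, the output hw(x) is a smooth value in (0, 1), so examples near the boundary get predictions near 0.5 rather than a fully confident 0 or 1.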
Multiple Solutions
Which of the linear separators is the best?
Classification Margin
Distance from example xi to the separator is r = |wTxi + b| / ‖w‖
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the distance between support vectors.
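The distance formula r = |wTx + b| / ‖w‖ is easy to check numerically; the hyperplane and point below are illustrative choices:

```python
import math

# Distance from a point x to the separator wTx + b = 0:
#   r = |wTx + b| / ||w||

def distance(w, b, x):
    num = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return num / math.sqrt(sum(wi * wi for wi in w))

# Hyperplane 3x + 4y - 5 = 0 (||w|| = 5); for the point (3, 4):
# |3*3 + 4*4 - 5| / 5 = 20 / 5 = 4
r = distance([3.0, 4.0], -5.0, [3.0, 4.0])
```
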
Example of SVM
Given two support vectors: (5, 1) belonging to class 1, (−1, −1) belonging to class 2
Calculate w1 and w0 so that the linear hyperplane y = w1x + w0 gives the maximum margin.
The maximum-margin separator is the perpendicular bisector of the segment joining the two support vectors: it passes through their midpoint (2, 0) with normal direction (6, 2), giving 3x + y − 6 = 0, i.e.
y = −3x + 6
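The worked example can be verified numerically, treating the separator as the perpendicular bisector of the two support vectors (a sketch of the geometry, not a general SVM solver):

```python
import math

# Support vectors from the example
p1 = (5.0, 1.0)    # class 1
p2 = (-1.0, -1.0)  # class 2

mid = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)  # point on the separator
normal = (p1[0] - p2[0], p1[1] - p2[1])           # w is parallel to p1 - p2

# Separator: normal . (x - mid) = 0  =>  6x + 2y - 12 = 0  =>  y = -3x + 6
w = normal
b = -(normal[0] * mid[0] + normal[1] * mid[1])

def signed(x):
    return w[0] * x[0] + w[1] * x[1] + b

# In slope-intercept form y = w1*x + w0:
slope = -w[0] / w[1]      # w1 = -3
intercept = -b / w[1]     # w0 = 6

# Both support vectors lie at the same distance from the separator
norm_w = math.hypot(*w)
r1 = abs(signed(p1)) / norm_w
r2 = abs(signed(p2)) / norm_w
```

Both support vectors are equidistant from the line and lie on opposite sides, confirming y = −3x + 6 is the maximum-margin separator.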
The Kernel Function
The linear classifier relies on the inner product between vectors: K(xi, xj) = xiTxj
If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:
K(xi, xj) = φ(xi)Tφ(xj)
A kernel function is a function that is equivalent to an inner product in some feature space. Every positive semi-definite symmetric function is a kernel.
A kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
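The "implicit mapping" can be made concrete for the degree-2 polynomial kernel in two dimensions, whose explicit feature map φ is known; a sketch (the test vectors are illustrative):

```python
import math

# Kernel trick: K(x, y) = (1 + xTy)^2 equals phi(x).phi(y) in a
# 6-dimensional feature space, without ever forming phi explicitly.

def poly_kernel(x, y):
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (d = 2)
    s2 = math.sqrt(2.0)
    return [1.0, s2 * x[0], s2 * x[1], x[0] ** 2, x[1] ** 2, s2 * x[0] * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [1.0, 2.0], [3.0, 0.5]
k_implicit = poly_kernel(x, y)    # computed in the 2-D input space
k_explicit = dot(phi(x), phi(y))  # computed in the 6-D feature space
```

The two values agree, yet the kernel computation touches only the original 2-D vectors; this is what lets SVMs work in very high-dimensional feature spaces cheaply.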
Nonlinear SVM
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable.
Φ: x → φ(x)
Examples of Kernel Functions
Linear: K(xi, xj) = xiTxj. Mapping Φ: x → φ(x), where φ(x) is x itself.
Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p. Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions (d is the input dimension).
Gaussian (radial-basis function): K(xi, xj) = exp(−‖xi − xj‖² / 2σ²). Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of the functions for the support vectors is the separator.
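Evaluating the three example kernels on a pair of points (the vectors and the σ for the Gaussian kernel are illustrative choices):

```python
import math

# The three kernels from the slide, evaluated directly.

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, p):
    return (1 + linear_kernel(x, y)) ** p

def gaussian_kernel(x, y, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

x, y = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(x, y)         # 1*3 + 2*4 = 11
k_poly = poly_kernel(x, y, 2)       # (1 + 11)^2 = 144
k_rbf = gaussian_kernel(x, y, 1.0)  # exp(-8 / 2) = exp(-4)
```

Note the Gaussian kernel depends only on the distance between the points: identical points give K = 1, and K decays toward 0 as they move apart.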
Examples of SVM with Polynomial Kernel
Examples of SVM with Radial‐basis Function Kernel
Properties of SVM
Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution