COMP 406 Lecture 09
Artificial Intelligence
Fiona Yan Liu, Department of Computing
The Hong Kong Polytechnic University
Inductive Learning
Simplest form: learn a function f from example pairs (x, f(x))
Problem: given a training set of examples, find a hypothesis h such that h ≈ f
Nov. 17, 2015 Classification 2
Inductive Learning
How do we know that the hypothesis h is close to the target function f if we don't know what f is? How many examples do we need to get a good h?
What hypothesis space should we use?
How complex should h be?
How do we avoid overfitting?
Univariate Linear Regression
Regression with a univariate linear function is also known as "fitting a straight line". A univariate linear function (a straight line) with input x and output y has the form y = w1x + w0. Loss function: Loss(hw) = Σj (yj − (w1xj + w0))²
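The straight-line fit above has a well-known closed-form least-squares solution; a minimal sketch (the data and helper names are illustrative, not from the lecture):

```python
# Univariate linear regression: fit y = w1*x + w0 by minimizing the
# squared loss Loss(hw) = sum_j (y_j - (w1*x_j + w0))**2.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares: w1 = cov(x, y) / var(x)
    w1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    w0 = mean_y - w1 * mean_x
    return w1, w0

def loss(w1, w0, xs, ys):
    # The squared loss from the slide
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys))

# Points lying exactly on y = 2x + 1: the fit recovers w1 = 2, w0 = 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w1, w0 = fit_line(xs, ys)
```

On noise-free data the loss at the fitted line is zero; on noisy data the same formulas give the minimum-loss line.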
Linear Classification
Linear classification can be viewed as the task of finding the linear separator that can separate different classes in the feature space:
wTx + b = 0 (on the separator)
wTx + b > 0 (one side of the separator)
wTx + b < 0 (the other side)
f(x) = sign(wTx + b)
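The decision rule f(x) = sign(wTx + b) can be sketched directly (the weights and test points below are illustrative assumptions):

```python
# Linear classifier: label a point by which side of the separator
# wTx + b = 0 it falls on.

def sign(z):
    return 1 if z > 0 else -1

def classify(w, b, x):
    # f(x) = sign(w^T x + b)
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Illustrative separator x1 - x2 + 0.5 = 0
w = [1.0, -1.0]
b = 0.5
```

Points with wTx + b > 0 get label 1, the rest label −1; the separator itself is the zero level set.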
Probably Approximately Correct Learning
Any hypothesis that is seriously wrong will almost certainly be found out with high probability after a small number of examples: it will make an incorrect prediction.
Any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: it must be probably approximately correct.
Probably approximately correct learning algorithm: any learning algorithm that returns hypotheses that are probably approximately correct.
Correctness of Hypothesis
A hypothesis is called approximately correct if error(h) ≤ ɛ, where ɛ is a small constant. error(h): the error rate of a hypothesis.
For a "seriously wrong" hypothesis hb belonging to Hbad, we have error(hb) > ɛ, so the probability that it agrees with the first N examples is
P(hb agrees with N examples) ≤ (1−ɛ)^N
The probability that Hbad contains at least one consistent hypothesis is bounded by
P(Hbad contains a consistent hypothesis) ≤ |Hbad|(1−ɛ)^N ≤ |H|(1−ɛ)^N
We would like the probability of this event to be below some small number δ:
|H|(1−ɛ)^N ≤ δ
Given that 1−ɛ ≤ e^−ɛ, we can achieve this if we allow the algorithm to see
N ≥ ln(|H|/δ)/ɛ
examples. This number N, as a function of ɛ and δ, is called the sample complexity of the hypothesis space.
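Plugging illustrative values into the sample-complexity bound (the choices of |H|, ɛ, and δ below are assumptions, not from the lecture):

```python
import math

# Sample complexity: smallest integer N with N >= ln(|H|/delta) / epsilon.

def sample_complexity(h_size, eps, delta):
    return math.ceil(math.log(h_size / delta) / eps)

# e.g. |H| = 2**10 hypotheses, epsilon = 0.1, delta = 0.05:
# ln(1024 / 0.05) / 0.1 ~= 99.3, so N = 100 examples suffice
n = sample_complexity(2 ** 10, 0.1, 0.05)
```

Note the bound grows only logarithmically in |H| and 1/δ, but linearly in 1/ɛ.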
How to Choose the Hypothesis Space
To obtain real generalization to unseen examples, it seems we need to restrict the hypothesis space.
Several ways to restrict the hypothesis space:
to bring in prior knowledge
to insist that the algorithm return not just any consistent hypothesis, but preferably a simple one
to focus on learnable subsets of the entire hypothesis space
Linear Classification with Hard Threshold
Linear functions can be used to do classification as well as regression:
hw(x) = Threshold(wTx), where Threshold(z) = 1 if z > 0 and 0 otherwise.
Since the loss function is not differentiable, we cannot obtain the solution as in the regression problem; instead, the weights are updated iteratively:
wi ← wi + α(y − hw(x))xi
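A minimal sketch of this update rule (the perceptron learning rule) on a tiny linearly separable set; the toy data, learning rate, and epoch count are illustrative assumptions:

```python
# Hard-threshold linear classifier trained with the update rule
#   wi <- wi + alpha * (y - hw(x)) * xi
# The weights change only when the prediction is wrong.

def threshold(z):
    return 1 if z > 0 else 0

def predict(w, x):
    return threshold(sum(wi * xi for wi, xi in zip(w, x)))

def train(data, alpha=0.1, epochs=100):
    w = [0.0, 0.0, 0.0]  # w[0] acts as a bias: inputs carry a leading 1
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, x)
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w

# OR-like toy data; the first feature is the constant 1 for the bias
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train(data)
```

Because the data is linearly separable, the rule converges to weights that classify every training example correctly.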
Linear Classification with Logistic Regression
The problems of linear classification with a hard threshold:
hw(x) is not differentiable
The linear classifier always announces a completely confident prediction of 1 or 0, even for examples close to the boundary
hw(x) = Logistic(wTx) = 1/(1 + e^(−wTx))
wi ← wi + α(y − hw(x))(1 − hw(x))hw(x)xi
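The same toy problem can be trained with the logistic update above; a minimal sketch, where the data, learning rate, and epoch count are illustrative assumptions:

```python
import math

# Logistic regression trained with the slide's gradient update
#   wi <- wi + alpha * (y - hw(x)) * (1 - hw(x)) * hw(x) * xi

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def hw(w, x):
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))

def train(data, alpha=0.5, epochs=5000):
    w = [0.0, 0.0, 0.0]  # leading weight acts as the bias
    for _ in range(epochs):
        for x, y in data:
            p = hw(w, x)
            step = alpha * (y - p) * (1 - p) * p
            w = [wi + step * xi for wi, xi in zip(w, x)]
    return w

# OR-like toy data with a constant 1 feature for the bias
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train(data)
```

Unlike the hard threshold, the output hw(x) is a smooth value in (0, 1), so examples near the boundary get predictions near 0.5 rather than a fully confident 0 or 1.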
Multiple Solutions
Which of the linear separators is the best?
Classification Margin
Distance from example xi to the separator is r = |wTxi + b| / ‖w‖
Examples closest to the hyperplane are support vectors.
Margin ρ of the separator is the distance between support vectors.
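The distance formula r = |wTx + b| / ‖w‖ is easy to check numerically; the hyperplane and point below are illustrative choices:

```python
import math

# Distance from a point x to the separator wTx + b = 0:
#   r = |wTx + b| / ||w||

def distance(w, b, x):
    num = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return num / math.sqrt(sum(wi * wi for wi in w))

# Hyperplane 3x + 4y - 5 = 0 (||w|| = 5); for the point (3, 4):
# |3*3 + 4*4 - 5| / 5 = 20 / 5 = 4
r = distance([3.0, 4.0], -5.0, [3.0, 4.0])
```
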
Example of SVM
Given two support vectors: (5, 1) belonging to class 1, (−1, −1) belonging to class 2
Calculate w1 and w0 so that the linear hyperplane y = w1x + w0 gives the maximum margin.
The maximum-margin separator is the perpendicular bisector of the segment joining the two support vectors: it passes through their midpoint (2, 0) with normal direction (6, 2), giving 3x + y − 6 = 0, i.e.
y = −3x + 6
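The worked example can be verified numerically, treating the separator as the perpendicular bisector of the two support vectors (a sketch of the geometry, not a general SVM solver):

```python
import math

# Support vectors from the example
p1 = (5.0, 1.0)    # class 1
p2 = (-1.0, -1.0)  # class 2

mid = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)  # point on the separator
normal = (p1[0] - p2[0], p1[1] - p2[1])           # w is parallel to p1 - p2

# Separator: normal . (x - mid) = 0  =>  6x + 2y - 12 = 0  =>  y = -3x + 6
w = normal
b = -(normal[0] * mid[0] + normal[1] * mid[1])

def signed(x):
    return w[0] * x[0] + w[1] * x[1] + b

# In slope-intercept form y = w1*x + w0:
slope = -w[0] / w[1]      # w1 = -3
intercept = -b / w[1]     # w0 = 6

# Both support vectors lie at the same distance from the separator
norm_w = math.hypot(*w)
r1 = abs(signed(p1)) / norm_w
r2 = abs(signed(p2)) / norm_w
```

Both support vectors are equidistant from the line and lie on opposite sides, confirming y = −3x + 6 is the maximum-margin separator.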
The Kernel Function
The linear classifier relies on the inner product between vectors: K(xi, xj) = xiTxj
If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:
K(xi, xj) = φ(xi)Tφ(xj)
A kernel function is a function that is equivalent to an inner product in some feature space. Every positive semi-definite symmetric function is a kernel.
A kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
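The "implicit mapping" can be made concrete for the degree-2 polynomial kernel in two dimensions, whose explicit feature map φ is known; a sketch (the test vectors are illustrative):

```python
import math

# Kernel trick: K(x, y) = (1 + xTy)^2 equals phi(x).phi(y) in a
# 6-dimensional feature space, without ever forming phi explicitly.

def poly_kernel(x, y):
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (d = 2)
    s2 = math.sqrt(2.0)
    return [1.0, s2 * x[0], s2 * x[1], x[0] ** 2, x[1] ** 2, s2 * x[0] * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [1.0, 2.0], [3.0, 0.5]
k_implicit = poly_kernel(x, y)    # computed in the 2-D input space
k_explicit = dot(phi(x), phi(y))  # computed in the 6-D feature space
```

The two values agree, yet the kernel computation touches only the original 2-D vectors; this is what lets SVMs work in very high-dimensional feature spaces cheaply.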
Nonlinear SVM
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable.
Φ: x → φ(x)
Examples of Kernel Functions
Linear: K(xi, xj) = xiTxj. Mapping Φ: x → φ(x), where φ(x) is x itself.
Polynomial of power p: K(xi, xj) = (1 + xiTxj)^p. Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions (d is the input dimension).
Gaussian (radial-basis function): K(xi, xj) = exp(−‖xi − xj‖² / 2σ²). Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of the functions for the support vectors is the separator.
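Evaluating the three example kernels on a pair of points (the vectors and the σ for the Gaussian kernel are illustrative choices):

```python
import math

# The three kernels from the slide, evaluated directly.

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, p):
    return (1 + linear_kernel(x, y)) ** p

def gaussian_kernel(x, y, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

x, y = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(x, y)         # 1*3 + 2*4 = 11
k_poly = poly_kernel(x, y, 2)       # (1 + 11)^2 = 144
k_rbf = gaussian_kernel(x, y, 1.0)  # exp(-8 / 2) = exp(-4)
```

Note the Gaussian kernel depends only on the distance between the points: identical points give K = 1, and K decays toward 0 as they move apart.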
Examples of SVM with Polynomial Kernel
Examples of SVM with Radial‐basis Function Kernel
Properties of SVM
Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets: only support vectors are used to specify the separating hyperplane
Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution