Nearest Neighbors and Naive Bayes
Mario Martin
Baseline algorithms
• Simple but effective algorithms. Two different methods:
  – Nearest Neighbor. Non-parametric: a lazy, instance-based learning method that does not build any model.
  – Naïve Bayes. Parametric: it builds a probabilistic model of your data following some assumptions.
Nearest Neighbor classifier
Instance Based Learning / Lazy Methods
Instance Based Learning
• Lazy learning methods: they don't build a model of the data.
• The label of a new observation is assigned depending on the labels of the "closest" training examples.
• Only requirements: a training set and a similarity measure.
Instance Based Learning Algorithms
• k-NN
• Distance Weighted kNN
• How to select k?
• How to solve some problems
k-NN
• The k-nearest neighbor algorithm interprets each example as a point in a space defined by the features describing the data.
• In that space, a similarity measure allows us to classify new examples: the class is assigned depending on the k closest examples.
1-NN example
• Two real features (x1, x2) define the space.
• Each red point is a positive example; black points are negative examples.
• Equivalent to drawing the Voronoi diagram of your data.
[Figure: the query point q is classified according to its nearest neighbor; here the new data point x is classified as positive.]
Distance measures
• The distance is a parameter of the algorithm. When the dataset is numeric, it is usually the Euclidean distance:

  d(x, x') = \sqrt{\sum_i (x_i - x'_i)^2}

• In mixed data sets, use the Gower distance or any other appropriate distance measure.
• CAVEAT: data should be normalized or standardized in order to give the same relevance to each feature in the computation of the distance (see the sketch below).
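A minimal sketch of the standardization caveat above, assuming numeric features stored in a NumPy array (the function names are illustrative):

    import numpy as np

    def standardize(X):
        # Zero mean and unit variance per feature, so every feature
        # contributes comparably to the distance.
        return (X - X.mean(axis=0)) / X.std(axis=0)

    def euclidean(a, b):
        # d(a, b) = sqrt(sum_i (a_i - b_i)^2)
        return np.sqrt(np.sum((a - b) ** 2))

Standardize the training set once, and apply the same transformation (same means and standard deviations) to any query point before computing distances.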
Some comments
• Advantages:
  – Fast training
  – Ability to learn very complex functions
• Problems:
  – Very slow at testing time; a smart tree-based data structure is needed to represent the data
  – Fooled by noise
  – Fooled by irrelevant features
Some comments
• Building more robust classifiers: results do not depend on the single closest example but on the k closest examples (hence the name k-nearest neighbours, kNN).
3-Nearest Neighbors
[Figure: the 3 nearest neighbors of the query point q contain 2 x's and 1 o, so the query is labeled x.]
7-Nearest Neighbors
[Figure: the 7 nearest neighbors of the same query point q contain 3 x's and 4 o's, so the query is now labeled o.]
k-NN algorithm
• Parameters:
  – A natural number k (an odd number)
  – A training set
  – A distance measure
• Algorithm (a sketch in code follows below):
  1. Store all the training set <xi, label(xi)>
  2. Given a new observation xq, compute its k nearest neighbors
  3. Let the k nearest neighbors vote to assign the label to the new data
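A minimal sketch of the algorithm above in plain NumPy (the function name is illustrative; labels can be any hashable values):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k=3):
        # 1. "Training" is just storing X_train and y_train.
        # 2. Distances from the query point to every stored example.
        dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
        # 3. The k closest examples vote; the majority label wins.
        nearest = np.argsort(dists)[:k]
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]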
How to select k?
• High values of k have two advantages:
  – Smoother frontiers
  – Reduced sensitivity to noise
• But too large values are bad because:
  – We lose locality in the decision, since very distant points can interfere in assigning labels
  – Computation time is increased
• The value of k is usually chosen by cross-validation, as sketched below.
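A minimal sketch of choosing k by cross-validation, assuming scikit-learn is available (the candidate values are illustrative):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def select_k(X, y, candidates=(1, 3, 5, 7, 9, 11)):
        # Score each odd k with 5-fold cross-validation and keep the best one.
        scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
                  for k in candidates}
        return max(scores, key=scores.get)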
Distance Weighted kNN
• A smart variation of kNN. When voting, all k neighbors have the same influence, but some of them are more distant than others (so they should have less influence on the decision).
[Figure: with k = 5, one class gets 2 votes and the other gets 3 votes.]
• Solution: give more weight to the closest examples.
Distance Weighted kNN
• Let's define a weight for each of the k closest examples:

  w_i = K(d(x_q, x_i))

  where x_q is the query point, x_i is the i-th closest example, d is the distance function, and K is the kernel (a decreasing function of the distance).
• The predicted label for x_q is computed according to:

  \hat{l}(x_q) = \mathrm{sign}\left( \sum_{i=1}^{k} w_i \, l(x_i) \right)

  where l(x_i) \in \{-1, 1\} is the label of example x_i, and w_i is the weight of example x_i.
• In the previous example (k = 5), closer neighbors would now contribute more to the vote than distant ones.
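A minimal sketch of the weighted vote above, using an inverse-distance kernel as one illustrative choice of K and labels in {-1, +1}:

    import numpy as np

    def inverse_distance_kernel(d, eps=1e-9):
        # A decreasing function of the distance; eps avoids division by zero.
        return 1.0 / (d + eps)

    def weighted_knn_predict(X_train, y_train, x_query, k=5, kernel=inverse_distance_kernel):
        dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        weights = kernel(dists[nearest])                         # w_i = K(d(x_q, x_i))
        return int(np.sign(np.dot(weights, y_train[nearest])))   # sign(sum_i w_i * l(x_i))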
Examples of kernel functions
[Figure: plots of example kernel functions K(d), decreasing with the distance d.]
Problems with irrelevant features
• k-NN is fooled when irrelevant features are widely present in the data set.
• For instance, examples may be described using 20 attributes, but only 2 of them are relevant to the classification…
• The solution consists in feature selection. For instance, use a weighted distance:

  d_z(x_q, x_i) = \sqrt{\sum_j z_j \, (x_{q,j} - x_{i,j})^2}

  – Limit the weights to 0 and 1. Notice that setting z_j = 0 means removing the feature.
  – Find the weights z_1, …, z_n (one for each feature) that minimize the error on a validation data set using cross-validation, as sketched below.
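A minimal sketch of the 0/1-weighted distance and a brute-force search for the weights; it uses a single validation split instead of full cross-validation and is only feasible for a handful of features (all names are illustrative, labels assumed in {-1, +1}):

    import numpy as np
    from itertools import product

    def weighted_distance(a, b, z):
        # z is a 0/1 vector: z[j] = 0 removes feature j from the distance.
        return np.sqrt(np.sum(z * (a - b) ** 2))

    def validation_error(z, X_train, y_train, X_val, y_val, k=3):
        # Error of a k-NN classifier (majority by sign) using the weighted distance.
        errors = 0
        for x, y in zip(X_val, y_val):
            dists = [weighted_distance(x, xt, z) for xt in X_train]
            nearest = np.argsort(dists)[:k]
            vote = np.sign(np.sum(y_train[nearest]))
            errors += int(vote != y)
        return errors / len(y_val)

    def best_weights(X_train, y_train, X_val, y_val, n_features):
        # Try every 0/1 assignment of z and keep the one with the smallest validation error.
        candidates = [np.array(zs) for zs in product([0, 1], repeat=n_features)]
        return min(candidates, key=lambda z: validation_error(z, X_train, y_train, X_val, y_val))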
Naïve Bayes
Probabilistic model
Naive Bayes basics
• From the examples in the dataset, we can estimate the likelihood of our data:

  P(x_1, x_2, \ldots, x_n \mid c_i)

  read as the probability of observing an example with features (x_1, x_2, …, x_n) [x_i represents feature i of observation x] in class c_i.
• But, for classifying an observation (x_1, x_2, …, x_n), we should look for the class that maximizes the probability of the observation belonging to the class:

  c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
Naïve Bayes classifiers
• We will use Bayes' theorem:

  P(c_j \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, \ldots, x_n)}

• So:

  c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
Computing probabilities
• P(c_j): simply the proportion of elements in class j.
• P(x_1, x_2, …, x_n | c_j): the problem is that there are |X|^n · |C| parameters! It can only be estimated from a very huge dataset. Impractical.
• Solution: independence assumption (very naïve): attribute values are independent given the class. In this case, we can easily compute

  P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)
Computing probabilities
• P(x_k | c_j): now we only need n · |C| probability estimations.
• Very easy: the number of cases with value x_k in class c_j over the total number of cases in class c_j.
• Solving now, the class assigned to a new observation is:

  c_{NB} = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)

  This is the equation to be used.
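A minimal sketch of these counting estimates for categorical features, without smoothing (plain Python, illustrative names):

    from collections import Counter, defaultdict

    def estimate_parameters(X, y):
        # X: list of feature tuples, y: list of class labels.
        n = len(y)
        class_counts = Counter(y)
        priors = {c: class_counts[c] / n for c in class_counts}    # P(c_j)
        value_counts = defaultdict(Counter)                        # counts for P(x_i = v | c_j)
        for features, c in zip(X, y):
            for i, v in enumerate(features):
                value_counts[(c, i)][v] += 1
        likelihoods = {key: {v: cnt / class_counts[key[0]] for v, cnt in counts.items()}
                       for key, counts in value_counts.items()}
        return priors, likelihoods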
Practical issues
• Since probabilities are in the range 0..1, products quickly lead to floating-point underflow errors. Knowing that log(xy) = log(x) + log(y), it is better to work with log(p) than with probabilities.
• Now:

  c_{NB} = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]
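A minimal sketch of the log-space decision rule, reusing the priors and likelihoods from the previous sketch (the tiny default for unseen values is purely illustrative; smoothing is discussed later):

    import math

    def predict(features, priors, likelihoods, unseen=1e-9):
        # argmax over classes of log P(c_j) + sum_i log P(x_i | c_j)
        best_class, best_score = None, float("-inf")
        for c, prior in priors.items():
            score = math.log(prior)
            for i, v in enumerate(features):
                score += math.log(likelihoods.get((c, i), {}).get(v, unseen))
            if score > best_score:
                best_class, best_score = c, score
        return best_class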
Example: Learning to classify texts
• Training set: X, a document corpus. Each document is labeled with f(x) = like/dislike.
• Goal: learn a function that, given a new document, tells whether you will like it or not.
• Questions:
  – How do we represent documents?
  – How do we compute the probabilities?
Example: Learning to classify texts
• How do we represent documents? Each document is represented as a Bag of Words.
  – Attributes: all words that appear in the documents.
  – So each document is represented as a boolean vector of length N: 0 – the word does not appear; 1 – the word appears.
• Practical problem: a very huge table. Solution: use a sparse representation of the matrices (see the sketch below).
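A minimal sketch of a boolean bag of words with a sparse representation: each document is stored only as the set of indices of the words it contains (names are illustrative):

    def build_vocabulary(documents):
        # documents: list of strings; the vocabulary maps each word to a column index.
        vocab = {}
        for doc in documents:
            for word in doc.lower().split():
                vocab.setdefault(word, len(vocab))
        return vocab

    def to_sparse_bow(doc, vocab):
        # Sparse boolean vector: only the indices of the words that appear are kept.
        return {vocab[w] for w in doc.lower().split() if w in vocab}

    docs = ["I like this movie", "I dislike this film"]
    vocab = build_vocabulary(docs)
    print(to_sparse_bow("this movie again", vocab))   # e.g. {2, 3}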
Example: Learning to classify texts
• Some numbers:
  – 10.000 documents
  – 500 words per document
  – Maximum theoretical number of words: 50.000 (much less in practice because of word repetitions)
• Reducing the number of attributes:
  – Remove number (singular/plural) and verbal forms (stemming)
  – Remove conjunctions, prepositions and articles (stop words)
  – Now we have about 10.000 attributes
Example: Learning to classify texts
• How to compute the probabilities? First compute, for each class, the "a priori" probability of the like and dislike classes:

  v_{NB} = \arg\max_{v \in \{like, dislike\}} P(v) \prod_i P(x_i = word_i \mid v)

  P(v_{like}) = \frac{\#\,\text{documents labeled like}}{\text{total number of documents}}
  \qquad
  P(v_{dislike}) = \frac{\#\,\text{documents labeled dislike}}{\text{total number of documents}}
Example: Learning to classify texts
• How to compute the probabilities? Second, compute for each word:

  P(word_k \mid v) = \frac{n_k}{n}

  where n_k is the number of documents of class v in the training set where word_k appears, and n is the number of documents of class v.
• The number of parameters to estimate is not too large: 10.000 words and two classes (so about 20.000).
• Problem:
  – When n_k is low, the probability estimate is not accurate.
  – When n_k is 0 for word_k in one class v, any document containing that word will never be assigned to v (independently of the other words that appear).
Example: Learning to classify texts
• Solution: a more robust computation of the probabilities (Laplace smoothing):

  P(word_k \mid v) = \frac{n_k + m\,p}{n + m}

• Where:
  – n_k is the number of documents of class v in which word_k appears
  – n is the number of documents with label v
  – p is an "a priori" estimate of P(x_k | v) (for instance, a uniform distribution)
  – m is the number of labels
• Smoothing with the most common "a priori" uniform distribution:
  1. When there are two classes: p = 1/2, m = 2 (Laplace rule):

     P(x_k \mid v) = \frac{n_k + 1}{n + 2}

  2. Generic case (c classes): p = 1/c, m = c:

     P(x_k \mid v) = \frac{n_k + 1}{n + c}

  A sketch of these estimates follows below.
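A minimal sketch of the Laplace-smoothed estimates for the like/dislike example, with documents given as the sparse word-index sets from the earlier sketch (names and defaults are illustrative):

    from collections import Counter

    def train_text_nb(docs, labels, vocab_size, m=2, p=0.5):
        # docs: list of sets of word indices; labels: "like" / "dislike".
        n_per_class = Counter(labels)                                   # n for each class v
        priors = {v: n_per_class[v] / len(labels) for v in n_per_class} # P(v)
        word_counts = {v: Counter() for v in n_per_class}               # n_k for each class v
        for doc, v in zip(docs, labels):
            word_counts[v].update(doc)
        cond = {v: {k: (word_counts[v][k] + m * p) / (n_per_class[v] + m)  # (n_k + m*p) / (n + m)
                    for k in range(vocab_size)}
                for v in n_per_class}
        return priors, cond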
Example: Learning to classify texts
• Naïve Bayes returns good accuracy results even when the independence assumption is not fulfilled. In fact, the Spam/not-Spam filter of Thunderbird works this way. It is applied to document filtering (e.g. newsgroups or incoming mail).
• Learning and testing time are linear in the number of attributes!
Extension to continuous attributes
• Assume that, within each class, each variable follows a normal distribution.
• For instance, if 73 is the average of the feature temp. for class x and the standard deviation is 6.2, we compute the conditional probability with the normal density:

  P(x_i = v \mid c) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{(v - \mu)^2}{2\sigma^2}}

  (a sketch follows below).
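A minimal sketch of this Gaussian conditional probability; the mean and standard deviation are the ones quoted above, and the query value 66 is purely illustrative:

    import math

    def gaussian_likelihood(v, mean, std):
        # Normal density used as P(x_i = v | c) for a continuous attribute.
        return math.exp(-((v - mean) ** 2) / (2 * std ** 2)) / math.sqrt(2 * math.pi * std ** 2)

    # Feature temp. for class x: mean 73, std 6.2 (from the slide); query value 66.
    print(gaussian_likelihood(66, mean=73, std=6.2))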