Lecture 6: k-Nearest Neighbors
Instructor: Saravanan Thirumuruganathan
CSE 5334 Saravanan Thirumuruganathan
Outline
1 Introduction to Classification
2 k-NN (Nearest Neighbor) Classifier
In-Class Quizzes
URL: http://m.socrative.com/
Room Name: 4f2bb99e
Introduction to Classification
Major Tasks in Data Mining
Predictive methods
Given some training data, build a model and use it to predict some variables of interest for unseen data
Descriptive methods
Given some data, identify some significant, novel, and useful patterns in the data that are interpretable by humans
Data Mining Tasks
Classification, Regression: Predictive
Clustering, Association Rule mining: Descriptive
Types of Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning
Dataset:
Training (labeled) data: D = {(x_i, y_i)}, x_i ∈ R^d
Test (unlabeled) data: x_0 ∈ R^d
Tasks:
Classification: y_i ∈ {1, 2, . . . , C}
Regression: y_i ∈ R
Objective: Given x_0, predict y_0
Called supervised learning because y_i is given during training
Unsupervised Learning
Given: dataset D = {x_i}
Objective: Find interesting patterns without explicit supervision
Tasks:
Clustering
Outlier detection
Dimensionality reduction
Many more
Reinforcement Learning
Training “agents” to take actions to maximize rewards
Reinforcement is given via action-reward
Objective: Find the optimal action a to take when in state x, in order to maximize long-term reward
Examples: learning correct answers from scores, self-driving cars, learning to fly helicopters autonomously, learning to play games
Classification Methods
Model based: Build a (simple) model from the training data and use it to predict unseen data
Memory based: Keep in memory all training data and use itto predict unseen data
Classification Models
Some of the methods we will discuss in the class:
Tree based: Decision and Regression trees
Instance based: Nearest Neighbor
Bayesian and Naive Bayes
Neural Networks and Deep Learning
Support Vector Machines
Binary and Multi-Class Classification
C = 2: Predict which of the two classes the unseen record belongs to
Spam or Ham for emails
Benign or malignant for tumours
C > 2: Multi-Class classification - predict the right class.
Categorize mail as important, social, or unimportant
Identify color of eyes
Identify wine type from features
Multi-class classification is often much harder
Trade-offs
Prediction accuracy versus interpretability
Good fit versus over-fit or under-fit
Parsimony versus black-box
Classification Design Cycle1
1 Collect data and labels (the real effort)
2 Choose features (the real ingenuity)
3 Pick a classifier (some ingenuity)
4 Train the classifier (some knobs, fairly mechanical)
5 Evaluate the classifier (needs care)
1http://www.cs.sun.ac.za/~kroon/courses/machine_learning/lecture2/kNN-intro_to_ML.pdf
k-NN Classifier
Instance based Classifiers
Store ALL the training data
Use the training data to predict class label for a new record
Common Examples:
Rote-Learner: Memorize the entire training data; predict a value only if the new record matches some training record
Nearest Neighbor: Use the k points closest to the new record to perform classification
Nearest Neighbor Methods
Non-parametric, model-free approaches
Formalized in the 1960s
Simple to understand and implement
Why k-NN
One of the top-10 Data Mining algorithms2
1-NN Error bounds: when the number of training points n tends to ∞ in a C-class problem, the 1-NN error rate (1NNER) is bounded in terms of the Bayes error rate (BER) by

BER ≤ 1NNER ≤ BER × (2 − (C / (C − 1)) × BER)

i.e., the 1-NN error rate is at most twice the BER
Asymptotically Consistent: With infinite training data and large enough k, k-NN approaches the best possible classifier (Bayes Optimal)
2http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf
k-Nearest Neighbor
Distance Metric: To compute the similarities between records
k: How many neighbors to look at?
A weighting function (optional)
Decision strategy: Often simple majority voting
k-NN Algorithm
1 Compute the test point’s distance from each training point
2 Sort the distances in ascending (or descending) order
3 Use the sorted distances to select the k nearest neighbors
4 Use majority rule (for classification) or averaging (for regression)
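The four steps above can be sketched in Python as follows (a minimal illustration; the function and variable names are my own, not from the slides):

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest
    training points. `train` is a list of (features, label) pairs."""
    # 1. Compute the test point's distance from each training point
    dists = [(math.dist(x, test_point), y) for x, y in train]
    # 2. Sort the distances in ascending order
    dists.sort(key=lambda t: t[0])
    # 3. Select the k nearest neighbors
    neighbors = dists[:k]
    # 4. Majority rule over the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((0.9, 1.1), "A")]
print(knn_predict(train, (1.1, 1.0), k=3))  # "A": all 3 nearest points are class A
```

For regression, step 4 would instead average the neighbors' y values.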
1-NN Example3
3http://www.lkozma.net/knn2.pdf
k-NN Example4
4http://www.lkozma.net/knn2.pdf
Distance Metric
Used to compute similarity between entities
If all values are numeric, Euclidean measure is often used
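For two numeric records, the Euclidean distance is the square root of the summed squared coordinate differences; a small worked example (values chosen for illustration):

```python
import math

# Euclidean distance between two numeric records
a = (1.0, 2.0, 3.0)
b = (4.0, 6.0, 3.0)
dist = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
print(dist)  # 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5
```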
Voronoi Cells in 2D5
5http://www.lkozma.net/knn2.pdf
Distance Metric
Feature Normalization
Features should be on the same scale
Example: if one feature has its values in millimeters and another in centimeters, we would need to normalize
Common way: Center and Normalize to get 0 mean and unitvariance
z_i = (x_i − x̄_i) / σ_i
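Centering and scaling a single feature to zero mean and unit variance can be sketched as below (a minimal illustration; the example values are hypothetical):

```python
import statistics

def zscore(values):
    """Center a feature to zero mean and scale to unit variance."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]

heights_mm = [1700, 1800, 1600]  # hypothetical feature measured in millimeters
print(zscore(heights_mm))  # mean of result is 0, variance is 1
```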
Finding Optimal k
Often k-NN has lower error rate than 1-NN
But the error does not monotonically decrease
Picking k: Cross validation
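One simple way to pick k by cross-validation is leave-one-out: classify each training point using all the others and count errors. A sketch under that assumption (the helper names and toy data are my own):

```python
import math
from collections import Counter

def knn_predict(train, point, k):
    dists = sorted((math.dist(x, point), y) for x, y in train)
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def loocv_error(data, k):
    """Leave-one-out cross-validation error rate for a given k."""
    errors = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # hold out point i
        if knn_predict(rest, x, k) != y:
            errors += 1
    return errors / len(data)

data = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
        ((1.0, 1.0), "B"), ((1.1, 0.9), "B"), ((0.9, 1.1), "B")]
# Pick the candidate k with the lowest cross-validation error
best_k = min([1, 3, 5], key=lambda k: loocv_error(data, k))
print(best_k)  # 1 (k=5 always outvotes the held-out point's class here)
```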
Impact of k6
6http://courses.cs.tamu.edu/rgutier/cs790_w02/l8.pdf
Impact of k7
7ISLR
Impact of k8
8ISLR
Impact of k9
9ISLR
Impact of k10
Small k
Creates many small regions for each class
May lead to non-smooth decision boundaries and overfitting
Leads to higher variance (i.e., the classifier is less stable)
Large k
Creates fewer, larger regions
Usually leads to smoother decision boundaries (although boundaries that are too smooth might underfit)
Leads to higher bias (i.e., the classifier is less precise)
10http://www.cs.cornell.edu/courses/CS4758/2013sp/materials/cs4758-knn-lectureslides.pdf
Weighted k-NN
Often you might want to use some weights
Typically to give higher weight to nearby points than to points that are farther away
One possibility: 1/dist² (i.e., the inverse of the squared distance)
Alternatively, give more weight to similarity on importantfeatures
dist(x_i, x_j) = Σ_{k=1}^{d} w_k · dist(x_ik, x_jk)
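Weighting votes by inverse squared distance can be sketched as follows (a minimal illustration; the `eps` guard against division by zero for exact matches is my own choice, not from the slides):

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, point, k=3, eps=1e-12):
    """k-NN with each neighbor's vote weighted by 1/dist^2."""
    nearest = sorted((math.dist(x, point), y) for x, y in train)[:k]
    votes = defaultdict(float)
    for d, y in nearest:
        votes[y] += 1.0 / (d * d + eps)  # closer neighbors vote more strongly
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((2.0, 0.0), "B"), ((2.1, 0.1), "B")]
# Plain majority vote with k=3 would pick "B" (two B neighbors),
# but the single very close "A" point dominates under 1/dist^2 weighting:
print(weighted_knn_predict(train, (0.1, 0.0), k=3))  # "A"
```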
Computational Complexity
O(nd) where n is the training set size and d is the number of dimensions
VERY expensive, computationally
Often, special data structures such as Voronoi diagrams and KD-trees are used to speed things up
Other Things to Watch Out For
Missing data (features) will cause problems
Sensitive to class outliers
Sensitive to irrelevant features (so ensure feature engineeringand normalization are done first)
Summary
Major Concepts:
Major data mining tasks
Classification basics
k-NN, variants - pros and cons