TRANSCRIPT
Machine Learning, ICS 178
Instructor: Max Welling
Visualization & k-Nearest Neighbors
Types of Learning
• Supervised Learning
• Labels are provided, there is a strong learning signal.
• e.g. classification, regression.
• Semi-supervised Learning.
• Only part of the data have labels.
• e.g. a child growing up.
• Reinforcement learning.
• The learning signal is a (scalar) reward and may come with a delay.
• e.g. trying to learn to play chess, a mouse in a maze.
• Unsupervised learning
• There is no direct learning signal. We are simply trying to find structure in data.
• e.g. clustering, dimensionality reduction.
Ingredients
• Data:
• what kind of data do we have?
• Prior assumptions:
• What do we know a priori about the problem?
• Representation:
• How do we represent the data?
• Model / Hypothesis space:
• What hypotheses are we willing to entertain to explain the data?
• Feedback / learning signal:
• What kind of learning signal do we have (delayed, labels)?
• Learning algorithm:
• How do we update the model (or set of hypotheses) from feedback?
• Evaluation:
• How well did we do? Should we change the model?
Data Preprocessing
• Before you start modeling the data, you want to have a look at it to get a “feel”.
• What are the “modalities” of the data? e.g.
• Netflix: users and movies
• Text: words-tokens and documents
• Video: pixels, frames, color-index (R,G,B)
• What is the domain?
• Netflix: rating-values [1,2,3,4,5,?]
• Text: # times a word appears: [0,1,2,3,...]
• Video: brightness value: [0,..,255] or real-valued.
• Are there missing data-entries?
• Are there outliers in the data? (perhaps a typo?)
Data Preprocessing
• Often it is a good idea to compute the mean and variance of the data.
• Mean gives you a sense of location, Variance/STD a sense of scale.
• Even better is to histogram the data. Tricky issue: how do you choose the bin size? Too small: you see noise; too big: it’s one clump.
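Histogramming with a few different bin counts makes this trade-off concrete. A minimal sketch in NumPy (the synthetic Gaussian data is an assumption for illustration):

```python
import numpy as np

# Synthetic data standing in for a real data column (an assumption).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=1000)

# Too few bins -> one clump; too many bins -> you mostly see noise.
for bins in (3, 30, 300):
    counts, edges = np.histogram(data, bins=bins)
    print(f"{bins:4d} bins: tallest bin holds {counts.max()} points")
```

Plotting `counts` against the bin edges (e.g. as a bar chart) gives the visual version of the same trade-off.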
$E[X] = \frac{1}{N}\sum_{n=1}^{N} x_n$   (mean)
$VAR[X] = \frac{1}{N}\sum_{n=1}^{N} \left(x_n - E[X]\right)^2$   (variance)
$STD[X] = \sqrt{VAR[X]}$   (standard deviation)
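These formulas can be written out directly; a small sketch with toy numbers (the data values are an assumption, chosen so the answers come out round):

```python
import numpy as np

# Toy data column (an assumption for illustration).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

mean = x.sum() / N                  # E[X]: sense of location
var = ((x - mean) ** 2).sum() / N   # VAR[X]: sense of scale (squared)
std = np.sqrt(var)                  # STD[X]: scale in the data's own units

print(mean, var, std)  # 5.0 4.0 2.0
```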
Preprocessing
• For Netflix you can histogram this for both modalities:
• The rating distribution over users for a movie.
• The rating distribution over movies for a user.
• The rating distribution over users for all movies jointly.
• The rating distribution over all movies for all users jointly.
• You can compute properties and plot them against each other. For example:
• Compute the user-specific mean and variance over movies and plot a scatter plot:
[Scatter plot: user-mean on the x-axis, user-variance on the y-axis; every dot is a different user.]
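This scatter plot boils down to two numbers per user. A sketch on a toy ratings matrix (rows are users, columns are movies, `nan` marks a missing rating; the matrix values are an assumption):

```python
import numpy as np

# Toy user-by-movie rating matrix; np.nan = "?" (missing entry).
R = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [np.nan, 1.0, 5.0, 4.0]])

# Per-user mean and variance over the movies that user actually rated.
user_mean = np.nanmean(R, axis=1)
user_var = np.nanvar(R, axis=1)

# Each (user_mean[i], user_var[i]) pair is one dot in the scatter plot.
for m, v in zip(user_mean, user_var):
    print(f"mean={m:.2f}  var={v:.2f}")
```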
Scatter-Plots
This shows all the 2-D projections of the “Iris data”.
Color indicates the class of iris.
How many attributes do we have for Iris?
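A quick way to answer this question is to load the data set and inspect its shape; a sketch using scikit-learn's bundled copy of Iris (using scikit-learn here is an assumption):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (n_flowers, n_attributes)
print(iris.feature_names)   # the attributes paired up in the scatter matrix
print(set(iris.target))     # the classes that color the dots
```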
3-D visualization
[Figures: contour plot and meshgrid plot.]
Embeddings
• Every red dot represents an image.
• An image has roughly 1000 pixels.
• Each image is projected to a 2-D space
• Projections are such that similar images are projected to similar locations in the 2-D embedding.
• This gives us an idea of how the data is organized.
These plots are produced by “locally linear embedding” (LLE):
http://www.cs.toronto.edu/~roweis/lle/
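scikit-learn ships an implementation of this method; a minimal sketch on a synthetic 3-D “swiss roll” standing in for the images (the data set and parameter values are assumptions):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# 500 points on a 3-D swiss roll stand in for the high-dimensional images.
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Project to 2-D so that each point stays close to its 10 nearest neighbors.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
Y = lle.fit_transform(X)

print(Y.shape)  # one 2-D location per input point
```

Scattering the rows of `Y` reproduces the kind of embedding plot shown above.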
Embeddings
Visualization by Clustering
By performing a clustering of the data and looking at the cluster-prototypes you can get an idea of the type of data.
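A sketch of this idea with k-means, whose cluster centers serve as the prototypes (the toy 2-D data and the choice of k-means are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three toy 2-D blobs standing in for real data (an assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

# Fit 3 clusters; each cluster center is a "prototype" you can inspect.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```

For image data, each row of `cluster_centers_` is itself an image you can display.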
Preprocessing
• Often it is useful to “standardize” (or “whiten”) the data before you start modeling.
• The idea is to remove the mean and the variance so that your algorithm can focus on more sophisticated (higher order) structure.
1) $x_{in} \leftarrow x_{in} - E[X_i]$
2) $x_{in} \leftarrow x_{in} / STD[X_i]$
In that order!
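These two standardization steps can be sketched in NumPy (the toy matrix, with one small-scale and one large-scale attribute, is an assumption):

```python
import numpy as np

# Toy data: attribute 0 lives around 1-3, attribute 1 around 200-600.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X = X - X.mean(axis=0)   # 1) subtract the per-attribute mean E[X_i]
X = X / X.std(axis=0)    # 2) divide by the per-attribute STD[X_i]

print(X.mean(axis=0))    # ~0 for every attribute
print(X.std(axis=0))     # 1 for every attribute
```

After this, both attributes contribute on the same scale, so the algorithm can focus on the higher-order structure.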
Be Creative!
WEKA DEMO