521495A: Artificial Intelligence
Lectured by
Abdenour Hadid
Associate Professor, CMVS, University of Oulu
Slides adapted from http://ai.berkeley.edu
Classification, Perceptron, Clustering
Q-Learning
Non-Deterministic Search
Solving MDPs
Q-Learning
Q-Learning: sample-based Q-value iteration
Learn Q(s,a) values as you go:
Receive a sample (s, a, s', r)
Consider your old estimate: Q(s,a)
Consider your new sample estimate: sample = r + γ max_a' Q(s', a')
Incorporate the new estimate into a running average: Q(s,a) ← (1 − α) Q(s,a) + α · sample
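In code, one update step looks like this (a minimal sketch, assuming a Q-table keyed by (state, action) pairs; the names q_values, gamma, and actions are illustrative, not from the slides):

    from collections import defaultdict

    gamma = 0.9                    # discount factor (assumed value)
    alpha = 0.4                    # learning rate, as in the example below
    q_values = defaultdict(float)  # Q(s, a) table; everything starts at 0

    def q_update(s, a, s_next, r, actions):
        # New sample estimate: r + gamma * max_a' Q(s', a')
        sample = r + gamma * max(q_values[(s_next, a2)] for a2 in actions)
        # Running average: Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * sample
        q_values[(s, a)] = (1 - alpha) * q_values[(s, a)] + alpha * sample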
Q-Learning
Worked example (gridworld demo): all Q(s,a) start at 0 with learning rate α = 0.4; the Q-values are updated from observed samples as the agent moves through the grid.
Classification, Perceptron, Clustering
Machine Learning
Up until now: how to use a model to make optimal decisions
Machine learning: how to acquire a model from data / experience
Today: Classification, Perceptron and Clustering
Concepts we will study today
Classification
Assignment of a label to a given input value
Classification
Inputs → Classifier → Label
Classification
(Figure: the spam email "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …" is fed into the classifier.)
(Figure: an input image is fed into the classifier, which outputs the label "2".)
Example: Spam Filter
Input: an email
Output: spam/ham (spam: unsolicited bulk messages, especially advertising)
Setup: Get a large collection of example emails, each labeled "spam" or "ham"
Note: someone has to hand-label all this data!
Want to learn to predict labels of new, future emails
Features: the attributes used to make the ham/spam decision
Words: FREE!
Text patterns: $dd, CAPS
Non-text: SenderInContacts
…
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Example: Emotion Recognition
Input: face images/videos
Output: an expression (happy, sad, angry, etc.)
Setup: Get a collection of example images, each labeled with an expression
Note: someone has to hand-label all this data!
Want to learn to predict labels of new, future face images
Features: the attributes used to make the expression decision
Mouth open = ON
Facial muscles
Eyebrows are squeezed
Example: Digit Recognition
Input: images / pixel grids
Output: a digit 0-9
Setup: Get a large collection of example images, each labeled with a digit
Note: someone has to hand-label all this data!
Want to learn to predict labels of new, future digit images
Features: the attributes used to make the digit decision
Pixels: (6,8) = ON
Shape patterns: NumComponents, AspectRatio, NumLoops
…
(Figure: example images labeled 0, 1, 2, 1, and a new image whose label "??" is to be predicted.)
Classification
Inputs → Classifier → Label
Think about some examples of classification problems
Other Classification Tasks
Classification: given inputs x, predict labels (classes) y
Examples:
Spam detection (input: document, classes: spam/ham)
OCR (input: images, classes: characters)
Medical diagnosis (input: symptoms, classes: diseases)
Automatic essay grading (input: document, classes: grades)
Fraud detection (input: account activity, classes: fraud/no fraud)
Customer service email routing
… many more
Classification is an important commercial technology!
Training and Testing
Important Concepts
Data: labeled instances, e.g. emails marked spam/ham
Training set
Held-out set (development)
Test set
Features: attribute-value pairs which characterize each x
Experimentation cycle:
Learn parameters (e.g. model probabilities) on the training set
(Tune hyperparameters on the held-out set)
Compute accuracy on the test set
Very important: never "peek" at the test set!
Evaluation: accuracy = fraction of instances predicted correctly (see the sketch after this list)
Overfitting and generalization:
Want a classifier which does well on test data
Overfitting: fitting the training data very closely, but not generalizing well
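A small sketch of the evaluation step (the fit/predict interface here is an assumption, not from the slides):

    def accuracy(classifier, examples, labels):
        # Accuracy: fraction of instances predicted correctly
        correct = sum(1 for x, y in zip(examples, labels)
                      if classifier.predict(x) == y)
        return correct / len(examples)

    # Cycle: fit on the training set, tune hyperparameters on the
    # held-out set, and compute accuracy on the test set only once.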
(Figure: data split into Training Data, Held-Out Data, and Test Data.)
Generalization and Overfitting
Overfitting
Features
Errors, and What to Do
Examples of errors
Dear GlobalSCAPE Customer,
GlobalSCAPE has partnered with ScanSoft to offer you the
latest version of OmniPage Pro, for just $99.99* - the regular
list price is $499! The most common question we've received
about this offer is - Is this genuine? We would like to assure
you that this offer is authorized by ScanSoft, is genuine and
valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate,
click through to
http://www.amazon.com/apparel
and see the prominent link for the $30 offer. All details are
there. We hope you enjoyed receiving this message. However, if
you'd rather not receive future e-mails announcing new store
launches, please click . . .
What to Do About Errors?
Need more features – words aren't enough!
Have you emailed the sender before?
Have 1K other people just gotten the same email?
Is the sending information consistent?
Is the email in ALL CAPS?
Do URLs point where they say they point?
Does the email address you by (your) name?
Can add these information sources as new variables
Baselines
First step: get a baseline
Baselines are very simple "straw man" procedures
Help determine how hard the task is
Help know what a "good" accuracy is
Weak baseline: the most-frequent-label classifier gives all test instances whatever label was most common in the training set (see the sketch after this list)
E.g. for spam filtering, might label everything as ham
Accuracy might be very high if the problem is skewed
E.g. calling everything "ham" gets 66%, so a classifier that gets 70% isn't very good…
For real research, usually use previous work as a (strong) baseline
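A sketch of the weak baseline described above (the helper name is hypothetical):

    from collections import Counter

    def most_frequent_label_baseline(train_labels):
        # Give every test instance whatever label was most common in training.
        majority = Counter(train_labels).most_common(1)[0][0]
        return lambda x: majority

    predict = most_frequent_label_baseline(["ham", "ham", "spam"])
    print(predict("any email at all"))  # -> "ham"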
Confidences from a Classifier
The confidence of a probabilistic classifier: the posterior over the top label
Represents how sure the classifier is of the classification
Any probabilistic model will have confidences
No guarantee confidence is correct
Calibration:
Weak calibration: higher confidences mean higher accuracy
Strong calibration: confidence predicts accuracy rate
Precision vs. Recall
Let's say we want to classify web pages as homepages or not
In a test set of 1K pages, there are 3 homepages
Our classifier says they are all non-homepages: 99.7% accuracy!
Need new measures for rare positive events
Precision: fraction of guessed positives which were actually positive ("fraction of retrieved instances that are relevant")
Recall: fraction of actual positives which were guessed as positive (sensitivity; "fraction of relevant instances that are retrieved")
Say we guess 5 homepages, of which 2 were actually homepages:
Precision: 2 correct / 5 guessed = 0.4
Recall: 2 correct / 3 true = 0.67
Which is more important in airport face recognition (Criminals)?
(Figure: 2×2 grid of guessed +/− versus actual +/−.)
Precision vs. Recall
Precision/recall tradeoff: often, you can trade off precision and recall
Only works well with weakly calibrated classifiers
To summarize the tradeoff:
Break-even point: the precision value when p = r
F-measure: the harmonic mean of p and r: F = 2pr / (p + r)
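The homepage numbers above, worked in a short sketch (variable names are assumptions):

    def precision_recall_f1(true_pos, guessed_pos, actual_pos):
        p = true_pos / guessed_pos   # fraction of guesses that were right
        r = true_pos / actual_pos    # fraction of true positives found
        f = 2 * p * r / (p + r)      # F-measure: harmonic mean of p and r
        return p, r, f

    print(precision_recall_f1(2, 5, 3))  # precision 0.4, recall ≈ 0.67, F ≈ 0.5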
False Positive versus False Negative
Perceptron
Linear Classifiers
Feature Vectors
Hello,
Do you want free printr
cartriges? Why pay more
when you can get them
ABSOLUTELY FREE! Just
# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...
SPAM or +
PIXEL-7,12 : 1
PIXEL-7,13 : 0
...
NUM_LOOPS : 1
...
“2”
Some (Simplified) Biology
Very loose inspiration: human neurons
Linear Classifiers
Inputs are feature values
Each feature has a weight
Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
If the activation is:
Positive, output +1
Negative, output -1
(Figure: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed, and compared to 0.)
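In code, the decision rule is just a dot product followed by a sign check (a minimal sketch; representing feature vectors as plain dicts is an assumption, not from the slides):

    def activation(weights, features):
        # w · f(x) = sum_i w_i * f_i(x); missing features count as 0
        return sum(weights.get(k, 0.0) * v for k, v in features.items())

    def classify(weights, features):
        return +1 if activation(weights, features) > 0 else -1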
Weights
Binary case: compare features to a weight vector
Learning: figure out the weight vector from examples
# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...
# free : 4
YOUR_NAME : -1
MISSPELLED : 1
FROM_FRIEND : -3
...
# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
...
Dot product positive means the positive class
Decision Rules
Binary Decision Rule
In the space of feature vectors
Examples are points
Any weight vector defines a hyperplane
One side corresponds to Y=+1
Other corresponds to Y=-1
BIAS : -3
free : 4
money : 2
...
(Figure: the decision boundary in the 2D feature space with axes "free" and "money"; one side is +1 = SPAM, the other is -1 = HAM.)
Weight Updates
Learning: Binary Perceptron
Start with weights = 0
For each training instance:
Classify with current weights
If correct (i.e., y=y*), no change!
If wrong: adjust the weight vector by adding or subtracting the feature vector: w ← w + y* · f(x) (subtract when y* is -1)
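A minimal sketch of the whole loop, reusing the classify helper from the linear-classifier sketch above (the dataset format is an assumption):

    def train_perceptron(data, passes=10):
        # data: list of (features, y_star) pairs with y_star in {+1, -1}
        weights = {}
        for _ in range(passes):
            for features, y_star in data:
                y = classify(weights, features)
                if y != y_star:                    # wrong: adjust the weights
                    for k, v in features.items():  # w <- w + y* * f(x)
                        weights[k] = weights.get(k, 0.0) + y_star * v
        return weights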
Examples: Perceptron
Separable Case
Properties of Perceptrons
Separability: true if some parameters get the training set perfectly correct
Convergence: if the training data are separable, the perceptron will eventually converge (binary case)
Mistake Bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability
Separable
Non-Separable
Examples: Perceptron
Non-Separable Case
Linear Separators
Which of these linear separators is optimal?
Support Vector Machines
Maximizing the margin: good according to intuition, theory, practice
Only support vectors matter; other training examples are ignorable
Support vector machines (SVMs) find the separator with max margin
SVM
Nearest Neighbors and Kernels
Non-Separable Data
Nearest-Neighbor Classification
Nearest neighbor for digits:
Take a new image
Compare to all training images
Assign based on the closest example
Encoding: each image is a vector of pixel intensities
What's the similarity function? The dot product of two image vectors?
Usually normalize vectors so ||x|| = 1
(Figure: training digit images labeled 0, 1, 2 and new test images to be classified.)
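A sketch of nearest-neighbor classification with this normalized dot-product similarity (pure-Python vectors for clarity; a real system would use NumPy):

    import math

    def normalize(x):
        n = math.sqrt(sum(v * v for v in x))
        return [v / n for v in x]

    def nearest_neighbor_label(train, test_x):
        # train: list of (vector, label) pairs, vectors already normalized
        test_x = normalize(test_x)
        def sim(a):
            return sum(u * v for u, v in zip(a, test_x))  # dot product
        _, best_label = max(train, key=lambda pair: sim(pair[0]))
        return best_label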
Similarity Functions
Basic Similarity
Many similarities are based on feature dot products: K(x, x') = φ(x) · φ(x') = Σ_i φ_i(x) φ_i(x')
If the features are just the pixels: K(x, x') = x · x'
Note: not all similarities are of this form
A Tale of Two Approaches…
Nearest neighbor-like approaches
Can use fancy similarity functions
Don’t actually get to do explicit learning
Perceptron-like approaches
Explicit training to reduce empirical error
Can’t use fancy similarity, only linear
Non-Linearity
Non-Linear Separators
Data that is linearly separable works out great for linear decision rules:
But what are we going to do if the dataset is just too hard?
How about… mapping data to a higher-dimensional space:
(Figure: 1D data that is not separable on the x axis becomes separable after mapping each point x to (x, x²).)
Non-Linear Separators
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Some Kernels
Kernels implicitly map the original vectors to a higher-dimensional space, take the dot product there, and hand the result back
Linear kernel: K(x, x') = x · x'
Quadratic kernel: K(x, x') = (x · x' + 1)²
RBF kernel: K(x, x') = exp(−‖x − x'‖² / (2σ²)), an infinite-dimensional representation
Discrete kernels: e.g. string kernels
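The three continuous kernels above as short functions (sigma is an assumed bandwidth parameter):

    import math

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    def linear_kernel(x, y):
        return dot(x, y)

    def quadratic_kernel(x, y):
        return (dot(x, y) + 1) ** 2

    def rbf_kernel(x, y, sigma=1.0):
        # exp(-||x - y||^2 / (2 sigma^2))
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-d2 / (2 * sigma ** 2))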
Clustering
Recap: Classification
Classification systems: Supervised learning
Make a prediction given evidence
We’ve seen several methods for this
Useful when you have labeled data
Clustering
Clustering systems: Unsupervised learning
Detect patterns in unlabeled data
E.g. group emails or search results
E.g. find categories of customers
E.g. detect anomalous program executions
Useful when you don't know what you're looking for
Requires data, but no labels
Clustering
Basic idea: group together similar instances
Example: 2D point patterns
What could "similar" mean? One option: small (squared) Euclidean distance, dist(x, y) = ‖x − y‖²
K-Means
An iterative clustering algorithm (see the sketch below):
Pick K random points as cluster centers (means)
Alternate:
Assign each data instance to the closest mean
Assign each mean to the average of its assigned points
Stop when no point's assignment changes
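A minimal K-means sketch on points given as tuples (random initialization; keeping the old mean for an empty cluster is one of several possible choices):

    import random

    def kmeans(points, k, iters=100):
        centers = random.sample(points, k)  # pick K random points as means
        for _ in range(iters):
            # Assign each data instance to the closest mean
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(p, centers[j])))
                clusters[i].append(p)
            # Assign each mean to the average of its assigned points
            new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl))
                           if cl else centers[j]
                           for j, cl in enumerate(clusters)]
            if new_centers == centers:      # nothing changed: converged
                break
            centers = new_centers
        return centers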
How many clusters to pick?
Similarity measure? It depends on the application.
About features
e.g. face descriptors for recognition
The face features, for instance, should be robust to changes in pose, facial expression, illumination, aging, occlusion, presence of glasses, etc.
(Figure: face images under different conditions should yield the same features.)
Recap: Keywords/Concepts