521495A: Artificial Intelligence
Lectured by
Abdenour Hadid
Associate Professor, CMVS, University of Oulu
Slides adapted from http://ai.berkeley.edu
Classification, Perceptron, Clustering
Q-Learning
Non-Deterministic Search
Solving MDPs
Q-Learning
Q-Learning: sample-based Q-value iteration
Learn Q(s,a) values as you go:
Receive a sample (s, a, s', r)
Consider your old estimate: Q(s,a)
Consider your new sample estimate: sample = r + γ max_a' Q(s', a')
Incorporate the new estimate into a running average: Q(s,a) ← (1 − α) Q(s,a) + α · sample
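In code, one update step looks like this (a minimal sketch, assuming a Q-table keyed by (state, action) pairs; the names q_values, gamma, and actions are illustrative, not from the slides):

    from collections import defaultdict

    gamma = 0.9                    # discount factor (assumed value)
    alpha = 0.4                    # learning rate, as in the example below
    q_values = defaultdict(float)  # Q(s, a) table; everything starts at 0

    def q_update(s, a, s_next, r, actions):
        # New sample estimate: r + gamma * max_a' Q(s', a')
        sample = r + gamma * max(q_values[(s_next, a2)] for a2 in actions)
        # Running average: Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * sample
        q_values[(s, a)] = (1 - alpha) * q_values[(s, a)] + alpha * sample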
Q-Learning
Worked example (gridworld demo): all Q(s,a) start at 0 with learning rate α = 0.4; the Q-values are updated from observed samples as the agent moves through the grid.
Classification, Perceptron, Clustering
Machine Learning
Up until now: how to use a model to make optimal decisions
Machine learning: how to acquire a model from data / experience
Today: Classification, Perceptron and Clustering
Concepts we will study today
Classification
Assignment of a label to a given input value
Classification
Inputs → Classifier → Label
Classification
(Figure: the spam email "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …" is fed into the classifier.)
(Figure: an input image is fed into the classifier, which outputs the label "2".)
Example: Spam Filter
Input: an email
Output: spam/ham (spam: unsolicited bulk messages, especially advertising)
Setup: Get a large collection of example emails, each labeled "spam" or "ham"
Note: someone has to hand-label all this data!
Want to learn to predict labels of new, future emails
Features: the attributes used to make the ham/spam decision
Words: FREE!
Text patterns: $dd, CAPS
Non-text: SenderInContacts
…
Dear Sir.
First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …
TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT.
99 MILLION EMAIL ADDRESSES FOR ONLY $99
Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.
Example: Emotion Recognition
Input: face images/videos
Output: an expression (happy, sad, angry, etc.)
Setup: Get a collection of example images, each labeled with an expression
Note: someone has to hand-label all this data!
Want to learn to predict labels of new, future face images
Features: the attributes used to make the expression decision
Mouth open = ON
Facial muscles
Eyebrows are squeezed
Example: Digit Recognition
Input: images / pixel grids
Output: a digit 0-9
Setup: Get a large collection of example images, each labeled with a digit
Note: someone has to hand-label all this data!
Want to learn to predict labels of new, future digit images
Features: the attributes used to make the digit decision
Pixels: (6,8) = ON
Shape patterns: NumComponents, AspectRatio, NumLoops
…
(Figure: example images labeled 0, 1, 2, 1, and a new image whose label "??" is to be predicted.)
Classification
Inputs → Classifier → Label
Think about some examples of classification problems
Other Classification Tasks
Classification: given inputs x, predict labels (classes) y
Examples:
Spam detection (input: document, classes: spam/ham)
OCR (input: images, classes: characters)
Medical diagnosis (input: symptoms, classes: diseases)
Automatic essay grading (input: document, classes: grades)
Fraud detection (input: account activity, classes: fraud/no fraud)
Customer service email routing
… many more
Classification is an important commercial technology!
Training and Testing
Important Concepts
Data: labeled instances, e.g. emails marked spam/ham
Training set
Held-out set (development)
Test set
Features: attribute-value pairs which characterize each x
Experimentation cycle:
Learn parameters (e.g. model probabilities) on the training set
(Tune hyperparameters on the held-out set)
Compute accuracy on the test set
Very important: never "peek" at the test set!
Evaluation: accuracy = fraction of instances predicted correctly (see the sketch after this list)
Overfitting and generalization:
Want a classifier which does well on test data
Overfitting: fitting the training data very closely, but not generalizing well
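A small sketch of the evaluation step (the fit/predict interface here is an assumption, not from the slides):

    def accuracy(classifier, examples, labels):
        # Accuracy: fraction of instances predicted correctly
        correct = sum(1 for x, y in zip(examples, labels)
                      if classifier.predict(x) == y)
        return correct / len(examples)

    # Cycle: fit on the training set, tune hyperparameters on the
    # held-out set, and compute accuracy on the test set only once.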
(Figure: data split into Training Data, Held-Out Data, and Test Data.)
Generalization and Overfitting
Overfitting
Features
Errors, and What to Do
Examples of errors
Dear GlobalSCAPE Customer,
GlobalSCAPE has partnered with ScanSoft to offer you the
latest version of OmniPage Pro, for just $99.99* - the regular
list price is $499! The most common question we've received
about this offer is - Is this genuine? We would like to assure
you that this offer is authorized by ScanSoft, is genuine and
valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate,
click through to
http://www.amazon.com/apparel
and see the prominent link for the $30 offer. All details are
there. We hope you enjoyed receiving this message. However, if
you'd rather not receive future e-mails announcing new store
launches, please click . . .
What to Do About Errors?
Need more features – words aren't enough!
Have you emailed the sender before?
Have 1K other people just gotten the same email?
Is the sending information consistent?
Is the email in ALL CAPS?
Do URLs point where they say they point?
Does the email address you by (your) name?
Can add these information sources as new variables
Baselines
First step: get a baseline
Baselines are very simple "straw man" procedures
Help determine how hard the task is
Help know what a "good" accuracy is
Weak baseline: the most-frequent-label classifier gives all test instances whatever label was most common in the training set (see the sketch after this list)
E.g. for spam filtering, might label everything as ham
Accuracy might be very high if the problem is skewed
E.g. calling everything "ham" gets 66%, so a classifier that gets 70% isn't very good…
For real research, usually use previous work as a (strong) baseline
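A sketch of the weak baseline described above (the helper name is hypothetical):

    from collections import Counter

    def most_frequent_label_baseline(train_labels):
        # Give every test instance whatever label was most common in training.
        majority = Counter(train_labels).most_common(1)[0][0]
        return lambda x: majority

    predict = most_frequent_label_baseline(["ham", "ham", "spam"])
    print(predict("any email at all"))  # -> "ham"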
Confidences from a Classifier
The confidence of a probabilistic classifier: the posterior over the top label
Represents how sure the classifier is of the classification
Any probabilistic model will have confidences
No guarantee confidence is correct
Calibration:
Weak calibration: higher confidences mean higher accuracy
Strong calibration: confidence predicts accuracy rate
Precision vs. Recall
Let's say we want to classify web pages as homepages or not
In a test set of 1K pages, there are 3 homepages
Our classifier says they are all non-homepages: 99.7% accuracy!
Need new measures for rare positive events
Precision: fraction of guessed positives which were actually positive ("fraction of retrieved instances that are relevant")
Recall: fraction of actual positives which were guessed as positive (sensitivity; "fraction of relevant instances that are retrieved")
Say we guess 5 homepages, of which 2 were actually homepages:
Precision: 2 correct / 5 guessed = 0.4
Recall: 2 correct / 3 true = 0.67
Which is more important in airport face recognition (Criminals)?
(Figure: 2×2 grid of guessed +/− versus actual +/−.)
Precision vs. Recall
Precision/recall tradeoff: often, you can trade off precision and recall
Only works well with weakly calibrated classifiers
To summarize the tradeoff:
Break-even point: the precision value when p = r
F-measure: the harmonic mean of p and r: F = 2pr / (p + r)
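The homepage numbers above, worked in a short sketch (variable names are assumptions):

    def precision_recall_f1(true_pos, guessed_pos, actual_pos):
        p = true_pos / guessed_pos   # fraction of guesses that were right
        r = true_pos / actual_pos    # fraction of true positives found
        f = 2 * p * r / (p + r)      # F-measure: harmonic mean of p and r
        return p, r, f

    print(precision_recall_f1(2, 5, 3))  # precision 0.4, recall ≈ 0.67, F ≈ 0.5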
False Positive versus False Negative
Perceptron
Linear Classifiers
Feature Vectors
Hello,
Do you want free printr
cartriges? Why pay more
when you can get them
ABSOLUTELY FREE! Just
# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...
SPAM or +
PIXEL-7,12 : 1
PIXEL-7,13 : 0
...
NUM_LOOPS : 1
...
“2”
Some (Simplified) Biology
Very loose inspiration: human neurons
Linear Classifiers
Inputs are feature values
Each feature has a weight
Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
If the activation is:
Positive, output +1
Negative, output -1
(Figure: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed, and compared to 0.)
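In code, the decision rule is just a dot product followed by a sign check (a minimal sketch; representing feature vectors as plain dicts is an assumption, not from the slides):

    def activation(weights, features):
        # w · f(x) = sum_i w_i * f_i(x); missing features count as 0
        return sum(weights.get(k, 0.0) * v for k, v in features.items())

    def classify(weights, features):
        return +1 if activation(weights, features) > 0 else -1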
Weights
Binary case: compare features to a weight vector
Learning: figure out the weight vector from examples
# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...
# free : 4
YOUR_NAME : -1
MISSPELLED : 1
FROM_FRIEND : -3
...
# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
...
Dot product positive means the positive class
Decision Rules
Binary Decision Rule
In the space of feature vectors
Examples are points
Any weight vector defines a hyperplane
One side corresponds to Y=+1
Other corresponds to Y=-1
BIAS : -3
free : 4
money : 2
...
(Figure: the decision boundary in the 2D feature space with axes "free" and "money"; one side is +1 = SPAM, the other is -1 = HAM.)
Weight Updates
Learning: Binary Perceptron
Start with weights = 0
For each training instance:
Classify with current weights
If correct (i.e., y=y*), no change!
If wrong: adjust the weight vector by adding or subtracting the feature vector: w ← w + y* · f(x) (subtract when y* is -1)
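A minimal sketch of the whole loop, reusing the classify helper from the linear-classifier sketch above (the dataset format is an assumption):

    def train_perceptron(data, passes=10):
        # data: list of (features, y_star) pairs with y_star in {+1, -1}
        weights = {}
        for _ in range(passes):
            for features, y_star in data:
                y = classify(weights, features)
                if y != y_star:                    # wrong: adjust the weights
                    for k, v in features.items():  # w <- w + y* * f(x)
                        weights[k] = weights.get(k, 0.0) + y_star * v
        return weights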
Examples: Perceptron
Separable Case
Properties of Perceptrons
Separability: true if some parameters get the training set perfectly correct
Convergence: if the training data are separable, the perceptron will eventually converge (binary case)
Mistake Bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability
Separable
Non-Separable
Examples: Perceptron
Non-Separable Case
Linear Separators
Which of these linear separators is optimal?
Support Vector Machines
Maximizing the margin: good according to intuition, theory, practice
Only support vectors matter; other training examples are ignorable
Support vector machines (SVMs) find the separator with max margin
SVM
Nearest Neighbors and Kernels
Non-Separable Data
Nearest-Neighbor Classification
Nearest neighbor for digits:
Take a new image
Compare to all training images
Assign based on the closest example
Encoding: each image is a vector of pixel intensities
What's the similarity function? The dot product of two image vectors?
Usually normalize vectors so ||x|| = 1
(Figure: training digit images labeled 0, 1, 2 and new test images to be classified.)
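A sketch of nearest-neighbor classification with this normalized dot-product similarity (pure-Python vectors for clarity; a real system would use NumPy):

    import math

    def normalize(x):
        n = math.sqrt(sum(v * v for v in x))
        return [v / n for v in x]

    def nearest_neighbor_label(train, test_x):
        # train: list of (vector, label) pairs, vectors already normalized
        test_x = normalize(test_x)
        def sim(a):
            return sum(u * v for u, v in zip(a, test_x))  # dot product
        _, best_label = max(train, key=lambda pair: sim(pair[0]))
        return best_label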
Similarity Functions
Basic Similarity
Many similarities are based on feature dot products: K(x, x') = φ(x) · φ(x') = Σ_i φ_i(x) φ_i(x')
If the features are just the pixels: K(x, x') = x · x'
Note: not all similarities are of this form
A Tale of Two Approaches…
Nearest neighbor-like approaches
Can use fancy similarity functions
Don’t actually get to do explicit learning
Perceptron-like approaches
Explicit training to reduce empirical error
Can’t use fancy similarity, only linear
Non-Linearity
Non-Linear Separators
Data that is linearly separable works out great for linear decision rules:
But what are we going to do if the dataset is just too hard?
How about… mapping data to a higher-dimensional space:
(Figure: 1D data that is not separable on the x axis becomes separable after mapping each point x to (x, x²).)
Non-Linear Separators
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Some Kernels
Kernels implicitly map the original vectors to a higher-dimensional space, take the dot product there, and hand the result back
Linear kernel: K(x, x') = x · x'
Quadratic kernel: K(x, x') = (x · x' + 1)²
RBF kernel: K(x, x') = exp(−‖x − x'‖² / (2σ²)), an infinite-dimensional representation
Discrete kernels: e.g. string kernels
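The three continuous kernels above as short functions (sigma is an assumed bandwidth parameter):

    import math

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    def linear_kernel(x, y):
        return dot(x, y)

    def quadratic_kernel(x, y):
        return (dot(x, y) + 1) ** 2

    def rbf_kernel(x, y, sigma=1.0):
        # exp(-||x - y||^2 / (2 sigma^2))
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-d2 / (2 * sigma ** 2))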
Clustering
Recap: Classification
Classification systems: Supervised learning
Make a prediction given evidence
We’ve seen several methods for this
Useful when you have labeled data
Clustering
Clustering systems: Unsupervised learning
Detect patterns in unlabeled data
E.g. group emails or search results
E.g. find categories of customers
E.g. detect anomalous program executions
Useful when you don't know what you're looking for
Requires data, but no labels
Clustering
Basic idea: group together similar instances
Example: 2D point patterns
What could "similar" mean? One option: small (squared) Euclidean distance, dist(x, y) = ‖x − y‖²
K-Means
An iterative clustering algorithm (see the sketch below):
Pick K random points as cluster centers (means)
Alternate:
Assign each data instance to the closest mean
Assign each mean to the average of its assigned points
Stop when no point's assignment changes
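A minimal K-means sketch on points given as tuples (random initialization; keeping the old mean for an empty cluster is one of several possible choices):

    import random

    def kmeans(points, k, iters=100):
        centers = random.sample(points, k)  # pick K random points as means
        for _ in range(iters):
            # Assign each data instance to the closest mean
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(p, centers[j])))
                clusters[i].append(p)
            # Assign each mean to the average of its assigned points
            new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl))
                           if cl else centers[j]
                           for j, cl in enumerate(clusters)]
            if new_centers == centers:      # nothing changed: converged
                break
            centers = new_centers
        return centers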
How many clusters to pick?
Similarity measure? It depends on the application.
About features
e.g. face descriptors for recognition
The face features, for instance, should be robust to changes in pose, facial expression, illumination, aging, occlusion, presence of glasses, etc.
(Figure: face images under different conditions should yield the same features.)
Recap: Keywords/Concepts