
KI1 / L. Schomaker - 2007

Learning from observations (b)

• How good is a machine learner?

• Experimentation protocols

• Performance measures

• Academic benchmarks vs Real Life

KI1 / L. Schomaker - 2007

Experimentation protocols

• Fooling yourself: training a decision tree on 100 example instances from Earth and then sending the robot to Mars

• Training set / test set distinction:
  – both must be of sufficient size:
  – large training set for a reliable ‘h’ (coefficients etc.)
  – large test set for a reliable prediction of real-life performance

KI1 / L. Schomaker - 2007

Experimentation protocols

• One training set / one test set for a four-year PhD project: still fooling yourself!

• Solution:
  – training set
  – test set
  – final evaluation set with real-life data

• k-Fold evaluation: k subsets from a large data base, measuring the standard deviation of performance over the experiments (sketched below)
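A minimal k-fold sketch in Python (my illustration, not course code): the data is split into k folds, the learner is trained on k-1 folds and tested on the held-out fold, and the mean and standard deviation of the scores are reported. train() and accuracy() are hypothetical placeholders for the learner and the performance measure.

import random
import statistics

def k_fold_evaluate(samples, labels, k, train, accuracy, seed=0):
    """Split the data into k folds; train on k-1 folds, test on the
    remaining fold; return mean and standard deviation of the k scores."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal subsets
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        h = train([samples[i] for i in train_idx],      # train() is hypothetical
                  [labels[i] for i in train_idx])
        scores.append(accuracy(h,                       # accuracy() is hypothetical
                               [samples[i] for i in fold],
                               [labels[i] for i in fold]))
    return statistics.mean(scores), statistics.stdev(scores)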

KI1 / L. Schomaker - 2007

Experimentation protocols

• What to do if you don’t have enough data?

• Solution:
  – Leave-one-out: use N-1 samples for training,
  – use the Nth sample for testing,
  – repeat for all samples,
  – compute the average performance
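Leave-one-out is the extreme case k = N. A sketch under the same assumptions (hypothetical train() and classify() functions):

def leave_one_out(samples, labels, train, classify):
    """Train on N-1 samples, test on the held-out sample,
    repeat for every sample, and return the average accuracy."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i+1:]
        train_y = labels[:i] + labels[i+1:]
        h = train(train_x, train_y)                     # hypothetical learner
        correct += (classify(h, samples[i]) == labels[i])
    return correct / len(samples)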

KI1 / L. Schomaker - 2007

Performance

• Example: % correctly classified samples (P)

• Ptrain: performance on the training set

• Ptest: performance on the test set

• Preal ≈ Ptest: the test set is used to predict real-life performance

KI1 / L. Schomaker - 2007

Performance, two-class

IS \ SAYS    YES              NO                 total
YES          #correct hits    #misses            #is_Yes
NO           #false hits      #correct rejects   #is_No
total        #says_Yes        #says_No           #N_samples

KI1 / L. Schomaker - 2007

Performance, two-class

IS \ SAYS    YES              NO                 total
YES          %correct hits    %misses            %is_Yes
NO           %false hits      %correct rejects   %is_No
total        %says_Yes        %says_No           100 %

KI1 / L. Schomaker - 2007

Performance, two-class

IS \ SAYS    YES              NO                 total
YES          #correct hits    #misses            #is_Yes
NO           #false hits      #correct rejects   #is_No
total        #says_Yes        #says_No           #N_samples

Precision = 100 * #correct_hits / #says_Yes [%]
Recall    = 100 * #correct_hits / #is_Yes   [%]
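As an illustration (not from the slides), the same precision and recall can be computed directly from raw YES/NO decisions:

def two_class_scores(is_yes, says_yes):
    """is_yes, says_yes: parallel lists of 0/1 values (truth vs. classifier output).
    Returns precision and recall in percent, as defined above."""
    correct_hits = sum(t and s for t, s in zip(is_yes, says_yes))
    n_says_yes   = sum(says_yes)
    n_is_yes     = sum(is_yes)
    precision = 100.0 * correct_hits / n_says_yes if n_says_yes else 0.0
    recall    = 100.0 * correct_hits / n_is_yes   if n_is_yes   else 0.0
    return precision, recall

# example: 3 targets, the classifier says YES four times, two of them correct
print(two_class_scores([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 1, 0]))   # (50.0, 66.66...)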

KI1 / L. Schomaker - 2007

Performance, multi-class

IS \ SAYS    A          B          C          ...        Rej                total
A            #A ok      .          .          .          .                  #is A
B            .          #B ok      .          .          .                  #is B
C            .          .          #C ok      .          .                  #is C
...          .          .          .          .          .                  #is ...
Noise        .          .          .          .          #correct rejects   #is Noise
total        #says A    #says B    #says C    #says ...  #says Reject       #N samples

KI1 / L. Schomaker - 2007

Performance, multi-class (example)

IS \ SAYS    A          B          C          ...        Rej            total
A            456        0          2          34         5              #is A
B            0          343        .          .          .              #is B
C            20         .          201        .          .              #is C
...          0          .          .          603        .              #is ...
Noise        1          .          .          .          60             #is Noise
total        #says A    #says B    #says C    #says ...  #says Reject   #N samples

Confusion matrix
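A sketch of how such a matrix can be tallied from (true label, predicted label) pairs; the class names and data below are made up:

from collections import Counter

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows = what the sample IS, columns = what the classifier SAYS
    (including a 'Reject' column, as on the slide)."""
    counts = Counter(zip(true_labels, predicted_labels))
    return {t: {p: counts[(t, p)] for p in classes} for t in classes}

classes = ['A', 'B', 'C', 'Reject']
truth   = ['A', 'A', 'B', 'C', 'C', 'A']
said    = ['A', 'Reject', 'B', 'C', 'A', 'A']
matrix  = confusion_matrix(truth, said, classes)
print(matrix['A'])    # {'A': 2, 'B': 0, 'C': 0, 'Reject': 1}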

KI1 / L. Schomaker - 2007

Rankings / hit lists

• Given a query Q, the system returns a hit list of matches M: an ordered set, with instances i in decreasing likelihood of correctness

• Precision: proportion of correct instances among the matches M in the hit list

• Recall: proportion of correct instances found, out of the total number of target samples in the database
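For a ranked hit list these are often evaluated at a cut-off of the top k results. A small sketch (the query, hit list and relevance set are invented):

def precision_recall_at_k(hitlist, relevant, k):
    """hitlist: matches ordered by decreasing likelihood of correctness.
    relevant: the set of all correct instances in the database."""
    top_k = hitlist[:k]
    hits = sum(1 for m in top_k if m in relevant)
    precision = hits / k
    recall    = hits / len(relevant)
    return precision, recall

hitlist  = ['doc7', 'doc2', 'doc9', 'doc4', 'doc1']
relevant = {'doc2', 'doc4', 'doc5'}
print(precision_recall_at_k(hitlist, relevant, 5))   # (0.4, 0.666...)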

KI1 / L. Schomaker - 2007

Function approximation

• For, e.g., regression models, learning an ‘analog’ output

• Example: target function t(x)

• Obtained output function o(x)

• For performance evaluation, compute the root-mean-square error (RMS error):

RMS = sqrt( Σ_x ( o(x) - t(x) )² / N )
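A minimal sketch of this computation over N sample points (the example functions are my own, not course data):

import math

def rms_error(o, t, xs):
    """Root-mean-square error between obtained output o(x) and target t(x)
    over the sample points xs."""
    return math.sqrt(sum((o(x) - t(x)) ** 2 for x in xs) / len(xs))

# example: target is x^2, the model is off by a constant 0.1
print(rms_error(lambda x: x * x + 0.1, lambda x: x * x, [0.0, 0.5, 1.0]))  # 0.1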

KI1 / L. Schomaker - 2007

Learning curves

[Figure: learning curves of P [% OK] against #epochs (presentations of the training set). Performance on the training set keeps rising towards 100%; performance on the test set peaks earlier and then drops again. Stop training at that peak: beyond it there is no generalization, only overfit.]

KI1 / L. Schomaker - 2007

Overfitting

• The learner learns the training set

• Even perfectly, like a lookup table (LUT) memorizing the training instances,

• but without correctly handling unseen data

• Usual cause: more free parameters in the learner than data values to constrain them

KI1 / L. Schomaker - 2007

Preventing Overfit

• For good generalization:

– number of training examples must be much larger than the number of attributes (features):

Nsamples / Nattr >> 1

KI1 / L. Schomaker - 2007

Preventing Overfit

• For good generalization:

– also: Nsamples >> Ncoefficients

e.g.: fitting a straight line (a linear equation) has 2 coefficients and needs at least 2 data points in 2D

Coefficients: model parameters, weights etc.

KI1 / L. Schomaker - 2007

Preventing Overfit

• For good generalization:

– Ndatavalues >> Ncoefficients

Coefficients: model parameters, weights etc.

Ndatavalues = Nsamples * Nattributes

e.g.: use Ndatavalues/Ncoefficients for system comparison
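A tiny sketch of this rule of thumb, comparing two hypothetical systems:

def data_to_coefficient_ratio(n_samples, n_attributes, n_coefficients):
    """Ndatavalues / Ncoefficients, with Ndatavalues = Nsamples * Nattributes."""
    return (n_samples * n_attributes) / n_coefficients

# hypothetical comparison: 1000 samples with 20 attributes each
print(data_to_coefficient_ratio(1000, 20, 50))     # 400.0  -> comfortable margin
print(data_to_coefficient_ratio(1000, 20, 15000))  # 1.33.. -> risk of overfit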

KI1 / L. Schomaker - 2007

Example: machine-print OCR

• Very accurate today, but:
• needs 5000 examples of each character,
• printed on ink-jet, laser and matrix printers, and fax copies,
• from many brands of printers,
• on many paper types,
• for one font & point size!

[Figure: printed samples of the character ‘A’]

KI1 / L. Schomaker - 2007

Ensemble methods

• Boosting:
  – train a learner h[m]
  – weigh each of the instances
  – weigh the method m
  – train a new learner h[m+1]
  – perform majority voting on the ensemble opinions (a sketch follows below)

KI1 / L. Schomaker - 2007

The advantage of democracy: partly intelligent, independent deciders
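A compact, simplified sketch of boosting in the AdaBoost style, using decision stumps as weak learners (my own illustration; the exact weighting scheme is an assumption, not taken from the slides):

import math

def train_stump(xs, ys, weights):
    """Weighted decision stump on one feature: the (feature, threshold, polarity)
    with the lowest weighted error. xs: feature vectors, ys: labels in {+1, -1}."""
    best = None
    for f in range(len(xs[0])):
        for thr in sorted(set(x[f] for x in xs)):
            for polarity in (+1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if polarity * (1 if x[f] >= thr else -1) != y)
                if best is None or err < best[0]:
                    best = (err, f, thr, polarity)
    err, f, thr, polarity = best
    return (f, thr, polarity), err

def stump_predict(stump, x):
    f, thr, polarity = stump
    return polarity * (1 if x[f] >= thr else -1)

def adaboost(xs, ys, rounds):
    """Repeatedly train a weak learner on re-weighted instances,
    weigh each learner by its (low) error, and combine by weighted vote."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []                               # list of (alpha, stump)
    for _ in range(rounds):
        stump, err = train_stump(xs, ys, weights)
        err = max(err, 1e-10)                   # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # increase the weight of misclassified instances, decrease the rest
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def vote(ensemble, x):
    """Weighted 'majority vote' of the ensemble opinions."""
    s = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if s >= 0 else -1

# toy example: two 1-D clusters
xs = [(0,), (1,), (2,), (8,), (9,), (10,)]
ys = [-1, -1, -1, 1, 1, 1]
ensemble = adaboost(xs, ys, rounds=5)
print([vote(ensemble, x) for x in xs])   # [-1, -1, -1, 1, 1, 1]

Each stump alone is only partly intelligent; the weighted vote over many of them is what gives the ensemble its strength.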

KI1 / L. Schomaker - 2007

Learning methods

• Gradient descent, parameter finding (multi-layer perceptron, regression) — see the sketch after this list

• Expectation Maximization (smart Monte Carlo search for best model, given the data)

• Knowledge-based, symbolic learning (Version Spaces)

• Reinforcement learning

• Bayesian learning
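As an illustration of the first item, a minimal gradient-descent sketch that fits a one-dimensional linear model y = a*x + b (toy data, not course code):

def gradient_descent_line(xs, ys, rate=0.01, epochs=5000):
    """Fit y = a*x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradient of (1/n) * sum (a*x + b - y)^2 with respect to a and b
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (a * x + b - y)     for x, y in zip(xs, ys)) / n
        a -= rate * grad_a
        b -= rate * grad_b
    return a, b

print(gradient_descent_line([0, 1, 2, 3], [1, 3, 5, 7]))   # approaches a = 2, b = 1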

KI1 / L. Schomaker - 2007

Memory-based ‘learning’

• Lookup-table (LUT)

• Nearest neighbour: argmin(dist)

• k-Nearest neighbour: majority( N_argmin(dist, k) ), i.e. a majority vote over the k nearest samples
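A sketch of the k-nearest-neighbour rule written out in plain Python (squared Euclidean distance, invented toy data):

from collections import Counter

def knn_classify(query, samples, labels, k):
    """Return the majority label among the k training samples
    nearest to the query (argmin of distance)."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    ranked = sorted(range(len(samples)), key=lambda i: dist(query, samples[i]))
    nearest = [labels[i] for i in ranked[:k]]
    return Counter(nearest).most_common(1)[0][0]

samples = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels  = ['no', 'no', 'yes', 'yes']
print(knn_classify((5, 6), samples, labels, k=3))   # 'yes'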

KI1 / L. Schomaker - 2007

Unsupervised learning

• K-means clustering (see the sketch after this list)

• Kohonen self-organizing maps (SOM)

• Hierarchical clustering
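A minimal k-means sketch (unsupervised: no labels are used; the naive initialization and toy points are my own):

def k_means(points, k, iterations=100):
    """Assign each point to the nearest centroid, then move each centroid
    to the mean of its points; repeat."""
    centroids = points[:k]                          # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
                     for cl, c in zip(clusters, centroids)]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (8, 9), (9, 8)]
print(k_means(points, k=2)[0])   # two centroids, near (0.33, 0.33) and (8.67, 8.67)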

KI1 / L. Schomaker - 2007

Summary (1)

• Learning is needed for unknown environments and/or lazy designers

• Learning agent = performance element + learning element

• Learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation

KI1 / L. Schomaker - 2007

Summary (2)

• For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples

• Decision tree learning using information gain: entropy-based

• Learning performance = prediction accuracy measured on test set(s)