Cake Talk: Probability Forecasting


TRANSCRIPT

  • Slide 1/46: Reliable Probability Forecasting: a Machine Learning Perspective

    David Lindsay

    Supervisors: Zhiyuan Luo, Alex Gammerman, Volodya Vovk

  • Slide 2/46: Overview

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 3/46: Probability Forecasting

    Qualified predictions are important in many applications (especially medicine).

    Most machine learning algorithms make bare predictions.

    Those that do make qualified predictions make no claims of how effective the measures are!

  • Slide 4/46: Probability Forecasting: Generalisation of Pattern Recognition

    Goal of pattern recognition = find the best label for each new test object.

    Example, the Abdominal Pain dataset. The training set to learn from pairs each object $x_i$ (the patient's details) with a label $y_i$ (the diagnosis):

    | Object (patient details)         | Label (diagnosis) |
    |----------------------------------|-------------------|
    | Name: David, Sex: M, Height: 62  | Appendicitis      |
    | Name: Daniil, Sex: M, Height: 64 | Dyspepsia         |
    | Name: Mark, Sex: M, Height: 61   | Non-specific      |
    | ...                              | ...               |
    | Name: Sian, Sex: F, Height: 58   | Dyspepsia         |
    | Name: Wilma, Sex: F, Height: 56  | ?                 |

    The last row is the test object: its true label is unknown, or withheld from the learner. What is the true label?

  • Slide 5/46: Probability Forecasting: Generalisation of Pattern Recognition

    A probability forecast estimates the conditional probability of a label given an observed object: $\hat{P}(y \mid x) \approx \Pr(y \mid x)$.

    The learner is given the training set and a test object (Name: Helen, Sex: F, Height: 56) whose label is unknown.

    We want the learner to estimate probabilities for all possible class labels:

    $\hat{P}(\text{Appendicitis} \mid x) = 0.7$

    $\hat{P}(\text{Non-specific} \mid x) = 0.2$

    $\hat{P}(\text{Dyspepsia} \mid x) = 0.1$

    etc.

  • Slide 6/46: Probability forecasting more formally

    $X$ = object space, $Y$ = label space, $Z = X \times Y$ = example space.

    Our learner makes probability forecasts for all possible labels: given the examples $z_1, z_2, \ldots, z_n$ and a new object $x_{n+1}$, it outputs

    $\hat{P}(y_{n+1} = 1 \mid x_{n+1}), \; \hat{P}(y_{n+1} = 2 \mid x_{n+1}), \; \ldots, \; \hat{P}(y_{n+1} = |Y| \mid x_{n+1})$

    Use the probability forecasts to predict the most likely label:

    $\hat{y}_{n+1} = \arg\max_{i \in Y} \hat{P}(i \mid x_{n+1})$
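    A minimal sketch of this argmax rule in Java (the class and method names are my own illustration, not code from the talk):

```java
import java.util.Map;

/** Illustrative sketch: turn probability forecasts for all labels into a
 *  point prediction by picking the label with the highest forecast. */
public class ArgmaxPredictor {

    /** Returns the label y maximising the forecast P(y | x). */
    static String predict(Map<String, Double> forecasts) {
        String best = null;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : forecasts.entrySet()) {
            if (e.getValue() > bestP) {
                bestP = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Forecasts for the "Helen" test object from the previous slide.
        Map<String, Double> p = Map.of(
                "Appendicitis", 0.7, "Non-specific", 0.2, "Dyspepsia", 0.1);
        System.out.println(predict(p)); // prints: Appendicitis
    }
}
```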

  • Slide 7/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 8/46: Studies of Probability Forecasting

    Probability forecasting has been a well studied area since the 1970s: psychology, statistics, meteorology.

    These studies assessed two criteria of probability forecasts:

    Reliability = the probability forecasts should not lie

    Resolution = the probability forecasts are practically useful

  • Slide 9/46: Reliability

    When an event is predicted with probability $p$, it should have approximately a $1 - p$ chance of being incorrect.

    a.k.a. well calibrated; considered an asymptotic property.

    Dawid (1985) proved that no deterministic learner can be reliable for all data; still interesting to investigate.

    This property is often overlooked in practical studies!

  • Slide 10/46: Resolution

    Probability forecasts are practically useful, e.g. they can be used to rank the labels in order of likelihood!

    Closely related to classification accuracy, a common focus of machine learning.

    Separate from reliability, i.e. the two do not go hand in hand (Lindsay, 2004).

  • Slide 11/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 12/46: Experimental design

    Tested several learners on many datasets in the online setting:

    ZeroR = control

    K-Nearest Neighbour

    Neural Network

    C4.5 Decision Tree

    Naïve Bayes

    Venn Probability Machine meta-learner (see later)

  • Slide 13/46: The Online Learning Setting

    Before: 2 7 6 1 7 ? ?

    After:  2 7 6 1 7 2 ?

    The learning machine makes a prediction for the new example (label withheld).

    Then the training data for the learning machine is updated for the next trial.

    Repeat the process for all examples. (A sketch of this loop follows.)
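    A minimal sketch of the online protocol, assuming a hypothetical Learner interface (train and forecast are my own names, not WEKA's API):

```java
/** Sketch of the online setting, assuming a hypothetical Learner interface. */
interface Learner {
    void train(double[] object, int label);   // absorb one labelled example
    double[] forecast(double[] object);       // forecast P(y = j | x) for each j
}

class OnlineProtocol {
    /** Predict each example in turn, then reveal its label and update. */
    static int run(Learner learner, double[][] objects, int[] labels) {
        int errors = 0;
        for (int n = 0; n < objects.length; n++) {
            double[] p = learner.forecast(objects[n]);   // label still withheld
            int guess = 0;
            for (int j = 1; j < p.length; j++) if (p[j] > p[guess]) guess = j;
            if (guess != labels[n]) errors++;            // score the trial
            learner.train(objects[n], labels[n]);        // reveal label, update
        }
        return errors;
    }
}
```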

  • Slide 14/46: Lots of benchmark data

    Tested on data available from the UCI Machine Learning repository:

    Abdominal Pain: 6387 examples, 135 features, 9 classes, noisy

    Diabetes: 768 examples, 8 features, 2 classes

    Heart-Statlog: 270 examples, 13 features, 2 classes

    Wisconsin Breast Cancer: 685 examples, 10 features, 2 classes

    American Votes: 435 examples, 16 features, 2 classes

    Lymphography: 148 examples, 18 features, 4 classes

    Credit Card Applications: 690 examples, 15 features, 2 classes

    Iris Flower: 150 examples, 4 features, 3 classes

    And many more.

  • Slide 15/46: Programs

    Extended the WEKA data mining system implemented in Java:

    Added the VPM meta-learner to the existing library of algorithms

    Allowed learners to be tested in the online setting

    Created Matlab scripts to easily create plots (see later)

  • Slide 16/46: Results, papers and website

    All results that I discuss today can be found in my 3 tech reports:

    The Probability Calibration Graph - a useful visualisation of the reliability of probability forecasts, Lindsay (2004), CLRC-TR-04-01

    Multi-class probability forecasting using the Venn Probability Machine - a comparison with traditional machine learning methods, Lindsay (2004), CLRC-TR-04-02

    Rapid implementation of Venn Probability Machines, Lindsay (2004), CLRC-TR-04-03

    And on my web site: http://www.david-lindsay.co.uk/research.html

  • Slide 17/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 18/46: Loss Functions

    Square loss: $L_{sq}(n) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in Y} \left( I_{\{y_i = j\}} - \hat{p}_{i,j} \right)^2$

    Log loss: $L_{log}(n) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j \in Y} I_{\{y_i = j\}} \log \hat{p}_{i,j}$

    There are many other possible loss functions.

    DeGroot and Fienberg (1982) showed that all loss functions measure a mixture of reliability and resolution.

    Log loss punishes more harshly: the learner is forced to spread its bets.
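    A minimal sketch of the two losses as written above, assuming p[i][j] holds the forecast $\hat{p}_{i,j}$ and y[i] the true label index (the 1/n averaging follows the formulas as reconstructed here):

```java
/** Sketch: square and log loss of probability forecasts over n trials. */
class LossFunctions {

    static double squareLoss(double[][] p, int[] y) {
        double s = 0;
        for (int i = 0; i < y.length; i++)
            for (int j = 0; j < p[i].length; j++) {
                double ind = (y[i] == j) ? 1.0 : 0.0;   // indicator I{y_i = j}
                s += (ind - p[i][j]) * (ind - p[i][j]);
            }
        return s / y.length;
    }

    static double logLoss(double[][] p, int[] y) {
        double s = 0;
        for (int i = 0; i < y.length; i++)
            s -= Math.log(p[i][y[i]]);   // only the true label's term survives
        return s / y.length;             // infinite if a true label was given probability 0
    }
}
```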

  • Slide 19/46: ROC Curves

    [ROC curve: Naïve Bayes on the Abdominal Pain data set]

    1. The graph shows the trade-off between false and true positive predictions.

    2. Want the curve to be as close to the upper left corner as possible (away from the diagonal).

    3. My results show that this graph tests resolution.

    4. The area under the curve provides a measure of the quality of the probability forecasts.
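    For point 4, one standard way to compute the area under the ROC curve treats it as the probability that a randomly chosen positive example receives a higher forecast than a randomly chosen negative one; a minimal binary-case sketch (ties ignored):

```java
import java.util.Arrays;

/** Sketch: area under the ROC curve for binary forecasts, computed as the
 *  probability that a random positive is scored above a random negative. */
class RocArea {
    static double auc(double[] scores, boolean[] positive) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(scores[a], scores[b]));
        long pos = 0, neg = 0, ordered = 0;  // negatives seen so far, correct pairs
        for (int i : idx) {
            if (positive[i]) { pos++; ordered += neg; } else neg++;
        }
        return (pos == 0 || neg == 0) ? Double.NaN
                                      : (double) ordered / (pos * neg);
    }
}
```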

  • Slide 20/46

  • Slide 21/46: Problems with Traditional Assessment

    Loss functions and ROC curves give more information than error rate about the quality of probability forecasts.

    But:

    loss functions = a mixture of resolution and reliability

    ROC curve = measures resolution

    We don't have any method of solely assessing reliability.

    We don't have a method of telling if probability forecasts are over- or under-estimated.

  • Slide 22/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 23/46: Inspiration for PCG (Meteorology)

    Murphy & Winkler (1977): calibration data for precipitation forecasts.

    Reliable points lie close to the diagonal.

  • Slide 24/46: A PCG plot of ZeroR on Abdominal Pain

    [PCG plot: x-axis = predicted probability, y-axis = empirical frequency of being correct, with the line of calibration and the PCG coordinates marked.]

    Reliability: the PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable!

    The plot may not span the whole axis: ZeroR makes no predictions with high probability.
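    A crude binned approximation of PCG coordinates: bucket the forecasts and compare each bucket's mean forecast with its empirical frequency of being correct. This simplification is mine; the PCG proper is defined in Lindsay (2004), CLRC-TR-04-01.

```java
/** Sketch: a crude binned approximation to PCG coordinates. */
class CalibrationBins {

    /** Returns [meanForecast, empiricalFrequency] per bin (NaN if empty). */
    static double[][] bins(double[] forecast, boolean[] correct, int k) {
        double[] sumP = new double[k], hits = new double[k], n = new double[k];
        for (int i = 0; i < forecast.length; i++) {
            int b = Math.min((int) (forecast[i] * k), k - 1);  // which bucket
            sumP[b] += forecast[i];
            if (correct[i]) hits[b]++;
            n[b]++;
        }
        double[][] out = new double[k][2];
        for (int b = 0; b < k; b++) {
            out[b][0] = n[b] > 0 ? sumP[b] / n[b] : Double.NaN;
            out[b][1] = n[b] > 0 ? hits[b] / n[b] : Double.NaN;
        }
        return out;  // reliable forecasts give points near the diagonal
    }
}
```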

  • Slide 25/46: PCG, a visualisation tool and measure of reliability

    PCG statistics:

    |                    | Naïve Bayes | VPM Naïve Bayes |
    |--------------------|-------------|-----------------|
    | Total              | 2764.5      | 496.7           |
    | Mean               | 0.0483      | 0.0087          |
    | Standard deviation | 0.0757      | 0.0112          |
    | Max                | 0.4203      | 0.1017          |
    | Min                | 4.9e-17     | 9.2e-8          |

    Naïve Bayes over- and under-estimates its probabilities, much like real doctors! It is unreliable: a forecast of 0.9 only has a 0.55 chance of being right (over-estimate), and a forecast of 0.1 has a 0.3 chance of being right (under-estimate).

    VPM is reliable, as its PCG follows the diagonal!

  • Slide 26/46: Learners predicting like people!

    [PCG plots: Naïve Bayes vs. people]

    There is lots of psychological research showing that people make unreliable probability forecasts.

  • Slide 27/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 28/46: Table comparing scores with PCG

    | Algorithm       | Error     | Sqr Loss  | Log Loss  | ROC Area  | PCG         |
    |-----------------|-----------|-----------|-----------|-----------|-------------|
    | VPM C4.5        | 40.7 (8)  | 0.54 (5)  | 0.8 (4)   | 0.76 (1)  | 838.1 (4)   |
    | Naïve Bayes     | 29.2 (2)  | 0.50 (4)  | 1.3 (7)   | 0.72 (5)  | 2764.5 (7)  |
    | VPM Naïve Bayes | 28.9 (1)  | 0.44 (1)  | 0.6 (1)   | 0.75 (2)  | 496.7 (1)   |
    | 10-NN           | 33.4 (4)  | 1.0 (11)  | 2.6 (10)  | 0.54 (10) | 5062.9 (11) |
    | 20-NN           | 33.4 (4)  | 0.96 (10) | 2.2 (9)   | 0.55 (9)  | 4492.7 (10) |
    | C4.5            | 39.6 (7)  | 0.67 (7)  | 3.3 (11)  | 0.57 (8)  | 3481.2 (8)  |
    | Neural Net      | 30.5 (3)  | 0.45 (2)  | 0.72 (2)  | 0.75 (3)  | 1320.5 (6)  |
    | 30-NN           | 34.3 (5)  | 0.47 (3)  | 0.73 (3)  | 0.74 (4)  | 921.2 (5)   |
    | VPM 1-NN        | 41.6 (9)  | 0.58 (6)  | 0.9 (5)   | 0.61 (6)  | 554.6 (2)   |
    | 1-NN            | 34.6 (6)  | 0.73 (8)  | 2.1 (8)   | 0.59 (7)  | 4307.5 (9)  |
    | ZeroR           | 55.6 (10) | 0.74 (9)  | 1.1 (6)   | 0.49 (11) | 678.6 (3)   |

    (Ranks in parentheses.)

  • Slide 29/46: Correlations of scores

    | Scores                  | Corr. coeff. | Interpretation    |
    |-------------------------|--------------|-------------------|
    | PCG vs. Sqr Reliability | 0.76         | Direct, strong    |
    | PCG vs. Sqr Resolution  | 0.04         | Direct, none      |
    | PCG vs. Error           | 0.26         | Direct, weak      |
    | ROC vs. Sqr Reliability | -0.1         | Inverse, none     |
    | ROC vs. Sqr Resolution  | 0.67         | Direct, strong    |
    | ROC vs. Error           | -0.52        | Inverse, moderate |

  • Slide 30/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 31/46: What is the VPM meta-learner?

    The VPM sits on top of an existing learner (the VPM meta-learning framework) to complement its predictions with probability estimates.

    Volodya's VPM:

    1. Predicts a label

    2. Produces upper (u) and lower (l) bounds for the predicted label only

    My VPM extension:

    1. Extracts more information

    2. Produces a probability forecast for all possible labels

    3. Predicts a label using these probability forecasts

    4. Produces Volodya's bounds as well!

  • Slide 32/46: Volodya's original use of VPM

    [Plot: error rate and bounds against online trial number. The upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data.]

    |           | Errors | Rate  |
    |-----------|--------|-------|
    | Up Error  | 2216.5 | 34.7% |
    | Error     | 1835   | 28.9% |
    | Low Error | 1414.1 | 22.1% |

  • Slide 33/46: Output from VPM compared with that of the original underlying learner

    Key: Predicted = underlined, Actual = ...

    Naïve Bayes:

    | Trial# | Appx    | Div.   | Perf. Pept. | Non. Spec | Choli  | Intest obstr | Pancr   | Renal.  | Dysp.  | Up | Low |
    |--------|---------|--------|-------------|-----------|--------|--------------|---------|---------|--------|----|-----|
    | 5831   | 0.93    | 2.9e-9 | 1.7e-13     | 0.07      | 1.3e-9 | 2.2e-9       | 4.0e-11 | 6.3e-10 | 7.6e-9 | NA | NA  |
    | 2490   | 9.4e-5  | 0.01   | 0.17        | 2.3e-5    | 0.16   | 0.46         | 0.2     | 2.2e-7  | 2.2e-4 | NA | NA  |
    | 1653   | 3.08e-9 | 4.5e-6 | 3.3e-6      | 4.4e-5    | 0.99   | 4.2e-3       | 3.4e-3  | 4.1e-10 | 1.3e-4 | NA | NA  |

    VPM Naïve Bayes:

    | Trial# | Appx | Div. | Perf. Pept. | Non. Spec | Choli | Intest obstr | Pancr | Renal. | Dysp. | Up   | Low  |
    |--------|------|------|-------------|-----------|-------|--------------|-------|--------|-------|------|------|
    | 5831   | 0.53 | 0.01 | 0.0         | 0.42      | 0.01  | 0.01         | 0.0   | 0.01   | 0.01  | 0.68 | 0.41 |
    | 2490   | 0.02 | 0.03 | 0.10        | 0.07      | 0.05  | 0.15         | 0.08  | 0.09   | 0.4   | 0.71 | 0.07 |
    | 1653   | 0.03 | 0.0  | 0.03        | 0.08      | 0.73  | 0.0          | 0.04  | 0.01   | 0.09  | 0.82 | 0.08 |

    (Up/Low are the VPM's bounds; the underlying Naïve Bayes provides none.)

  • Slide 34/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 35/46: ZeroR

    [PCG plots of ZeroR on Heart Disease, Lymphography and Diabetes]

    ZeroR outputs probability forecasts which are mere label frequencies.

    ZeroR predicts the majority class label at each trial.

    It uses no information about the objects in its learning: the simplest of all learners (sketched below).

    Accuracy is poor, but reliability is good.
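    A sketch of ZeroR as a probability forecaster, directly from the description above (class and method names are illustrative):

```java
/** Sketch of ZeroR: forecasts are the label frequencies observed so far. */
class ZeroR {
    private final int[] counts;
    private int seen = 0;

    ZeroR(int numLabels) { counts = new int[numLabels]; }

    void train(int label) { counts[label]++; seen++; }

    /** P(y = j) = frequency of label j so far (uniform before any training). */
    double[] forecast() {
        double[] p = new double[counts.length];
        for (int j = 0; j < p.length; j++)
            p[j] = seen == 0 ? 1.0 / p.length : (double) counts[j] / seen;
        return p;
    }
}
```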

  • Slide 36/46: K-NN

    [PCG plots of 10-NN, 20-NN and 30-NN]

    K-NN finds the subset of the K closest (nearest neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset (see the sketch below).

    It acts like a more sophisticated version of ZeroR that uses the information held in the object.

    An appropriate choice of K must be made to obtain reliable probability forecasts (it depends on the data).
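    A sketch of the K-NN forecaster just described, using squared Euclidean distance as the metric (any distance metric would do; names illustrative):

```java
import java.util.Arrays;

/** Sketch: K-NN as a probability forecaster. Find the K nearest training
 *  examples, then forecast with the label frequencies among them. */
class KnnForecaster {
    static double[] forecast(double[][] X, int[] y, int numLabels,
                             double[] query, int k) {
        Integer[] idx = new Integer[X.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by distance to the query object.
        Arrays.sort(idx, (a, b) ->
                Double.compare(dist(X[a], query), dist(X[b], query)));
        double[] p = new double[numLabels];
        for (int i = 0; i < k; i++) p[y[idx[i]]] += 1.0 / k;  // label frequencies
        return p;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;  // squared Euclidean distance (monotone in the true distance)
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```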


  • Slide 37/46: Traditional Learners and VPM

    Traditional learners can be very unreliable (yet accurate); it depends on the data.

    My research shows empirically that VPM is reliable.

    And it can recalibrate a learner's original probability forecasts to make them more reliable!

    The improvement in reliability often comes without detriment to classification accuracy.

    [PCG plots: Naïve Bayes vs. VPM Naïve Bayes, C4.5 vs. VPM C4.5, Neural Net vs. VPM Neural Net, 1-NN vs. VPM 1-NN]

  • Slide 38/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 39/46: Psychological Heuristics

    When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones:

    Availability - an event is predicted more likely to occur if it has occurred frequently in the past.

    Representativeness - one compares the essential features of the event to those of the structure of previous events.

    Simulation - the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state.

  • Slide 40/46: Interpretation of reliable learners using heuristics

    ZeroR, K-NN and VPM learners are reliable probability forecasters.

    We can identify heuristics in these learning algorithms.

    Remember, the psychological research states: more heuristics, more reliable forecasts.

  • Slide 41/46: Psychological Interpretation of ZeroR

    The simplest of all reliable probability forecasters uses 1 heuristic:

    The learner merely counts the labels it has observed so far, and uses the frequencies of labels as its forecasts (Availability).

  • Slide 42/46: Psychological Interpretation of K-NN

    More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics:

    It uses the distance metric to find the subset of the K closest examples in the training set (Representativeness).

    Then it counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability).

  • Slide 43/46: Psychological Interpretation of VPM

    Even more sophisticated, the VPM meta-learner uses all 3 heuristics:

    The VPM tries each new test example with all possible classifications (Simulation).

    Then, under each tentative simulation, it clusters training examples which are similar into groups (Representativeness).

    Finally, the VPM calculates the frequency of labels in each of these groups to make its forecasts (Availability). (A sketch of these three steps follows.)
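    A sketch of these three steps, assuming a simple 1-nearest-neighbour taxonomy (my choice for illustration; a VPM can use many taxonomies). It stops at the per-label frequency vectors; a full VPM aggregates them into the final forecast and bounds:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the VPM's three heuristic steps with a 1-NN taxonomy. */
class VennSketch {

    /** Row t = label frequencies in the test object's group when the test
     *  object is tentatively labelled t. */
    static double[][] multiProbability(double[][] X, int[] y,
                                       int numLabels, double[] test) {
        double[][] rows = new double[numLabels][];
        for (int tentative = 0; tentative < numLabels; tentative++) {
            // Simulation: try the test example with this classification.
            List<double[]> objs = new ArrayList<>(List.of(X));
            objs.add(test);
            int[] labels = new int[y.length + 1];
            System.arraycopy(y, 0, labels, 0, y.length);
            labels[y.length] = tentative;

            // Representativeness: group examples by the label of their
            // nearest neighbour in the augmented set (the taxonomy).
            int n = objs.size();
            int[] group = new int[n];
            for (int i = 0; i < n; i++) {
                int nn = -1;
                double best = Double.MAX_VALUE;
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    double d = dist(objs.get(i), objs.get(j));
                    if (d < best) { best = d; nn = j; }
                }
                group[i] = labels[nn];
            }

            // Availability: label frequencies in the test object's group.
            double[] freq = new double[numLabels];
            int size = 0;
            for (int i = 0; i < n; i++)
                if (group[i] == group[n - 1]) { freq[labels[i]]++; size++; }
            for (int j = 0; j < numLabels; j++) freq[j] /= size;
            rows[tentative] = freq;
        }
        return rows;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;  // squared Euclidean distance
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```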

  • Slide 44/46: Theoretical justifications

    ZeroR can be proven to be asymptotically reliable (and experiments show it does well on finite data).

    K-NN has lots of theory, e.g. Stone (1977), to support its convergence to the true probability distribution.

    VPM has a lot of theoretical justification for finite data, using martingales.

  • Slide 45/46: Take home points

    Probability forecasting is useful for real life applications, especially medicine.

    We want learners to be reliable and accurate.

    The PCG can be used to check reliability.

    ZeroR, K-NN and VPM provide consistently reliable probability forecasts.

    Traditional learners (Naïve Bayes, Neural Net and Decision Tree) can provide unreliable forecasts.

    VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.

  • Slide 46/46: Acknowledgments

    Supervision: Alex Gammerman, Volodya Vovk, Zhiyuan Luo

    Mathematical advice: Daniil Riabko, Volodya Vovk, Teo Sharia

    Proofreading: Zhiyuan Luo, Siân Cox

    Graphics & design: Siân Cox

    Catering: Siân Cox

    Fin