Cake Talk: Probability Forecasting


TRANSCRIPT

  • Slide 1/46: Reliable Probability Forecasting: a Machine Learning Perspective

    David Lindsay

    Supervisors: Zhiyuan Luo, Alex Gammerman, Volodya Vovk

  • Slide 2/46: Overview

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 3/46: Probability Forecasting

    Qualified predictions are important in many applications (especially medicine).

    Most machine learning algorithms make bare predictions.

    Those that do make qualified predictions make no claims of how effective the measures are!

  • Slide 4/46: Probability Forecasting: Generalisation of Pattern Recognition

    Goal of pattern recognition = find the best label for each new test object.

    Example, the Abdominal Pain dataset. The training set to learn from pairs each object $x_i$ (the patient's details) with a label $y_i$ (the diagnosis):

    | Object (patient details)         | Label (diagnosis) |
    |----------------------------------|-------------------|
    | Name: David, Sex: M, Height: 62  | Appendicitis      |
    | Name: Daniil, Sex: M, Height: 64 | Dyspepsia         |
    | Name: Mark, Sex: M, Height: 61   | Non-specific      |
    | ...                              | ...               |
    | Name: Sian, Sex: F, Height: 58   | Dyspepsia         |
    | Name: Wilma, Sex: F, Height: 56  | ?                 |

    The last row is the test object: its true label is unknown, or withheld from the learner. What is the true label?

  • Slide 5/46: Probability Forecasting: Generalisation of Pattern Recognition

    A probability forecast estimates the conditional probability of a label given an observed object: $\hat{P}(y \mid x) \approx \Pr(y \mid x)$.

    The learner is given the training set and a test object (Name: Helen, Sex: F, Height: 56) whose label is unknown.

    We want the learner to estimate probabilities for all possible class labels:

    $\hat{P}(\text{Appendicitis} \mid x) = 0.7$

    $\hat{P}(\text{Non-specific} \mid x) = 0.2$

    $\hat{P}(\text{Dyspepsia} \mid x) = 0.1$

    etc.

  • Slide 6/46: Probability forecasting more formally

    $X$ = object space, $Y$ = label space, $Z = X \times Y$ = example space.

    Our learner makes probability forecasts for all possible labels: given the examples $z_1, z_2, \ldots, z_n$ and a new object $x_{n+1}$, it outputs

    $\hat{P}(y_{n+1} = 1 \mid x_{n+1}), \; \hat{P}(y_{n+1} = 2 \mid x_{n+1}), \; \ldots, \; \hat{P}(y_{n+1} = |Y| \mid x_{n+1})$

    Use the probability forecasts to predict the most likely label:

    $\hat{y}_{n+1} = \arg\max_{i \in Y} \hat{P}(i \mid x_{n+1})$
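    A minimal sketch of this argmax rule in Java (the class and method names are my own illustration, not code from the talk):

```java
import java.util.Map;

/** Illustrative sketch: turn probability forecasts for all labels into a
 *  point prediction by picking the label with the highest forecast. */
public class ArgmaxPredictor {

    /** Returns the label y maximising the forecast P(y | x). */
    static String predict(Map<String, Double> forecasts) {
        String best = null;
        double bestP = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : forecasts.entrySet()) {
            if (e.getValue() > bestP) {
                bestP = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Forecasts for the "Helen" test object from the previous slide.
        Map<String, Double> p = Map.of(
                "Appendicitis", 0.7, "Non-specific", 0.2, "Dyspepsia", 0.1);
        System.out.println(predict(p)); // prints: Appendicitis
    }
}
```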

  • Slide 7/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 8/46: Studies of Probability Forecasting

    Probability forecasting has been a well studied area since the 1970s: psychology, statistics, meteorology.

    These studies assessed two criteria of probability forecasts:

    Reliability = the probability forecasts should not lie

    Resolution = the probability forecasts are practically useful

  • Slide 9/46: Reliability

    When an event is predicted with probability $p$, it should have approximately a $1 - p$ chance of being incorrect.

    a.k.a. well calibrated; considered an asymptotic property.

    Dawid (1985) proved that no deterministic learner can be reliable for all data; still interesting to investigate.

    This property is often overlooked in practical studies!

  • Slide 10/46: Resolution

    Probability forecasts are practically useful, e.g. they can be used to rank the labels in order of likelihood!

    Closely related to classification accuracy, a common focus of machine learning.

    Separate from reliability, i.e. the two do not go hand in hand (Lindsay, 2004).

  • Slide 11/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 12/46: Experimental design

    Tested several learners on many datasets in the online setting:

    ZeroR = control

    K-Nearest Neighbour

    Neural Network

    C4.5 Decision Tree

    Naïve Bayes

    Venn Probability Machine meta-learner (see later)

  • Slide 13/46: The Online Learning Setting

    Before: 2 7 6 1 7 ? ?

    After:  2 7 6 1 7 2 ?

    The learning machine makes a prediction for the new example (label withheld).

    Then the training data for the learning machine is updated for the next trial.

    Repeat the process for all examples. (A sketch of this loop follows.)
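    A minimal sketch of the online protocol, assuming a hypothetical Learner interface (train and forecast are my own names, not WEKA's API):

```java
/** Sketch of the online setting, assuming a hypothetical Learner interface. */
interface Learner {
    void train(double[] object, int label);   // absorb one labelled example
    double[] forecast(double[] object);       // forecast P(y = j | x) for each j
}

class OnlineProtocol {
    /** Predict each example in turn, then reveal its label and update. */
    static int run(Learner learner, double[][] objects, int[] labels) {
        int errors = 0;
        for (int n = 0; n < objects.length; n++) {
            double[] p = learner.forecast(objects[n]);   // label still withheld
            int guess = 0;
            for (int j = 1; j < p.length; j++) if (p[j] > p[guess]) guess = j;
            if (guess != labels[n]) errors++;            // score the trial
            learner.train(objects[n], labels[n]);        // reveal label, update
        }
        return errors;
    }
}
```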

  • Slide 14/46: Lots of benchmark data

    Tested on data available from the UCI Machine Learning repository:

    Abdominal Pain: 6387 examples, 135 features, 9 classes, noisy

    Diabetes: 768 examples, 8 features, 2 classes

    Heart-Statlog: 270 examples, 13 features, 2 classes

    Wisconsin Breast Cancer: 685 examples, 10 features, 2 classes

    American Votes: 435 examples, 16 features, 2 classes

    Lymphography: 148 examples, 18 features, 4 classes

    Credit Card Applications: 690 examples, 15 features, 2 classes

    Iris Flower: 150 examples, 4 features, 3 classes

    And many more.

  • Slide 15/46: Programs

    Extended the WEKA data mining system implemented in Java:

    Added the VPM meta-learner to the existing library of algorithms

    Allowed learners to be tested in the online setting

    Created Matlab scripts to easily create plots (see later)

  • Slide 16/46: Results, papers and website

    All results that I discuss today can be found in my 3 tech reports:

    The Probability Calibration Graph - a useful visualisation of the reliability of probability forecasts, Lindsay (2004), CLRC-TR-04-01

    Multi-class probability forecasting using the Venn Probability Machine - a comparison with traditional machine learning methods, Lindsay (2004), CLRC-TR-04-02

    Rapid implementation of Venn Probability Machines, Lindsay (2004), CLRC-TR-04-03

    And on my web site: http://www.david-lindsay.co.uk/research.html

  • Slide 17/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 18/46: Loss Functions

    Square loss: $L_{sq}(n) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in Y} \left( I_{\{y_i = j\}} - \hat{p}_{i,j} \right)^2$

    Log loss: $L_{log}(n) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j \in Y} I_{\{y_i = j\}} \log \hat{p}_{i,j}$

    There are many other possible loss functions.

    DeGroot and Fienberg (1982) showed that all loss functions measure a mixture of reliability and resolution.

    Log loss punishes more harshly: the learner is forced to spread its bets.
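    A minimal sketch of the two losses as written above, assuming p[i][j] holds the forecast $\hat{p}_{i,j}$ and y[i] the true label index (the 1/n averaging follows the formulas as reconstructed here):

```java
/** Sketch: square and log loss of probability forecasts over n trials. */
class LossFunctions {

    static double squareLoss(double[][] p, int[] y) {
        double s = 0;
        for (int i = 0; i < y.length; i++)
            for (int j = 0; j < p[i].length; j++) {
                double ind = (y[i] == j) ? 1.0 : 0.0;   // indicator I{y_i = j}
                s += (ind - p[i][j]) * (ind - p[i][j]);
            }
        return s / y.length;
    }

    static double logLoss(double[][] p, int[] y) {
        double s = 0;
        for (int i = 0; i < y.length; i++)
            s -= Math.log(p[i][y[i]]);   // only the true label's term survives
        return s / y.length;             // infinite if a true label was given probability 0
    }
}
```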

  • Slide 19/46: ROC Curves

    [ROC curve: Naïve Bayes on the Abdominal Pain data set]

    1. The graph shows the trade-off between false and true positive predictions.

    2. Want the curve to be as close to the upper left corner as possible (away from the diagonal).

    3. My results show that this graph tests resolution.

    4. The area under the curve provides a measure of the quality of the probability forecasts.
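    For point 4, one standard way to compute the area under the ROC curve treats it as the probability that a randomly chosen positive example receives a higher forecast than a randomly chosen negative one; a minimal binary-case sketch (ties ignored):

```java
import java.util.Arrays;

/** Sketch: area under the ROC curve for binary forecasts, computed as the
 *  probability that a random positive is scored above a random negative. */
class RocArea {
    static double auc(double[] scores, boolean[] positive) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(scores[a], scores[b]));
        long pos = 0, neg = 0, ordered = 0;  // negatives seen so far, correct pairs
        for (int i : idx) {
            if (positive[i]) { pos++; ordered += neg; } else neg++;
        }
        return (pos == 0 || neg == 0) ? Double.NaN
                                      : (double) ordered / (pos * neg);
    }
}
```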

  • Slide 20/46

  • Slide 21/46: Problems with Traditional Assessment

    Loss functions and ROC curves give more information than error rate about the quality of probability forecasts.

    But:

    loss functions = a mixture of resolution and reliability

    ROC curve = measures resolution

    We don't have any method of solely assessing reliability.

    We don't have a method of telling if probability forecasts are over- or under-estimated.

  • Slide 22/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 23/46: Inspiration for PCG (Meteorology)

    Murphy & Winkler (1977): calibration data for precipitation forecasts.

    Reliable points lie close to the diagonal.

  • Slide 24/46: A PCG plot of ZeroR on Abdominal Pain

    [PCG plot: x-axis = predicted probability, y-axis = empirical frequency of being correct, with the line of calibration and the PCG coordinates marked.]

    Reliability: the PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable!

    The plot may not span the whole axis: ZeroR makes no predictions with high probability.
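    A crude binned approximation of PCG coordinates: bucket the forecasts and compare each bucket's mean forecast with its empirical frequency of being correct. This simplification is mine; the PCG proper is defined in Lindsay (2004), CLRC-TR-04-01.

```java
/** Sketch: a crude binned approximation to PCG coordinates. */
class CalibrationBins {

    /** Returns [meanForecast, empiricalFrequency] per bin (NaN if empty). */
    static double[][] bins(double[] forecast, boolean[] correct, int k) {
        double[] sumP = new double[k], hits = new double[k], n = new double[k];
        for (int i = 0; i < forecast.length; i++) {
            int b = Math.min((int) (forecast[i] * k), k - 1);  // which bucket
            sumP[b] += forecast[i];
            if (correct[i]) hits[b]++;
            n[b]++;
        }
        double[][] out = new double[k][2];
        for (int b = 0; b < k; b++) {
            out[b][0] = n[b] > 0 ? sumP[b] / n[b] : Double.NaN;
            out[b][1] = n[b] > 0 ? hits[b] / n[b] : Double.NaN;
        }
        return out;  // reliable forecasts give points near the diagonal
    }
}
```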

  • Slide 25/46: PCG, a visualisation tool and measure of reliability

    PCG statistics:

    |                    | Naïve Bayes | VPM Naïve Bayes |
    |--------------------|-------------|-----------------|
    | Total              | 2764.5      | 496.7           |
    | Mean               | 0.0483      | 0.0087          |
    | Standard deviation | 0.0757      | 0.0112          |
    | Max                | 0.4203      | 0.1017          |
    | Min                | 4.9e-17     | 9.2e-8          |

    Naïve Bayes over- and under-estimates its probabilities, much like real doctors! It is unreliable: a forecast of 0.9 only has a 0.55 chance of being right (over-estimate), and a forecast of 0.1 has a 0.3 chance of being right (under-estimate).

    VPM is reliable, as its PCG follows the diagonal!

  • Slide 26/46: Learners predicting like people!

    [PCG plots: Naïve Bayes vs. people]

    There is lots of psychological research showing that people make unreliable probability forecasts.

  • Slide 27/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 28/46: Table comparing scores with PCG

    | Algorithm       | Error     | Sqr Loss  | Log Loss  | ROC Area  | PCG         |
    |-----------------|-----------|-----------|-----------|-----------|-------------|
    | VPM C4.5        | 40.7 (8)  | 0.54 (5)  | 0.8 (4)   | 0.76 (1)  | 838.1 (4)   |
    | Naïve Bayes     | 29.2 (2)  | 0.50 (4)  | 1.3 (7)   | 0.72 (5)  | 2764.5 (7)  |
    | VPM Naïve Bayes | 28.9 (1)  | 0.44 (1)  | 0.6 (1)   | 0.75 (2)  | 496.7 (1)   |
    | 10-NN           | 33.4 (4)  | 1.0 (11)  | 2.6 (10)  | 0.54 (10) | 5062.9 (11) |
    | 20-NN           | 33.4 (4)  | 0.96 (10) | 2.2 (9)   | 0.55 (9)  | 4492.7 (10) |
    | C4.5            | 39.6 (7)  | 0.67 (7)  | 3.3 (11)  | 0.57 (8)  | 3481.2 (8)  |
    | Neural Net      | 30.5 (3)  | 0.45 (2)  | 0.72 (2)  | 0.75 (3)  | 1320.5 (6)  |
    | 30-NN           | 34.3 (5)  | 0.47 (3)  | 0.73 (3)  | 0.74 (4)  | 921.2 (5)   |
    | VPM 1-NN        | 41.6 (9)  | 0.58 (6)  | 0.9 (5)   | 0.61 (6)  | 554.6 (2)   |
    | 1-NN            | 34.6 (6)  | 0.73 (8)  | 2.1 (8)   | 0.59 (7)  | 4307.5 (9)  |
    | ZeroR           | 55.6 (10) | 0.74 (9)  | 1.1 (6)   | 0.49 (11) | 678.6 (3)   |

    (Ranks in parentheses.)

  • Slide 29/46: Correlations of scores

    | Scores                  | Corr. coeff. | Interpretation    |
    |-------------------------|--------------|-------------------|
    | PCG vs. Sqr Reliability | 0.76         | Direct, strong    |
    | PCG vs. Sqr Resolution  | 0.04         | Direct, none      |
    | PCG vs. Error           | 0.26         | Direct, weak      |
    | ROC vs. Sqr Reliability | -0.1         | Inverse, none     |
    | ROC vs. Sqr Resolution  | 0.67         | Direct, strong    |
    | ROC vs. Error           | -0.52        | Inverse, moderate |

  • Slide 30/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 31/46: What is the VPM meta-learner?

    The VPM sits on top of an existing learner (the VPM meta-learning framework) to complement its predictions with probability estimates.

    Volodya's VPM:

    1. Predicts a label

    2. Produces upper (u) and lower (l) bounds for the predicted label only

    My VPM extension:

    1. Extracts more information

    2. Produces a probability forecast for all possible labels

    3. Predicts a label using these probability forecasts

    4. Produces Volodya's bounds as well!

  • Slide 32/46: Volodya's original use of VPM

    [Plot: error rate and bounds against online trial number. The upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data.]

    |           | Errors | Rate  |
    |-----------|--------|-------|
    | Up Error  | 2216.5 | 34.7% |
    | Error     | 1835   | 28.9% |
    | Low Error | 1414.1 | 22.1% |

  • Slide 33/46: Output from VPM compared with that of the original underlying learner

    Key: Predicted = underlined, Actual = ...

    Naïve Bayes:

    | Trial# | Appx    | Div.   | Perf. Pept. | Non. Spec | Choli  | Intest obstr | Pancr   | Renal.  | Dysp.  | Up | Low |
    |--------|---------|--------|-------------|-----------|--------|--------------|---------|---------|--------|----|-----|
    | 5831   | 0.93    | 2.9e-9 | 1.7e-13     | 0.07      | 1.3e-9 | 2.2e-9       | 4.0e-11 | 6.3e-10 | 7.6e-9 | NA | NA  |
    | 2490   | 9.4e-5  | 0.01   | 0.17        | 2.3e-5    | 0.16   | 0.46         | 0.2     | 2.2e-7  | 2.2e-4 | NA | NA  |
    | 1653   | 3.08e-9 | 4.5e-6 | 3.3e-6      | 4.4e-5    | 0.99   | 4.2e-3       | 3.4e-3  | 4.1e-10 | 1.3e-4 | NA | NA  |

    VPM Naïve Bayes:

    | Trial# | Appx | Div. | Perf. Pept. | Non. Spec | Choli | Intest obstr | Pancr | Renal. | Dysp. | Up   | Low  |
    |--------|------|------|-------------|-----------|-------|--------------|-------|--------|-------|------|------|
    | 5831   | 0.53 | 0.01 | 0.0         | 0.42      | 0.01  | 0.01         | 0.0   | 0.01   | 0.01  | 0.68 | 0.41 |
    | 2490   | 0.02 | 0.03 | 0.10        | 0.07      | 0.05  | 0.15         | 0.08  | 0.09   | 0.4   | 0.71 | 0.07 |
    | 1653   | 0.03 | 0.0  | 0.03        | 0.08      | 0.73  | 0.0          | 0.04  | 0.01   | 0.09  | 0.82 | 0.08 |

    (Up/Low are the VPM's bounds; the underlying Naïve Bayes provides none.)

  • Slide 34/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 35/46: ZeroR

    [PCG plots of ZeroR on Heart Disease, Lymphography and Diabetes]

    ZeroR outputs probability forecasts which are mere label frequencies.

    ZeroR predicts the majority class label at each trial.

    It uses no information about the objects in its learning: the simplest of all learners (sketched below).

    Accuracy is poor, but reliability is good.
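    A sketch of ZeroR as a probability forecaster, directly from the description above (class and method names are illustrative):

```java
/** Sketch of ZeroR: forecasts are the label frequencies observed so far. */
class ZeroR {
    private final int[] counts;
    private int seen = 0;

    ZeroR(int numLabels) { counts = new int[numLabels]; }

    void train(int label) { counts[label]++; seen++; }

    /** P(y = j) = frequency of label j so far (uniform before any training). */
    double[] forecast() {
        double[] p = new double[counts.length];
        for (int j = 0; j < p.length; j++)
            p[j] = seen == 0 ? 1.0 / p.length : (double) counts[j] / seen;
        return p;
    }
}
```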

  • Slide 36/46: K-NN

    [PCG plots of 10-NN, 20-NN and 30-NN]

    K-NN finds the subset of the K closest (nearest neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset (see the sketch below).

    It acts like a more sophisticated version of ZeroR that uses the information held in the object.

    An appropriate choice of K must be made to obtain reliable probability forecasts (it depends on the data).
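    A sketch of the K-NN forecaster just described, using squared Euclidean distance as the metric (any distance metric would do; names illustrative):

```java
import java.util.Arrays;

/** Sketch: K-NN as a probability forecaster. Find the K nearest training
 *  examples, then forecast with the label frequencies among them. */
class KnnForecaster {
    static double[] forecast(double[][] X, int[] y, int numLabels,
                             double[] query, int k) {
        Integer[] idx = new Integer[X.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by distance to the query object.
        Arrays.sort(idx, (a, b) ->
                Double.compare(dist(X[a], query), dist(X[b], query)));
        double[] p = new double[numLabels];
        for (int i = 0; i < k; i++) p[y[idx[i]]] += 1.0 / k;  // label frequencies
        return p;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;  // squared Euclidean distance (monotone in the true distance)
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```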


  • Slide 37/46: Traditional Learners and VPM

    Traditional learners can be very unreliable (yet accurate); it depends on the data.

    My research shows empirically that VPM is reliable.

    And it can recalibrate a learner's original probability forecasts to make them more reliable!

    The improvement in reliability often comes without detriment to classification accuracy.

    [PCG plots: Naïve Bayes vs. VPM Naïve Bayes, C4.5 vs. VPM C4.5, Neural Net vs. VPM Neural Net, 1-NN vs. VPM 1-NN]

  • Slide 38/46: Back to the plan

    What is probability forecasting? Reliability and resolution criteria

    Experimental design

    Problems with traditional assessment methods: square loss, log loss and ROC curves

    Probability Calibration Graph (PCG)

    Traditional learners are unreliable yet accurate!

    Extension of Venn Probability Machine (VPM)

    Which learners are reliable?

    Psychological and theoretical viewpoint

  • Slide 39/46: Psychological Heuristics

    When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones:

    Availability - an event is predicted more likely to occur if it has occurred frequently in the past.

    Representativeness - one compares the essential features of the event to those of the structure of previous events.

    Simulation - the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state.

  • Slide 40/46: Interpretation of reliable learners using heuristics

    ZeroR, K-NN and VPM learners are reliable probability forecasters.

    We can identify heuristics in these learning algorithms.

    Remember, the psychological research states: more heuristics, more reliable forecasts.

  • Slide 41/46: Psychological Interpretation of ZeroR

    The simplest of all reliable probability forecasters uses 1 heuristic:

    The learner merely counts the labels it has observed so far, and uses the frequencies of labels as its forecasts (Availability).

  • Slide 42/46: Psychological Interpretation of K-NN

    More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics:

    It uses the distance metric to find the subset of the K closest examples in the training set (Representativeness).

    Then it counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability).

  • Slide 43/46: Psychological Interpretation of VPM

    Even more sophisticated, the VPM meta-learner uses all 3 heuristics:

    The VPM tries each new test example with all possible classifications (Simulation).

    Then, under each tentative simulation, it clusters training examples which are similar into groups (Representativeness).

    Finally, the VPM calculates the frequency of labels in each of these groups to make its forecasts (Availability). (A sketch of these three steps follows.)
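    A sketch of these three steps, assuming a simple 1-nearest-neighbour taxonomy (my choice for illustration; a VPM can use many taxonomies). It stops at the per-label frequency vectors; a full VPM aggregates them into the final forecast and bounds:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the VPM's three heuristic steps with a 1-NN taxonomy. */
class VennSketch {

    /** Row t = label frequencies in the test object's group when the test
     *  object is tentatively labelled t. */
    static double[][] multiProbability(double[][] X, int[] y,
                                       int numLabels, double[] test) {
        double[][] rows = new double[numLabels][];
        for (int tentative = 0; tentative < numLabels; tentative++) {
            // Simulation: try the test example with this classification.
            List<double[]> objs = new ArrayList<>(List.of(X));
            objs.add(test);
            int[] labels = new int[y.length + 1];
            System.arraycopy(y, 0, labels, 0, y.length);
            labels[y.length] = tentative;

            // Representativeness: group examples by the label of their
            // nearest neighbour in the augmented set (the taxonomy).
            int n = objs.size();
            int[] group = new int[n];
            for (int i = 0; i < n; i++) {
                int nn = -1;
                double best = Double.MAX_VALUE;
                for (int j = 0; j < n; j++) {
                    if (i == j) continue;
                    double d = dist(objs.get(i), objs.get(j));
                    if (d < best) { best = d; nn = j; }
                }
                group[i] = labels[nn];
            }

            // Availability: label frequencies in the test object's group.
            double[] freq = new double[numLabels];
            int size = 0;
            for (int i = 0; i < n; i++)
                if (group[i] == group[n - 1]) { freq[labels[i]]++; size++; }
            for (int j = 0; j < numLabels; j++) freq[j] /= size;
            rows[tentative] = freq;
        }
        return rows;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;  // squared Euclidean distance
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```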

  • Slide 44/46: Theoretical justifications

    ZeroR can be proven to be asymptotically reliable (and experiments show it does well on finite data).

    K-NN has lots of theory, e.g. Stone (1977), to support its convergence to the true probability distribution.

    VPM has a lot of theoretical justification for finite data, using martingales.

  • Slide 45/46: Take home points

    Probability forecasting is useful for real life applications, especially medicine.

    We want learners to be reliable and accurate.

    The PCG can be used to check reliability.

    ZeroR, K-NN and VPM provide consistently reliable probability forecasts.

    Traditional learners (Naïve Bayes, Neural Net and Decision Tree) can provide unreliable forecasts.

    VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.

  • Slide 46/46: Acknowledgments

    Supervision: Alex Gammerman, Volodya Vovk, Zhiyuan Luo

    Mathematical advice: Daniil Riabko, Volodya Vovk, Teo Sharia

    Proofreading: Zhiyuan Luo, Siân Cox

    Graphics & design: Siân Cox

    Catering: Siân Cox

    Fin