
Page 1:

Bayesian Treatment of Incomplete Discrete Data applied to Mutual Information and Feature Selection

Marcus Hutter & Marco Zaffalon

IDSIA
Galleria 2, 6928 Manno (Lugano), Switzerland

www.idsia.ch/~{marcus,zaffalon}
{marcus,zaffalon}@idsia.ch

Page 2:

Abstract

Given the joint chances of a pair of random variables one can compute quantities of interest, like the mutual information. The Bayesian treatment of unknown chances involves computing, from a second order prior distribution and the data likelihood, a posterior distribution of the chances. A common treatment of incomplete data is to assume ignorability and determine the chances by the expectation maximization (EM) algorithm. The two different methods above are well established but typically separated. This paper joins the two approaches in the case of Dirichlet priors, and derives efficient approximations for the mean, mode and the (co)variance of the chances and the mutual information. Furthermore, we prove the unimodality of the posterior distribution, whence the important property of convergence of EM to the global maximum in the chosen framework. These results are applied to the problem of selecting features for incremental learning and naive Bayes classification. A fast filter based on the distribution of mutual information is shown to outperform the traditional filter based on empirical mutual information on a number of incomplete real data sets.

Keywords

Incomplete data, Bayesian statistics, expectation maximization, global optimization, mutual information, cross-entropy, Dirichlet distribution, second-order distribution, credible intervals, expectation and variance of mutual information, missing data, robust feature selection, filter approach, naive Bayes classifier.

Page 3:

Mutual Information (MI)

Consider two discrete random variables (ı, ȷ)

(In)Dependence often measured by MI
– Also known as cross-entropy or information gain
– Examples
  Inference of Bayesian nets, classification trees
  Selection of relevant variables for the task at hand

$\pi_{ij}$ = joint chance of $(i,j)$, with $i = 1,\dots,r$ and $j = 1,\dots,s$
$\pi_{i+} = \sum_j \pi_{ij}$ = marginal chance of $i$
$\pi_{+j} = \sum_i \pi_{ij}$ = marginal chance of $j$

$I(\pi) = \sum_{ij} \pi_{ij} \log \dfrac{\pi_{ij}}{\pi_{i+}\pi_{+j}} \;\ge\; 0$
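
As an illustration of this definition (not from the slides), here is a minimal Python sketch; it assumes NumPy and a joint chance matrix `pi` with rows indexed by i and columns by j:

```python
import numpy as np

def mutual_information(pi):
    """Mutual information I(pi) of a joint chance matrix pi (shape r x s), in nats.

    Cells with pi_ij = 0 contribute nothing, using the convention 0 * log 0 = 0.
    """
    pi = np.asarray(pi, dtype=float)
    pi_i = pi.sum(axis=1, keepdims=True)   # marginal chances pi_{i+}
    pi_j = pi.sum(axis=0, keepdims=True)   # marginal chances pi_{+j}
    mask = pi > 0
    return float(np.sum(pi[mask] * np.log(pi[mask] / (pi_i @ pi_j)[mask])))

# Independent variables give I = 0; dependence gives I > 0.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # ~0.0
print(mutual_information([[0.45, 0.05], [0.05, 0.45]]))  # > 0
```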

Page 4:

MI-Based Feature-Selection Filter (F)
Lewis, 1992

Classification
– Predicting the class value given values of features
– Features (or attributes) and class = random variables
– Learning the rule 'features → class' from data

Filter's goal: removing irrelevant features
– More accurate predictions, easier models

MI-based approach
– Remove a feature if the class does not depend on it: $I(\pi) = 0$
– Or: remove it if $I(\pi) \le \varepsilon$

$\varepsilon$ is an arbitrary threshold of relevance
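
As a sketch (mine, not the slides'), the plain filter F is then just a threshold test on per-feature MI scores; the feature names and the value of `eps` below are purely illustrative:

```python
def filter_F(mi_per_feature, eps=0.01):
    """Keep features whose (empirical) MI with the class exceeds the threshold eps."""
    return [f for f, mi in mi_per_feature.items() if mi > eps]

# Illustrative MI scores only.
print(filter_F({"f1": 0.12, "f2": 0.001, "f3": 0.08}))  # ['f1', 'f3']
```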

Page 5:

Empirical Mutual Information
a common way to use MI in practice

Data ($\mathbf{n}$) → contingency table:

i \ j    1       2       …     s
1        n_11    n_12    …     n_1s
2        n_21    n_22    …     n_2s
⋮        ⋮       ⋮             ⋮
r        n_r1    n_r2    …     n_rs

$n_{ij}$ = # of times $(i,j)$ occurred
$n_{i+} = \sum_j n_{ij}$ = # of times $i$ occurred
$n_{+j} = \sum_i n_{ij}$ = # of times $j$ occurred
$n = \sum_{ij} n_{ij}$ = dataset size

– Empirical (sample) probability: $\hat{\pi}_{ij} = n_{ij}/n$
– Empirical mutual information: $I(\hat{\pi})$

Problems of the empirical approach
– $I(\hat{\pi}) > 0$ due to random fluctuations? (finite sample)
– How to know if it is reliable, e.g. by $p(I \mid \mathbf{n})$?
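
Putting the definitions above together, a small sketch (mine) of the empirical quantities, assuming NumPy; the example counts are one of the tables used later on the MI-density slide:

```python
import numpy as np

def empirical_mi(counts):
    """Empirical mutual information I(pi_hat) from a contingency table of counts n_ij."""
    n_ij = np.asarray(counts, dtype=float)
    pi = n_ij / n_ij.sum()                    # pi_hat_ij = n_ij / n
    pi_i = pi.sum(axis=1, keepdims=True)      # pi_hat_{i+}
    pi_j = pi.sum(axis=0, keepdims=True)      # pi_hat_{+j}
    m = pi > 0
    return float(np.sum(pi[m] * np.log(pi[m] / (pi_i @ pi_j)[m])))

print(empirical_mi([[40, 10], [20, 80]]))     # point estimate, in nats
```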

Page 6:

Incomplete Samples

Missing features/classes
– Missing class: $(i,?)$ → $n_{i?}$ = # of instances with feature value $i$ and a missing class label
– Missing feature: $(?,j)$ → $n_{?j}$ = # of instances with class $j$ and a missing feature value
– Total sample size: $N = \sum_{ij} n_{ij} + \sum_i n_{i?} + \sum_j n_{?j}$

MAR assumption: $\pi_{i?} = \pi_{i+}$, $\pi_{?j} = \pi_{+j}$

– General case: missing features and classes
  EM + closed-form leading-order (in $N^{-1}$) expressions
– Missing features only
  Closed-form leading-order expressions for mean and variance
  Complexity $O(rs)$
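
For the general case, a minimal EM sketch under the MAR assumption; this is a generic multinomial EM written for illustration, not the authors' implementation, and the smoothing parameter `alpha` is my addition:

```python
import numpy as np

def em_chances(n_ij, n_i_missing, n_missing_j, iters=200, alpha=1.0):
    """EM estimate of the joint chances pi_ij from complete counts n_ij,
    counts n_i? (class missing) and n_?j (feature missing), assuming MAR.
    alpha is a symmetric Dirichlet pseudo-count (alpha = 1 gives plain ML)."""
    n_ij = np.asarray(n_ij, dtype=float)
    n_i_missing = np.asarray(n_i_missing, dtype=float)    # length r
    n_missing_j = np.asarray(n_missing_j, dtype=float)    # length s
    r, s = n_ij.shape
    pi = np.full((r, s), 1.0 / (r * s))                   # uniform start
    for _ in range(iters):
        # E-step: spread each incomplete record over the cells compatible with it.
        p_j_given_i = pi / pi.sum(axis=1, keepdims=True)
        p_i_given_j = pi / pi.sum(axis=0, keepdims=True)
        expected = (n_ij
                    + n_i_missing[:, None] * p_j_given_i
                    + n_missing_j[None, :] * p_i_given_j)
        # M-step: renormalize the expected counts (MAP with pseudo-count alpha - 1).
        pi = np.clip(expected + (alpha - 1.0), 1e-12, None)
        pi /= pi.sum()
    return pi

# Tiny made-up example: 2x2 complete counts plus a few incomplete records.
print(em_chances([[40, 10], [20, 80]], n_i_missing=[5, 3], n_missing_j=[2, 4]))
```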

Page 7:

We Need the Distribution of MI

Bayesian approach
– Prior distribution $p(\pi)$ for the unknown chances (e.g., Dirichlet)
– Posterior: $p(\pi \mid \mathbf{n}) \propto p(\pi) \prod_{ij} \pi_{ij}^{n_{ij}} \prod_i \pi_{i+}^{n_{i?}} \prod_j \pi_{+j}^{n_{?j}}$

Posterior probability density of MI:
$p(I \mid \mathbf{n}) = \int \delta\bigl(I(\pi) - I\bigr)\, p(\pi \mid \mathbf{n})\, d\pi$

How to compute it?
– Fitting a curve using mode and approximate variance
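
For intuition, $p(I \mid \mathbf{n})$ can also be approximated by brute-force sampling in the complete-data Dirichlet case; this Monte Carlo sketch is my own illustration, not the curve-fitting method of the slides, and it assumes a symmetric Dirichlet prior with parameter `prior`:

```python
import numpy as np

def mi(pi):
    pi_i = pi.sum(axis=1, keepdims=True)
    pi_j = pi.sum(axis=0, keepdims=True)
    m = pi > 0
    return float(np.sum(pi[m] * np.log(pi[m] / (pi_i @ pi_j)[m])))

def mi_posterior_samples(counts, prior=1.0, n_samples=10_000, seed=0):
    """Draw pi ~ Dirichlet(n_ij + prior) and return samples of I(pi) (complete data)."""
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    draws = rng.dirichlet((counts + prior).ravel(), size=n_samples)
    return np.array([mi(d.reshape(counts.shape)) for d in draws])

samples = mi_posterior_samples([[40, 10], [20, 80]])
print(samples.mean(), samples.std())   # approximate posterior mean and sd of I
```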

Page 8:

Mean and Variance of π and I
(missing features only)

Exact mode = leading-order mean of the chances:
– $E[\pi_{ij} \mid \mathbf{n}] = \hat{\pi}_{ij} + O(N^{-1})$

Leading covariance:
– $N \,\mathrm{Cov}(\pi_{ij}, \pi_{kl} \mid \mathbf{n})$ has a closed-form expression in $\hat{\pi}$ and auxiliary quantities $Q_i$

Exact mode = leading-order mean of the mutual information:
– $E[I \mid \mathbf{n}] = I(\hat{\pi}) + O(N^{-1})$

Leading variance:
– $N \,\mathrm{Var}(I \mid \mathbf{n})$ has a closed-form expression in $\hat{\pi}$ and the quantities $J$, $K$, $P$, $Q$ built from $\log\bigl(\hat{\pi}_{ij}/(\hat{\pi}_{i+}\hat{\pi}_{+j})\bigr)$ terms

Missing features & classes: EM converges globally, since $p(\pi \mid \mathbf{n})$ is unimodal
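
For orientation, the complete-data special case has simple leading-order expressions (Hutter's earlier result on the distribution of MI): $E[I \mid \mathbf{n}] \approx I(\hat{\pi}) + (r-1)(s-1)/(2n)$ and $\mathrm{Var}[I \mid \mathbf{n}] \approx (K - J^2)/n$, with $J = \sum_{ij} \hat{\pi}_{ij} \log(\hat{\pi}_{ij}/(\hat{\pi}_{i+}\hat{\pi}_{+j}))$ and $K$ the corresponding sum of squared logs. The sketch below is my illustration of that special case, not the incomplete-data formulas of this slide:

```python
import numpy as np

def mi_mean_var_complete(counts):
    """Leading-order posterior mean and variance of I for complete data:
       E[I|n]   ~ I(pi_hat) + (r - 1)(s - 1) / (2 n)
       Var[I|n] ~ (K - J^2) / n."""
    n_ij = np.asarray(counts, dtype=float)
    n = n_ij.sum()
    r, s = n_ij.shape
    pi = n_ij / n
    pi_i = pi.sum(axis=1, keepdims=True)
    pi_j = pi.sum(axis=0, keepdims=True)
    m = pi > 0
    log_ratio = np.log(pi[m] / (pi_i @ pi_j)[m])
    J = float(np.sum(pi[m] * log_ratio))          # equals the empirical MI
    K = float(np.sum(pi[m] * log_ratio ** 2))
    return J + (r - 1) * (s - 1) / (2 * n), (K - J ** 2) / n

print(mi_mean_var_complete([[40, 10], [20, 80]]))
```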

Page 9:

MI Density Example Graphs
(complete sample)

[Figure: distribution of mutual information for Dirichlet priors. The plot shows the density $p(I \mid \mathbf{n})$ against $I$ for three contingency tables, $\mathbf{n} = [(40,10),(20,80)]$, $\mathbf{n} = [(20,5),(10,40)]$ and $\mathbf{n} = [(8,2),(4,16)]$, comparing the exact density with its Gauss, Gamma and Beta approximations.]

Page 10:

Robust Feature Selection

Filters: two new proposals
– FF: include a feature iff $p(I > \varepsilon \mid \mathbf{n}) > 0.95$
  (include iff "proven" relevant)
– BF: exclude a feature iff $p(I \le \varepsilon \mid \mathbf{n}) > 0.95$
  (exclude iff "proven" irrelevant)

Examples

[Figure: three example posterior densities of $I$ relative to the threshold $\varepsilon$, illustrating the cases "FF includes, BF includes", "FF excludes, BF includes", and "FF excludes, BF excludes".]
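
A hedged sketch of how the FF/BF decisions could be implemented from the approximate moments, using a Gaussian fit to $p(I \mid \mathbf{n})$ (one of the curve families shown on the density slide) and the complete-data `mi_mean_var_complete` helper from the earlier sketch; the threshold, credibility level and counts are illustrative:

```python
from math import erf, sqrt

def p_mi_above(counts, eps):
    """Approximate p(I > eps | n) with a Gaussian fitted to the leading-order moments."""
    mean, var = mi_mean_var_complete(counts)       # helper from the earlier sketch
    z = (eps - mean) / sqrt(max(var, 1e-12))
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))        # 1 - Phi(z)

def ff_includes(counts, eps=0.02, level=0.95):
    return p_mi_above(counts, eps) > level          # "proven" relevant

def bf_excludes(counts, eps=0.02, level=0.95):
    return 1.0 - p_mi_above(counts, eps) > level    # "proven" irrelevant

table = [[40, 10], [20, 80]]
print(ff_includes(table), bf_excludes(table))
```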

Page 11:

Comparing the Filters

Experimental set-up
– Filter (F, FF, BF) + naive Bayes classifier
– Sequential learning and testing (see the sketch below)

Collected measures for each filter
– Average # of correct predictions (prediction accuracy)
– Average # of features used

[Diagram: sequential evaluation loop. For each test instance $k$ ($k = 1, \dots, N$), the filter selects features from the learning data seen so far, the naive Bayes classifier classifies the test instance, and the instance is stored in the learning data after classification before moving on to instance $k+1$.]
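
A minimal sketch (mine) of this sequential protocol; `filter_features`, `fit_naive_bayes` and `predict` are stand-ins for whichever filter and classifier are plugged in, not functions from the paper:

```python
def sequential_evaluation(instances, labels, filter_features, fit_naive_bayes, predict):
    """Classify each instance with a model trained on all earlier instances,
    then store it in the learning data; return accuracy and avg. # of features used."""
    seen_x, seen_y = [], []
    correct, features_used, tests = 0, 0, 0
    for x, y in zip(instances, labels):
        if seen_x:                                    # need some learning data first
            kept = filter_features(seen_x, seen_y)    # F, FF or BF
            model = fit_naive_bayes(seen_x, seen_y, kept)
            correct += int(predict(model, x, kept) == y)
            features_used += len(kept)
            tests += 1
        seen_x.append(x)                              # store after classification
        seen_y.append(y)
    return correct / max(tests, 1), features_used / max(tests, 1)
```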

Page 12:

Results on 10 Complete Datasets

# of used features

Accuracies NOT significantly different
– Except Chess & Spam with FF

# Instances   # Features   Dataset         FF      F       BF
690           36           Australian      32.6    34.3    35.9
3196          36           Chess           12.6    18.1    26.1
653           15           Crx             11.9    13.2    15.0
1000          17           German-org      5.1     8.8     15.2
2238          23           Hypothyroid     4.8     8.4     17.1
3200          24           Led24           13.6    14.0    24.0
148           18           Lymphography    18.0    18.0    18.0
5800          8            Shuttle-small   7.1     7.7     8.0
1101          21611        Spam            123.1   822.0   13127.4
435           16           Vote            14.0    15.2    16.0

Page 13:

Results on 10 Complete Datasets (ctd.)

[Bar chart: percentages of used features for FF, F and BF on Australian, Chess, Crx, German-org, Hypothyroid, Led24, Lymphography, Shuttle-small, Spam and Vote.]

Page 14:

FF: Significantly Better Accuracies

Chess

[Plot: prediction accuracy on Chess vs. instance number (0 to 3000) for FF and F.]

Spam

[Plots: prediction accuracy on Spam vs. instance number (0 to 1100) for F, FF and BF, and average number of excluded features on Spam vs. instance number for F, FF and BF.]

Page 15:

Results on 5 Incomplete Data Sets

[Bar chart: percentages of used features for FF, F and BF on Audiology, Crx, Horse-Colic, Hypothyroidloss and Soybean-large. Plot: prediction accuracy on Hypothyroidloss vs. instance number (0 to 3000) for F and FF.]

# Instances   # Features   # Missing values   Dataset           FF     F      BF
226           69           317                Audiology         64.3   68.0   68.7
690           15           67                 Crx               9.7    12.6   13.8
368           18           1281               Horse-Colic       11.8   16.1   17.4
3163          23           1980               Hypothyroidloss   4.3    8.3    13.2
683           35           2337               Soybean-large     34.2   35.0   35.0

Page 16:

Conclusions

Expressions for several moments of the π and MI distributions, even for incomplete categorical data
– The distribution can be approximated well
– Safer inferences, with the same computational complexity as empirical MI
– Why not use it?

Robust feature selection shows the power of the MI distribution
– FF outperforms the traditional filter F

Many useful applications possible
– Inference of Bayesian nets
– Inference of classification trees
– …