
MML, inverse learning and medical data-sets

Pritika Sanghi

Supervisors: A./Prof. D. L. Dowe, Dr P. E. Tischer


Overview

What is this project about?
Bayesian Networks and their limitations
Some techniques
    Factor Analysis
    Minimum Message Length (MML)
    Decision Trees & Graphs
    Logistic Regression
Improving Bayesian Networks
What is being done in this project?


What is this project about?

The aim of the project is to enhance Bayesian Networks in general and then apply them to certain medical data-sets.

These data-sets have a large number of attributes and small number of cases.

This makes it difficult to model these data-sets using Bayesian Networks.


Bayesian Networks

A popular tool for Data Mining.

Model data to infer the probability of a certain outcome.

They represent the frequency distributions for the values that an attribute can take as Conditional Probability Distributions.

P(WS) = 0.75

P(GO) = 0.50

WS   GO   P(S | WS, GO)
T    T    0.01
T    F    0.80
F    T    0.40
F    F    0.99

S    P(A | S)
T    0.95
F    0.00
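As a minimal sketch of how such a network answers a query, the marginal probabilities of S and A can be computed by summing over the parent values. The sketch assumes the network structure implied by the tables (WS and GO are independent root nodes, both are parents of S, and S is the parent of A) and assumes the four values of P(S | WS, GO) pair with the rows in the order listed. The function names are illustrative only.

```python
from itertools import product

# Priors for the two root nodes, as given on the slide.
p_ws = 0.75
p_go = 0.50

# P(S = T | WS, GO); the row-to-value pairing is an assumption.
p_s_given = {
    (True, True): 0.01,
    (True, False): 0.80,
    (False, True): 0.40,
    (False, False): 0.99,
}

# P(A = T | S)
p_a_given = {True: 0.95, False: 0.00}


def prior(p_true, value):
    """Probability that a binary root node takes the given value."""
    return p_true if value else 1.0 - p_true


def marginal_s():
    """P(S = T), summing over all combinations of WS and GO."""
    return sum(
        prior(p_ws, ws) * prior(p_go, go) * p_s_given[(ws, go)]
        for ws, go in product([True, False], repeat=2)
    )


def marginal_a():
    """P(A = T), summing over both values of S."""
    p_s = marginal_s()
    return p_a_given[True] * p_s + p_a_given[False] * (1.0 - p_s)


print("P(S = T) =", round(marginal_s(), 6))
print("P(A = T) =", round(marginal_a(), 6))
```

Under these assumptions P(S = T) comes to 0.4775 and P(A = T) to about 0.4536.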


Bayesian Networks - Limitations

When a child node depends on a large number of parent attributes, the conditional probability distribution (CPD) becomes very complex: there are 2^n rows in the CPD for n binary parent attributes.

This makes both creating the CPD and performing inference with it very time consuming.

A more compact representation for CPDs is required.
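To make the growth concrete, here is a small sketch that counts the rows of a full CPD table. The helper `cpd_rows` is a hypothetical name; it generalises the 2^n count to parents with any number of values by multiplying their arities.

```python
def cpd_rows(parent_arities):
    """Number of rows in a full CPD table: the product of the parent arities."""
    rows = 1
    for arity in parent_arities:
        rows *= arity
    return rows


# For n binary parents the count is 2^n.
for n in (2, 5, 10, 20, 30):
    print(n, "binary parents ->", cpd_rows([2] * n), "rows")
```

With small data-sets, most of these rows would have no supporting cases at all, which is one reason a compact representation is attractive.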


Factor Analysis

Multiple attributes may be defined by a common factor.

The Wallace and Freeman model for Single Factor Analysis will be implemented.

This serves as dimensionality reduction.

The validity of the program built will be checked using the data-sets specified in the Wallace and Freeman paper.

Attributes A and B have a common factor F1.

Attributes C, D and E have a common factor F2.


Factor Analysis: Height-Weight of Footy Players

[Figure: three plots. Weight against Height; Predicted Weight against Actual Weight; Predicted Height against Actual Height.]


Factor Analysis

The equation for Single Factor Analysis as defined by Wallace and Freeman is:

x_{nk} = \mu_k + a_k v_n + \sigma_k r_{nk}

where x_{nk} is the data (attribute k of record n), \mu_k is the mean, a_k is the attribute related term, v_n is the record related term, \sigma_k is the standard deviation, and r_{nk} are random variates N(0,1).

Size     Height    Weight
Large    Tall      Average
Large    Short     Heavy
Medium   Average   Average
Small    Short     Light
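The single factor equation on this slide can be read as a recipe for generating data. The sketch below simulates records from it; it is a forward simulation only, not the MML estimation procedure of Wallace and Freeman. The parameter values and the function name `generate_records` are hypothetical, and drawing the record related terms from N(0,1) is an assumption.

```python
import random


def generate_records(mu, a, sigma, factor_scores, rng):
    """Generate x_nk = mu_k + a_k * v_n + sigma_k * r_nk for every record n.

    mu, a, sigma hold one value per attribute k; factor_scores holds one
    record related term v_n per record; r_nk is drawn from N(0, 1).
    """
    records = []
    for v_n in factor_scores:
        record = [
            mu_k + a_k * v_n + sigma_k * rng.gauss(0.0, 1.0)
            for mu_k, a_k, sigma_k in zip(mu, a, sigma)
        ]
        records.append(record)
    return records


rng = random.Random(0)

# Hypothetical parameters for two attributes: height (cm) and weight (kg).
mu = [185.0, 85.0]
a = [6.0, 8.0]
sigma = [3.0, 4.0]

# One "size" score per player (assumed N(0, 1)).
factor_scores = [rng.gauss(0.0, 1.0) for _ in range(5)]

for record in generate_records(mu, a, sigma, factor_scores, rng):
    print([round(x, 1) for x in record])
```

A single score per record moves both attributes together, which is how one factor such as size can account for correlated height and weight.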


The Minimum Message Length (MML) Principle

Models the data as a two-part message consisting of a hypothesis, H, and the data, D, encoded using that hypothesis.

The best model is the one with the minimum message length.

Minimising the message length is equivalent to maximising the posterior probability of the hypothesis given the data, Pr(H|D), because the message length is the negative log of the joint probability: -log Pr(H) - log Pr(D|H).

The message is represented as:

Hypothesis | Data
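A toy illustration of the two-part message: the sketch below compares a few candidate hypotheses about a coin's bias and picks the one with the shortest total message. It assumes a small discrete set of hypotheses with made-up prior probabilities and made-up data, so it shows only the basic trade-off between the two parts; it does not implement the MML approximations used for continuous parameters.

```python
import math


def message_length(prior_h, likelihood_d_given_h):
    """Two-part message length in bits: -log2 Pr(H) - log2 Pr(D|H)."""
    return -math.log2(prior_h) - math.log2(likelihood_d_given_h)


def bernoulli_likelihood(p_heads, heads, tails):
    """Probability of one particular sequence with the given counts."""
    return (p_heads ** heads) * ((1.0 - p_heads) ** tails)


# Hypothetical hypotheses (coin bias) and their prior probabilities.
priors = {0.5: 0.6, 0.8: 0.3, 0.95: 0.1}

# Hypothetical data: 8 heads and 2 tails.
heads, tails = 8, 2

lengths = {
    p: message_length(prior_p, bernoulli_likelihood(p, heads, tails))
    for p, prior_p in priors.items()
}
best = min(lengths, key=lengths.get)

for p, bits in lengths.items():
    print(f"bias {p}: {bits:.3f} bits")
print("shortest message: bias", best)
```

The hypothesis with the shortest message is also the one with the highest posterior probability, since the two criteria differ only by the constant -log Pr(D).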


Decision Trees and Graphs

Graphical way of representing the output attribute in terms of the input attributes.

Used to model the Conditional Probability Distribution of the Bayesian Network.

Decision graphs are generalisations of decision trees: they can merge similar sub-trees.
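A small sketch of why a tree can be more compact than a full table: the hypothetical tree below gives P(X = T) for a child X with three binary parents A, B and C using three leaves, whereas the equivalent full CPD needs eight rows. The attribute names, the probabilities and the tuple encoding are all illustrative assumptions.

```python
from itertools import product

# Internal node: (attribute, subtree_if_true, subtree_if_false).
# Leaf: a float giving P(X = T). Attribute C is never tested.
tree = ("A", 0.9, ("B", 0.6, 0.1))


def leaf_probability(node, assignment):
    """Walk from the root to a leaf, following the parent values."""
    while isinstance(node, tuple):
        attribute, if_true, if_false = node
        node = if_true if assignment[attribute] else if_false
    return node


def count_leaves(node):
    """Number of leaves, i.e. distinct probabilities the tree stores."""
    if not isinstance(node, tuple):
        return 1
    return count_leaves(node[1]) + count_leaves(node[2])


# Expand the tree into the full table it stands for.
full_table = {
    values: leaf_probability(tree, dict(zip("ABC", values)))
    for values in product([True, False], repeat=3)
}

print("leaves in tree:", count_leaves(tree))
print("rows in full table:", len(full_table))
```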


Logistic Regression

Mathematical modelling approach used for describing the dependence of a variable on other attributes.

Will be used to define the probability of a discrete target attribute as a function of continuous attributes.

f(z) = 1 / (1 + e^{-z}) + c
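A minimal sketch of the logistic function and of using it to turn continuous attributes into a probability. The slide does not define the constant c, so it is treated here as an optional offset defaulting to zero; the weights, bias and attribute values are hypothetical.

```python
import math


def logistic(z, c=0.0):
    """f(z) = 1 / (1 + e^(-z)) + c, computed without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z)) + c
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z) + c


def probability_of_target(attributes, weights, bias):
    """P(target = T) as a logistic function of a weighted sum of attributes."""
    z = bias + sum(w * x for w, x in zip(weights, attributes))
    return logistic(z)


# Hypothetical model with two continuous attributes.
weights = [0.8, -0.5]
bias = 0.1
print(probability_of_target([1.2, 0.4], weights, bias))
```

With c = 0 the output always lies between 0 and 1, so it can be used directly as the probability of one value of a binary target attribute.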


Improving Bayesian Networks

Comley and Dowe (2003, 2004), building on ideas from Dowe and Wallace (1998), began the work of enhancing Bayesian Networks and introduced Generalised Bayesian Networks.

This project will extend their work by applying some of the techniques described earlier to Bayesian Networks.


What is being done in this project?

Refinement of Generalised Bayesian Networks. Specifically:

First, MML Single Factor Analysis will be added to Bayesian Networks.

Then, Logistic Regression will be looked into.

The Generalised Bayesian Networks will then be used to infer models from some medical data-sets such as breast cancer data-sets.

If time permits, which it almost definitely won’t, other methods of dimensionality reduction and/or decision graphs will be pursued.


References

J. W. Comley and D. L. Dowe: General Bayesian Networks and Asymmetric Languages, Proceedings of the 2003 Hawaii International Conference on Statistics and Related Fields (HICS 2003), Honolulu, Hawaii, USA, 5-8 June 2003, ISSN: 1539-7211, pp 1-18.

J. W. Comley and D. L. Dowe: Minimum Message Length and Generalised Bayesian Nets with Asymmetric Languages, in P. D. Grunwald, I. J. Myung and M. A. Pitt (eds), Advances in Minimum Description Length: Theory and Applications, MIT Press. To be published 2004.

D. L. Dowe and C. S. Wallace: Kolmogorov complexity, minimum message length and inverse learning, in W. Robb (ed), Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), Queensland, Australia, 6-10 July 1998, p 144.

C. S. Wallace and P. R. Freeman: Single factor analysis by MML estimation, J. Royal Stat. Soc. B, 54(1), pp 195-209, 1992.


More Information

http://www.monash.edu.au/~sanghi [email protected]


Thank You

Any questions?
