MML, inverse learning and medical data-sets Pritika Sanghi Supervisors: A./Prof. D. L. Dowe Dr P. E. Tischer

Post on 20-Dec-2015


MML, inverse learning and medical data-sets

Pritika Sanghi

Supervisors: A./Prof. D. L. Dowe, Dr P. E. Tischer


Overview

What is this project about?
Bayesian Networks and their limitations
Some techniques
    Factor Analysis
    Minimum Message Length (MML)
    Decision Trees & Graphs
    Logistic Regression
Improving Bayesian Networks
What is being done in this project?


What is this project about?

The aim of the project is to enhance Bayesian Networks in general and then apply them to certain medical data-sets.

These data-sets have a large number of attributes and a small number of cases.

This makes it difficult to model these data-sets using Bayesian Networks.


Bayesian Networks

A popular tool for Data Mining.

Model data to infer the probability of a certain outcome.

They represent the frequency distributions for the values that an attribute can take as Conditional Probability Distributions.

P(WS) = 0.75

P(GO) = 0.50

WS  GO  P(S | WS, GO)
T   T   0.01
T   F   0.80
F   T   0.40
F   F   0.99

S   P(A | S)
T   0.95
F   0.00
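As a rough illustration of how such tables are used for inference, the sketch below marginalises over the parents to obtain P(S = T) and then P(A = T). It assumes, from the layout of the tables, that WS and GO are independent root nodes, that S has WS and GO as parents, and that A has S as its only parent. The helper name `prob` is made up for this example.

```python
from itertools import product

# Prior probabilities of the two root nodes being true (values from the slide).
p_ws = 0.75
p_go = 0.50

# P(S = T | WS, GO), keyed by (WS, GO), in the row order of the table.
p_s_given = {
    (True, True): 0.01,
    (True, False): 0.80,
    (False, True): 0.40,
    (False, False): 0.99,
}

# P(A = T | S), keyed by S.
p_a_given = {True: 0.95, False: 0.00}


def prob(p_true, value):
    """Probability that a binary node takes `value`, given P(node = T)."""
    return p_true if value else 1.0 - p_true


# P(S = T) = sum over (ws, go) of P(ws) * P(go) * P(S = T | ws, go).
# Assumes WS and GO are independent of each other.
p_s = sum(
    prob(p_ws, ws) * prob(p_go, go) * p_s_given[(ws, go)]
    for ws, go in product([True, False], repeat=2)
)

# P(A = T) = sum over s of P(s) * P(A = T | s).
p_a = sum(prob(p_s, s) * p_a_given[s] for s in (True, False))

print(f"P(S = T) = {p_s:.4f}")
print(f"P(A = T) = {p_a:.6f}")
```

With the numbers above this gives P(S = T) = 0.4775 and P(A = T) = 0.453625.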


Bayesian Networks - Limitations

When a child node depends on a large number of parent attributes, the conditional probability distribution (CPD) becomes very complex: there are 2^n rows in the CPD for n binary parent attributes.

This makes both creating the CPD and inferring anything from it very time consuming.

A more compact representation for CPDs is required.
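To make the growth concrete, here is a small sketch that counts the rows of a full table for a given number of parents. The function name `cpd_rows` and the `arity` parameter are illustrative only.

```python
def cpd_rows(n_parents, arity=2):
    """Rows in a full conditional probability table: one per joint parent setting."""
    return arity ** n_parents


for n in (2, 5, 10, 20, 30):
    print(f"{n:>2} binary parents -> {cpd_rows(n):>13,} rows")
```

With only a small number of cases available, most of these rows would have no supporting data at all, which is why a compact representation matters for the data-sets considered here.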


Factor Analysis

Multiple attributes may be defined by a common factor.

The Wallace and Freeman model for Single Factor Analysis will be implemented.

This serves as dimensionality reduction.

The validity of the program built will be checked using the data-sets specified in the Wallace and Freeman paper.

Attributes A and B have a common factor F1.

Attributes C, D and E have a common factor F2.


Factor Analysis: Height-Weight of Footy Players

[Figure: three scatter plots. Weight against Height; "Weight": Predicted Weight against Actual Weight; "Height": Predicted Height against Actual Height.]


Factor Analysis

The equation for Single Factor Analysis as defined by Wallace and Freeman is:

x_{nk} = \mu_k + a_k \nu_n + \sigma_k r_{nk}

where
    x_{nk}: data
    \mu_k: mean
    a_k: attribute related term
    \nu_n: record related term
    \sigma_k: standard deviation
    r_{nk}: random variates, N(0,1)

Size    Height   Weight
Large   Tall     Average
Large   Short    Heavy
Medium  Average  Average
Small   Short    Light
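The generative reading of the single factor equation can be sketched as below: each record n gets one factor score, and each attribute k adds its own mean, loading and noise. This is only a simulation of the model, not the MML estimator from the Wallace and Freeman paper. The function name, the parameter values, and the choice to draw the record related term from N(0, 1) are assumptions made for illustration.

```python
import random


def simulate_single_factor(mu, a, sigma, n_records, seed=0):
    """Draw records from x_nk = mu_k + a_k * v_n + sigma_k * r_nk.

    v_n (one factor score per record) is assumed here to be N(0, 1);
    r_nk (one noise term per record and attribute) is N(0, 1).
    """
    rng = random.Random(seed)
    scores, data = [], []
    for _ in range(n_records):
        v_n = rng.gauss(0.0, 1.0)
        row = [
            mu_k + a_k * v_n + sigma_k * rng.gauss(0.0, 1.0)
            for mu_k, a_k, sigma_k in zip(mu, a, sigma)
        ]
        scores.append(v_n)
        data.append(row)
    return scores, data


# Hypothetical parameters for two attributes (height in cm, weight in kg).
mu = [185.0, 85.0]
a = [6.0, 8.0]
sigma = [4.0, 5.0]

scores, data = simulate_single_factor(mu, a, sigma, n_records=5)
for v_n, (height, weight) in zip(scores, data):
    print(f"factor score {v_n:+.2f}: height {height:.1f}, weight {weight:.1f}")
```

A record with a large positive factor score tends to be above average on every attribute with a positive loading, which is the sense in which one factor such as "size" can summarise several attributes.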


The Minimum Message Length (MML) Principle

Models the data as a two-part message: a statement of the hypothesis H, followed by the data D encoded using H.

The best model is the one with the minimum message length.

Minimising the message length is equivalent to maximising the posterior probability of the hypothesis given the data, Pr(H|D). The length of the two-part message is the negative log of the joint probability, -log Pr(H) - log Pr(D|H) = -log Pr(H|D) - log Pr(D), and Pr(D) is the same for every hypothesis.

The message is represented as:

Hypothesis | Data
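A minimal sketch of the two-part comparison, using a made-up example rather than anything from the project: two candidate coin biases with equal prior probability, and a data set of 8 heads and 2 tails. Lengths are measured in bits (log base 2). The function name and all numbers are hypothetical.

```python
import math


def message_length_bits(prior, p_heads, heads, tails):
    """Two-part length: -log2 Pr(H) for the hypothesis, then -log2 Pr(D | H)."""
    hypothesis_part = -math.log2(prior)
    data_part = -(heads * math.log2(p_heads) + tails * math.log2(1.0 - p_heads))
    return hypothesis_part + data_part


# Hypothetical hypotheses: coin bias -> prior probability Pr(H).
hypotheses = {0.5: 0.5, 0.8: 0.5}
heads, tails = 8, 2

lengths = {
    bias: message_length_bits(prior, bias, heads, tails)
    for bias, prior in hypotheses.items()
}
best = min(lengths, key=lengths.get)

for bias, bits in lengths.items():
    print(f"bias {bias}: {bits:.3f} bits")
print(f"shortest message: bias {best}")
```

The fair-coin hypothesis costs 1 bit to state and 10 bits for the data, while the 0.8 hypothesis costs 1 bit to state and fewer bits for the data, so it yields the shorter message.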


Decision Trees and Graphs

Graphical way of representing the output attribute in terms of the input attributes.

Used to model the Conditional Probability Distribution of the Bayesian Network.

Decision graphs are generalisations of decision trees: they merge similar sub-trees.
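One way to picture a decision tree standing in for a conditional probability table is sketched below. The attributes A, B and C, the tree shape and the leaf probabilities are all hypothetical. The point is only that a tree needs a leaf per distinct case rather than a row per joint parent setting.

```python
# Hypothetical tree for P(child = T | A, B, C). Internal nodes are
# (attribute, subtree_if_true, subtree_if_false); leaves are probabilities.
tree = (
    "A",
    0.90,               # A true: B and C make no difference
    ("B", 0.60, 0.05),  # A false: only B matters
)


def lookup(node, assignment):
    """Follow the parent values down the tree until a leaf is reached."""
    while isinstance(node, tuple):
        attribute, if_true, if_false = node
        node = if_true if assignment[attribute] else if_false
    return node


def count_leaves(node):
    """Number of distinct probabilities the tree has to store."""
    if not isinstance(node, tuple):
        return 1
    _, if_true, if_false = node
    return count_leaves(if_true) + count_leaves(if_false)


print(lookup(tree, {"A": False, "B": True, "C": False}))
print(f"{count_leaves(tree)} leaves instead of {2 ** 3} table rows")
```

A decision graph would go one step further and let several branches share a single leaf or sub-tree when their distributions are similar.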


Logistic Regression

Mathematical modelling approach used for describing the dependence of a variable on other attributes.

Will be used to define the probability of a discrete target attribute as a function of continuous attributes.

f(z) = 1 / (1 + e^{-z}) + c
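A minimal sketch of how the logistic function turns continuous attributes into a probability for a binary target. It assumes z is a linear combination of the attributes, and it exposes the constant c from the formula as a parameter that defaults to 0 so that the output stays between 0 and 1. The weights, the bias and the function names are hypothetical.

```python
import math


def logistic(z, c=0.0):
    """f(z) = 1 / (1 + e^(-z)) + c, with the offset c defaulting to 0."""
    return 1.0 / (1.0 + math.exp(-z)) + c


def predict_probability(weights, bias, x):
    """P(target = T | x) for continuous attributes x, assuming a linear score z."""
    z = bias + sum(w * x_i for w, x_i in zip(weights, x))
    return logistic(z)


# Hypothetical coefficients for two continuous attributes.
weights, bias = [0.8, -0.5], 0.1
print(predict_probability(weights, bias, [1.2, 0.4]))
```

A score of z = 0 maps to a probability of 0.5, and large positive or negative scores push the probability towards 1 or 0.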


Improving Bayesian Networks

Comley and Dowe (2003, 2004), building on ideas from Dowe and Wallace (1998), began the work of enhancing Bayesian Networks and introduced Generalised Bayesian Networks.

This project will extend their work by applying some of the techniques described earlier to Bayesian Networks.


What is being done in this project?

Refinement to Generalised Bayesian Networks. Specifically:
    First, MML Single Factor Analysis will be added to Bayesian Networks.
    Then, Logistic Regression will be looked into.

The Generalised Bayesian Networks will then be used to infer models from some medical data-sets such as breast cancer data-sets.

If time permits, which it almost definitely won’t, other methods of dimensionality reduction and/or decision graphs will be pursued.


References

J. W. Comley and D. L. Dowe: General Bayesian Networks and Asymmetric Languages, Proceedings of the 2003 Hawaii International Conference on Statistics and Related Fields (HICS 2003), Honolulu, Hawaii, USA, 5-8 June 2003, ISSN: 1539-7211, pp. 1-18.

J. W. Comley and D. L. Dowe: Minimum Message Length and Generalised Bayesian Nets with Asymmetric Languages, in P. D. Grunwald, I. J. Myung and M. A. Pitt (eds), Advances in Minimum Description Length: Theory and Applications, MIT Press. To be published 2004.

D. L. Dowe and C. S. Wallace: Kolmogorov complexity, minimum message length and inverse learning, in W. Robb (ed), Proceedings of the Fourteenth Biennial Australian Statistical Conference (ASC-14), Queensland, Australia, 6-10 July 1998, p. 144.

C. S. Wallace and P. R. Freeman: Single factor analysis by MML estimation, J. Royal Stat. Soc. B, 54(1), 195-209, 1992.


More Information

http://www.monash.edu.au/~sanghi


Thank You

Any questions?