
University of Western Sydney School of Computing and Mathematics

A Framework for Temporal Abstractive Multidimensional Data Mining

Heidi Bjering Stratti

A dissertation submitted in fulfilment of the requirements of

Master of Science (Hons)

2008

Acknowledgements

There are many people who deserve acknowledgement for their role in this thesis

becoming a reality. Juggling research while working full time and being there for my

two children has been quite a difficult balancing act which I have not always

balanced very well. This last year has been a very difficult year for my family and

me, and I had many moments where I was unable to see how I would manage to

finish this research work.

First and foremost, the person who has had the most impact in this thesis becoming a

reality, despite being located in Canada, is my principal supervisor Dr Carolyn

McGregor. I could not have done this without her fantastic guidance, encouragement

and support. She got me through to the finish line during what has been one of my

toughest years. When I was feeling at my lowest and was completely discouraged,

she managed within an hour on Skype to get me going again. Never once a negative

word (even when I thoroughly deserved it). I would like to thank Carolyn for

providing me with academic guidance as well as friendship through these last couple

of years.

I also wish to thank my co-supervisor, Dr Mark Tracy, who came on board when

Carolyn relocated to Canada.

My fantastic sons, Robbie and Daniel, keep on impressing me with their take on life.

I wish to thank them for their support and putting up with me while I have been

doing this work.

I also would like to thank Darren for being by my side through this very challenging

year, for putting up with all my stress, and for the many dinners he has cooked for

me.

My friend Alphia has always been there ready to listen when I needed to talk. I thank

Alphia for her friendship, support and encouragement – and the many refreshing

walks we have had to clear our heads.

Statement of Authentication

The work presented in this thesis is, to the best of my knowledge and belief, original

except as acknowledged in the text. I hereby declare that I have not submitted this

material, either in full or in part, for a degree at this or any other institution.

…………………………………………………………

Heidi Bjering Stratti


Table of Contents

Acknowledgements
Statement of Authentication
Table of Contents
List of Tables
Abstract
1. Chapter 1 – Introduction
  1.1 Why is temporal abstractive data mining important to health and medicine
  1.2 Research motivation
  1.3 Research aims and objectives
  1.4 Contribution to knowledge
  1.5 Research method
  1.6 Thesis overview
2. Chapter 2 – Literature Review
  2.1 Introduction
  2.2 Knowledge Discovery in Data
  2.3 KDD and Intelligent Data Analysis
  2.4 Data Mining
  2.5 Application of Data Mining within Medicine
  2.6 Temporal Abstraction
  2.7 Data Mining and Temporal Abstraction
    2.7.1 Review Method
    2.7.2 Review Results
  2.8 Conclusions and impact on future research
3. Chapter 3 – The NICU environment
  3.1 The Australian Context
  3.2 Admittance to NICU
  3.3 Medical Devices and NICU monitoring
  3.4 Alerts
  3.5 Known Physiological Onset Predictors
  3.6 Similarities between NICU and other environments
4. Chapter 4 – Solution Manager Service and e-Baby Architecture
  4.1 e-Baby Architecture
  4.2 Solution Manager Service
  4.3 Conclusion and Implication on this Research
5. Chapter 5 – Design of methodology for multidimensional TA DM framework (TAMDDM)
  5.1 Multi Agent System
    5.1.1 Database Access Server
    5.1.2 Processing Agent
    5.1.3 Temporal Agent
    5.1.4 Relative Agent
    5.1.5 Functional Agent
    5.1.6 Rules Generating Agent
  5.2 Extended CRISP-DM Model
    5.2.1 Data Understanding
    5.2.2 Data Preparation
    5.2.3 Modeling
    5.2.4 Modelling: DM Rule-Set Generation and Select Significant Rule-Set
    5.2.5 Modelling: Formulate Null Hypothesis
    5.2.6 Modelling: Run Statistical Processes to test Null Hypothesis
    5.2.7 Evaluation: Load Accepted Rule-Sets into IDSS
  5.3 TAMDDM Framework Tasks
    5.3.1 Local Collection and Cleanup
    5.3.2 Temporal Abstractions, Simple & Complex, Multi-Stream
    5.3.3 Relative Alignment
    5.3.4 Exploratory Data Mining across multiple streams for multiple patients
    5.3.5 Confirmatory Data Mining with Null Hypothesis
    5.3.6 Hypothesis/Rule generation
  5.4 Data storage
    5.4.1 Clinical Data
    5.4.2 Diagnosis
    5.4.3 Patient Diagnoses
    5.4.4 Physiological Data
    5.4.5 Physiological Parameter
    5.4.6 Temporal Rules
    5.4.7 Temporal Data
    5.4.8 Relative Rule
    5.4.9 Relative Temporal Abstractions
    5.4.10 RuleBase Data
  5.5 Conclusion
6. Chapter 6 – Demonstration within the NICU context
  6.1 Processing Agent
    6.1.1 Mapping external source data to DBAS data structure
      6.1.1.1 Mapping external source table Baby to DBAS table Baby
      6.1.1.2 Mapping external source table Diagnosis to DBAS table BabyConditions
      6.1.1.3 Mapping external source table Condition to DBAS table Conditions
      6.1.1.4 Mapping external source flat file physiologicalData to DBAS table PhysData
    6.1.2 Mapping from DBAS data structure to internal TAMDDM data stores
      6.1.2.1 Mapping DBAS table Baby to TAMDDM table Patient
      6.1.2.2 Mapping DBAS table PhysData to TAMDDM table PatientPhysiologicalParameter
      6.1.2.3 Mapping DBAS table MFC to TAMDDM table PhysiologicalParameter
      6.1.2.4 Mapping DBAS table BabyCondition to TAMDDM table PatientDiagnosis
      6.1.2.5 Mapping DBAS table Conditions to TAMDDM table Diagnosis
  6.2 Temporal Agent
  6.3 Relative Agent
  6.4 Functional Agent
  6.5 Rules Generating Agent
  6.6 Future Research Application
  6.7 Issues associated with Case Study
  6.8 Impact of Demonstration
  6.9 Conclusion
7. Chapter 7 – Conclusion
  7.1 Summary
  7.2 Contributions
  7.3 Limitations and further research
  7.4 Impact
  7.5 Conclusion
References


List of Tables

Table 2.1: Clinical Context
Table 2.2: Data Mining and Temporal Abstraction Technique
Table 2.3: Clinical knowledge and Null Hypothesis testing
Table 3.1: NSW/ACT Hospital Classification of NICU Services
Table 3.2: Apgar score
Table 3.3: Component Monitoring System (CMS) data types
Table 6.1: Mapping of Baby table (source) to Baby table (DBAS)
Table 6.2: Mapping of Diagnosis table (source) to BabyConditions table (DBAS)
Table 6.3: Mapping of Condition table (source) to Conditions table (DBAS)
Table 6.4: Mapping of physiological data in flat file (source) to PhysData table (DBAS)
Table 6.5: Mapping of Baby table (DBAS) to Baby table (TAMDDM)
Table 6.6: Mapping of physiological data PhysData table (DBAS) to PatientPhysiologicalParameter table (TAMDDM)
Table 6.7: Mapping of MFC table (DBAS) to PhysiologicalParameter table (TAMDDM)
Table 6.8: Mapping of BabyCondition (DBAS) to PatientDiagnosis (TAMDDM)
Table 6.9: Mapping of Conditions table (DBAS) to Diagnosis table (TAMDDM)
Table 6.10: Raw SaO2 readings
Table 6.11: Abstractions created from all SaO2 readings in Table 6.10
Table 6.12: Blood pressure values
Table 6.13: Abstractions created from all blood pressure readings in Table 6.12
Table 6.14: Complex Abstractions
Table 6.15: Temporal Abstractions using absolute start and end times
Table 6.16: relative temporal abstractions of abstractions from Table 6.11, using diagnosis time of 20061116_13:29:02.001 as the event of interest for realignment of the patients data


List of Figures

Figure 1.1: Elements of constructive research
Figure 2.1: Overview of the steps constituting the KDD process (Fayyad, Piatetsky-Shapiro et al. 1996)
Figure 2.2: The phases of the CRISP-DM reference model (CRISP-DM 2000)
Figure 2.3: The six-step DMKD process model (Cios and Moore 2002)
Figure 2.4: Relationship between clinical research and clinical management
Figure 2.5: The Thirteen Possible Relationships (Allen 1983; Allen 1984)
Figure 3.1: NICU physiological monitor (Neonatology on the Web)
Figure 4.1: The e-Baby Architecture (McGregor, Heath et al. 2005)
Figure 4.2: Solution Manager Service
Figure 5.1: The TAMDDM Framework
Figure 5.2: Solution Manager Service (SMS)
Figure 5.3: Multi-agent framework (Foster 2008)
Figure 5.4: The extended multi agent system
Figure 5.5: The multi agent layer in the TAMDDM framework
Figure 5.6: Processing Agent
Figure 5.7: Temporal Agent
Figure 5.8: Relative Agent
Figure 5.9: Functional Agent
Figure 5.10: Rules Generating Agent
Figure 5.11: Parallelism between CRISP-DM and the Scientific Method (Heath 2006)
Figure 5.12: CRISP-DM extended for Clinical Practice and Research as proposed by Heath's thesis (2006)
Figure 5.13: Extended CRISP-DM model in the TAMDDM framework
Figure 5.14: Data Understanding
Figure 5.15: Data Preparation
Figure 5.16: Modelling: DM Rule-set Generation and Select Significant Rule-set phases
Figure 5.17: Formulate Null Hypothesis phase
Figure 5.18: Run Statistical Processes to test Null Hypothesis
Figure 5.19: Load Accepted Rule-sets into IDSS phase
Figure 5.20: Local Collection and Cleanup
Figure 5.21: Temporal Abstractions
Figure 5.22: Relative Alignment
Figure 5.23: Data before relative alignment
Figure 5.24: Data after relative alignment
Figure 5.25: Exploratory Data Mining
Figure 5.26: Confirmatory Data Mining with Null Hypothesis
Figure 5.27: Hypothesis/Rule generation
Figure 5.28: TAMDDM data store
Figure 5.29: Patient Table
Figure 5.30: Diagnosis Table
Figure 5.31: PatientDiagnosis Table
Figure 5.32: PatientPhysiologicalParameter Table
Figure 5.33: PhysiologicalParameter Table
Figure 5.34: TA_Rule Table
Figure 5.35: TemporalAbstraction Table
Figure 5.36: Study Table
Figure 5.37: TA_RelativeTime Table
Figure 6.1: Result database structure passed from DBAS to agents in the multi-agent system (Foster 2008)
Figure 6.2: Structure of the source data regarding babies
Figure 6.3: Structure of the source physiological data
Figure 6.4: modified BabyCondtions table
Figure 6.5: Graphing of SaO2 values against a threshold of 90%
Figure 6.6: Graphing of blood pressure values against a threshold of 24mm/Hg
Figure 6.7: Complex Abstraction
Figure 6.8: realignment of abstracted parameters relative to diagnosis


Abstract

In the industrialised world, premature birth has been recognised as one of the most

significant perinatal health issues (Kramer, Platt et al. 1998). In Australia 8.1% of

babies are born before 37 weeks gestation (Laws, Abeywardana et al. 2007).

Premature babies often have prolonged stays in Neonatal Intensive Care Units

(NICUs) and can suffer from a number of different conditions during their stay.

Some of these conditions have been shown to produce variations in the baby's

physiological parameters that can indicate the onset of the condition before it can

be detected by other means. Medical monitoring equipment produces large volumes of

data, which makes manual analysis of this data impossible. Adding to the

complexity of the large datasets is the nature of physiological monitoring data – the

data is multidimensional, where it is not only changes in individual dimensions that

are significant, but sometimes simultaneous changes in several dimensions. As the

time-series produced by the monitoring equipment is temporal, there is a need for

clinical research frameworks that enable both the dimensionality and temporal

behaviour to be preserved during data mining.

The aim of this research is to extend previous research that proposed a framework to

support analysis and trend detection in historical data from Neonatal Intensive Care

Unit (NICU) patients. The extensions contribute to fundamental data mining

framework research through the integration of temporal abstraction and support of

null hypothesis testing within the data mining processes. This new data mining

approach is applied to analyse level shifts and trends in historical temporal

data and to cross-correlate data mining findings across multiple data streams for

multiple neonatal intensive care patients in an attempt to discover new hypotheses

indicative of the onset of some condition. These hypotheses can then be evaluated

and defined as rules to be applied in the monitoring of neonates in real-time to enable

early detection of possible onset of conditions. This can assist in faster decision

making which in turn may avoid conditions developing into serious problems where

treatment may be futile.

This research employs a constructive research method. In this research, the problem

is the inability of current data mining frameworks to completely support clinical

research in multidimensional temporal data. This research has resulted in the design

of a temporal abstraction multidimensional data mining (TAMDDM) framework

suitable for clinical research in multidimensional temporal time series data. The

framework is demonstrated through a case study with neonatal intensive care

monitoring data.


1. Chapter 1 – Introduction

This thesis presents a framework to enable multi-dimensional data mining on time

series data that exhibits temporal behaviours. This research is demonstrated through

a case study utilising physiological time series data streams, together with other

clinical data streams collected from patients in Neonatal Intensive Care Units

(NICUs) for the detection of trends and patterns in multi-dimensional stored

physiological data. The purpose of detecting these trends and patterns is to recognise

indicators for the onset of some condition in the neonate, to enable the creation of

hypotheses that can be transformed into rules suitable for use in intelligent

monitoring systems.

The large volumes of medical data render traditional manual data

analysis inadequate (Lavrac 1999). Data mining is the process of analysing large

amounts of data to find new patterns and relationships within the data, by using

techniques such as statistics, machine learning and pattern recognition. Multi-dimensional

data is data that consists of more than one variable. In physiological data

streams, multi-dimensional data means that for one patient there are values for

several different data streams, for example arterial oxygen saturation AND blood

pressure. Each of these data streams separately represents a single dimension. When

combined together the data becomes multi-dimensional. Temporal abstraction (TA)

is a technique used to summarise time series data to a higher level while preserving

context and time, usually adding qualitative information such as states and trends to a

particular abstraction. Sections of data for which a certain criterion holds true, such as

high (or low), can be summarised with a start time and end time for when this criterion

is true, as well as a label (high or low) to describe the abstraction. This is particularly


relevant for a NICU setting, where it is usually trends or changes in the physiological

data over time, sometimes across multiple parameters, which are significant when

analysing and predicting patient conditions.
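The simple state abstraction described above can be sketched in a few lines of Python. This is a minimal illustration only and not the implementation used in the TAMDDM framework: the function name `abstract_states`, the `Abstraction` record, the labels "low" and "normal", integer time stamps and the sample readings are all assumptions made for the example. The 90% threshold is borrowed from the caption of Figure 6.5.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Abstraction:
    """One temporal abstraction: a label plus the interval over which it holds."""
    label: str
    start: int  # time stamp of the first reading in the interval
    end: int    # time stamp of the last reading in the interval


def abstract_states(readings: List[Tuple[int, float]],
                    threshold: float) -> List[Abstraction]:
    """Summarise time-ordered (time, value) readings into maximal labelled runs.

    Hypothetical labelling rule: a value below the threshold is "low",
    anything else is "normal".
    """
    result: List[Abstraction] = []
    for time, value in readings:
        label = "low" if value < threshold else "normal"
        if result and result[-1].label == label:
            # Same state as the previous reading: extend the current interval.
            result[-1] = Abstraction(label, result[-1].start, time)
        else:
            # State changed (or first reading): open a new interval.
            result.append(Abstraction(label, time, time))
    return result


# Illustrative SaO2 readings (made-up values), one per time unit.
sao2 = [(0, 95), (1, 93), (2, 88), (3, 87), (4, 89), (5, 92)]
for abstraction in abstract_states(sao2, threshold=90):
    print(abstraction)
```

Six raw readings are reduced to three labelled intervals, each keeping its start and end time, which is the sense in which abstraction summarises the data while preserving time and context.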

The remainder of this chapter is structured as follows: first, temporal abstractive data

mining and its importance to health and medicine is discussed, before the motivation

for the research presented in this thesis is described. The discussion on motivation

for this research leads to the section outlining the research aims and objectives by

listing the research hypotheses for this thesis, before listing the contributions to

knowledge that have resulted from this work. The research method utilised for this

research is presented, before the overview of the content of this thesis concludes the

chapter.

1.1 Why is temporal abstractive data mining important to health and medicine

The medical industry generates large amounts of medical data, of which very little is

used for extracting useful information (Hanson 2006). For example, modern

equipment in intensive care units (ICUs) produces vast amounts of data from monitors

connected to patients (Horn 2001). The situation is the same in NICUs where the

focus of this research lies. Babies in a NICU usually have a range of monitoring and

life support devices attached to them. These devices produce large amounts of data

which can be essential in deciding on treatment options. However, the amount of

data makes it very hard for the neonatologist to extract useful information. The data

is displayed on monitors, often in waveforms, and a static ‘picture’ is recorded by

staff at regular intervals. Between manual recordings there can be small changes

to the data which are never noted. These changes may be important in being able to

predict the onset of some condition. Sepsis, a common illness in neonates, has been


shown to exhibit changes in physiological data before the condition can be diagnosed

through blood cultures (Griffin and Moorman 2001; Griffin, O'Shea et al. 2003;

Griffin, O'Shea et al. 2004; Griffin, Lake et al. 2005; Griffin, Lake et al. 2007). This

indicates that subtle changes that may not be apparent through the current practice of

manual recording of the physiological data at regular intervals can be important in

detecting the onset of conditions in neonates. There could also be indicators which

manifest across multiple parameters and together indicate the onset of some condition.

This situation has created the need for systems aimed at clinical management to help

analyse the data produced by the monitoring and life support devices connected to

the babies. These systems look for certain trends or patterns that have previously

been defined, often by experts in the field. However, “human-defined rules risk

capturing the biases of one expert” (Lavrac, Keravnou et al. 2000); therefore clinical

research on historical data is needed to find new, previously undiscovered trends and

patterns. Modern technology allows this data to be used as input in processes to

attempt to derive information and new knowledge, known as knowledge discovery in

data (KDD).

“The gap between data generation and data comprehension is widening in all

fields of human activity. In medicine, overcoming this gap is particularly

crucial since medical decision making needs to be supported by arguments

based on basic medical knowledge as well as knowledge, regularities and

trends extracted from data.” (Lavrac, Keravnou et al. 2000)

There is a need for frameworks to facilitate clinical research on stored historical

physiological patient monitoring data to enable the discovery of previously unknown

trends and patterns that may be indicative of the onset of some condition. Research in

this kind of data brings challenges: the volume of the data is massive; the data is

multidimensional, as for each patient there are several parameters which cannot be

considered in isolation from each other; the data is temporal, which means that

individual data values do not necessarily provide much meaning, but when

these values are considered in relation to their neighbouring values they can provide

meaning. Therefore each individual value must be considered in both time and

context. Opportunities exist for the exploration of data to determine the existence of

pre-onset behaviours for multiple conditions and diagnoses.

1.2 Research motivation

As will be shown in the literature review in chapter two, there is an absence of

flexible multidimensional approaches to data mining of time series data. Chapter two

also discusses the need for representation of temporal behaviour in this type of data

to enable this temporal behaviour to be preserved in the mining process, as often

individual data values by themselves do not have much meaning; however, when

considered over time and in context, meaning can be derived.

Current monitoring systems used in NICUs are not capable of monitoring cross

correlated temporal rules in multiple data streams. Current research by Stacey and

McGregor (McGregor and Stacey 2007; Stacey, McGregor et al. 2007) in this area is

showing the possibilities of such systems becoming a reality. For these systems to be

effective, cross correlated rules for multiple parameters must be created for use by

the alarming component of the system. Currently such rules are created by domain

experts; however, there is potential for as yet undiscovered rules to be hiding in the vast

amounts of data produced by the monitoring equipment connected to the patient. To

enable creation of hypotheses that can be turned into such rules, there is need for a

framework that enables clinical research in this type of multi-dimensional temporal

data, where the rigour necessary for medical research, including null-hypothesis

testing, can be accommodated.

With a reduction in storage cost and a corresponding increase in the ability to collect

and store temporal data through real-time clinical monitoring, there comes the

opportunity to analyse collected data over time (Moskovitch 2007). This is

especially significant in clinical environments, where individual data elements in

clinical records may not be meaningful outside of a particular temporal context.

However, when considered over time and context, the values and their inter-

relationships can become significant. This is particularly true in acute care settings,

such as neonatal care, where it is usually trends or changes over time, sometimes

across multiple parameters, which are significant when predicting the onset of future

patient conditions. As an added complexity, patients who have the same condition

may have substantially different types and timing of observations, unlike retail data

which generally have comparable complements of data elements obtained at similar

times (Harrison Jr 2008).

1.3 Research aims and objectives

The issues of clinical research in physiological data together with a review of the

current functionality of data mining introduced in the two previous sections led to the

creation of 5 research hypotheses. The research hypotheses of this thesis are that:

1. A multidimensional data mining (MDDM) framework can be defined for

clinical research to discover trends and patterns indicative of the onset of

some condition.

2. The abovementioned framework will include methods for applying temporal

abstraction (TA) across multiple parameters for multiple patients to enable

mining of multi-dimensional temporal data

3. The TAMDDM framework can be applied in a neonatal context

4. The TAMDDM framework can support null hypothesis testing

5. The hypotheses generated by the framework can be used by a real-time event

stream processor analysing the current condition of babies in a Neonatal

Intensive Care Unit (NICU)

1.4 Contribution to knowledge

The areas of research contribution to knowledge resulting from this thesis are:

• Extensions to a multi agent framework previously designed for analysing time series data, to facilitate temporal abstraction and realignment of

these abstractions (as presented in chapter 5)

• Incorporation of the extended CRISP-DM model within the multi agent framework (as presented in chapter 5).

• Design of a framework to enable temporally abstractive multi dimensional data mining (as presented in chapter 5)

• Enhancement of the interaction between clinical research and clinical management by generating a framework for clinical research which can

produce hypotheses that will feed into intelligent monitoring systems used

in clinical management. The clinical research framework uses as input

data from the various monitoring equipment used in clinical management

processes (as presented in chapter 5 and chapter 6)

1.5 Research method

For this research a constructive research method is used. This is a research method

widely used in computer science disciplines, information systems, management

accounting and medical domains (Kasanen 1993; Curry 2000; Shaw 2001; Shapiro

2003). The key in constructive research is the development of a new construct in

response to an explicit problem (Lassenius, Soininen et al. 2001). The construct,

which can be a new model, software, theory, framework or algorithm, is tested for

usability and enables theoretical conclusions to be made. The aim is to take a real

world practical problem and produce a real world solution.

Lassenius et al (2001) describe 6 phases of the constructive research method. These

phases can be iterative and recursive:

1. Find a practically relevant problem

2. Obtain an understanding of the topic and the problem

3. Innovate, i.e., construct a solution idea

4. Demonstrate that the solution works

5. Show theoretical connections and research contribution

6. Examine the scope of applicability

For this research the practically relevant problem is the need to be able to use stored

clinical data across multiple parameters to identify trends and level shifts that can be

relevant in early recognition of problems for patients, in this case neonates.

Obtaining an understanding of the topic and the problem is achieved by reviewing

literature in the area of knowledge discovery in databases, data mining in medicine

and temporal abstraction. Designing a framework for the temporal abstraction and

data mining of multidimensional parameters completes the phase of constructing a

solution idea. Applying the framework tasks to examples from a neonatal intensive

care unit (NICU) demonstrates that the solution can be applied in a NICU setting.

Finally the thesis will show the theoretical connections and research contributions

made, as well as examine the scope of applicability in other areas.

Figure 1.1: Elements of constructive research (diagram labels: Construction, problem solving; Practical relevance; Theory connection; Practical functioning; Theoretical contribution)

The goal of constructive research, as described by Lassenius et al (2001), is to

produce “Innovative constructs, intended to solve

problems faced in the real world and, by that means, to make a contribution to the

theory of the discipline in which it is applied” (Lassenius, Soininen et al. 2001).

1.6 Thesis overview

Chapter 2 presents a literature review of the areas of influence for this thesis, mainly

knowledge discovery in databases (KDD), data mining and temporal abstraction. The

chapter explores these areas in their application to medical systems in particular, to

expose the open health informatics research areas that led to the formation of the

research hypotheses addressed by the techniques proposed in this research. In chapter

3, further background is provided by describing the neonatal intensive care unit

(NICU) environment which provides the setting for the Temporal Abstractive

MultiDimensional Data Mining (TAMDDM) framework designed in this thesis.

Before describing the design of the TAMDDM framework, chapter 4 gives an

introduction to the previous research that this research builds on, the e-Baby project

and the solution manager service (SMS), which contains the analytical processor

where the TAMDDM framework will reside. Chapter 5 describes the TAMDDM framework

components, including the multi-agent system and its extensions, the extended

CRISP-DM model and the TAMDDM tasks. Chapter 6 demonstrates

how the TAMDDM framework can be used for conducting clinical research within

the NICU domain. Chapter 7 provides the conclusion to the thesis, summarising the

contributions of this work and providing direction for future research.

2. Chapter 2 – Literature Review

2.1 Introduction

The main motivation for the research in this thesis is the identified gap that exists

between clinical management and clinical research (Foster, McGregor et al. 2005).

This research is particularly interested in the intensive care unit (ICU) environment

where observation of the patient’s condition is supported through the provision of

several physiological time series data streams, such as heart rate and mean blood

pressure, via medical monitors. There exists the possibility of discovering new knowledge

that may exist in patient data, which can indicate the onset of some condition such as

sepsis (a blood poisoning condition).

Patient data, in particular monitoring data, is inherently temporal. Based on the

research hypotheses presented in Chapter 1, this literature review chapter will mainly

focus on temporal abstraction and data mining, which are part of the overall

knowledge discovery in data (KDD) process. Data mining is the chosen discovery

technique, and as patient data, in particular monitoring data, is naturally temporal, a

technique is needed to preserve the temporal aspect of the data when it is mined.

Therefore temporal abstraction forms part of the preprocessing step.

The remainder of this chapter is structured as follows. First the area of Knowledge

Discovery in Data is introduced. The relationship between Knowledge Discovery in

Data and Intelligent Data Analysis is presented. The data mining component within

Knowledge Discovery in Data is defined and the issues exposed by other researchers

relating to the application of Data Mining within the domain of medicine are

discussed. The concept of temporal abstraction is introduced. Previous researchers’

use of temporal abstraction to support the preprocessing within data mining is

presented and analysed to uncover open research areas that led to the creation of the

research hypotheses for this thesis.

2.2 Knowledge Discovery in Data

Although the area of medical informatics is a relatively new research area,

application of analytical techniques in medicine has a long history. In the mid

eighteen hundreds researchers collected data and looked for patterns in an attempt to

provide some causal understanding of the pandemic in London at the time (Brown

2008).

Currently, the ability to gather and store data is steadily increasing, resulting in

“data rich times” (Brown 2008). Efficient techniques are needed to enable the

analysing and understanding of these resources. The possibility of the discovery of

hidden knowledge in this massive amount of data is driving the development of the

field of Knowledge Discovery in Databases (KDD) and data mining. Fayyad and

Stolorz (1997) define data mining as “the application of specific algorithms for

extracting patterns from data”. To enable knowledge to be derived from the data, a

larger process needs to envelop the data mining step. This process includes data

preparation and selection, cleaning and preprocessing of the data, incorporating any prior

knowledge about the data into the process, as well as interpretation of the mining

results. KDD refers to this overall process of discovering knowledge in data (Figure

2.1) where data mining is one of the steps, using specialised tools for this task

(Holmes and Peek 2007). Holmes and Peek (2007) state that KDD is a “process

where the data are used for hypothesis generation”, and that it provides a framework to

avoid “fishing expeditions”, also called data dredging.

Figure 2.1: Overview of the steps constituting the KDD process (Fayyad, Piatetsky-Shapiro et al.

1996)

With the large amounts of data collected by various applications and processes,

manual data analysis and interpretation is not feasible, can be very subjective and is

impractical as data volumes grow exponentially (Fayyad, Piatetsky-Shapiro et al.

1996). As Fayyad et al (1996) point out, “the true value of such data lies in the

users’ ability to extract useful reports, spot interesting events and trends, support

decisions and policy based on statistical analysis and inference, and exploit the data

to achieve business, operational, or scientific goals”. These user goals were a driver

behind the development of the KDD research area in the late eighties and early

nineties. The term KDD was conceived in 1989 to “emphasize that ‘knowledge’ is

the end product of a data-driven discovery” (Fayyad and Stolorz 1997). Knowledge

Discovery in Databases (KDD) is sometimes also referred to as Knowledge

Discovery in Data (Heath 2006).

Fayyad et al (1996) define the KDD process as: “The nontrivial process of

identifying valid, novel, potentially useful, and ultimately understandable patterns in

data.” Here, pattern is referring to a subset of the data, or a model that is relevant to

that subset and should be valid for new data, and process is indicating several

iterative and interactive steps: data preparation, search for patterns, knowledge

evaluation, refinement.

Fayyad et al (1996) define 9 steps in the KDD process:

1. Learning the application domain

2. Creating a target dataset

3. Data cleaning and preprocessing

4. Data reduction and projection

5. Choosing the function of data mining

6. Choosing the data mining algorithm(s)

7. Data mining

8. Interpretation

9. Using discovered knowledge

In their book Data mining: A knowledge discovery approach (Cios, Swiniarski et al.

2007), the authors stress that knowledge discovery focuses on the whole process,

including before and after modeling, rather than just the data mining modeling part.

They call this the Knowledge Discovery Process (KDP), “a process that seeks new

knowledge about an application domain” (Cios, Swiniarski et al. 2007). They cite

Fayyad’s 9 step KDD model above as the leading research model for KDD, and the

CRISP-DM (Cross-Industry Standard Process for Data Mining) as the leading

industrial model. The CRISP-DM model consists of six phases (Figure 2.2):

1. Business Understanding

2. Data understanding

3. Data preparation

4. Modeling

5. Evaluation

6. Deployment

Figure 2.2: The phases of the CRISP-DM reference model (CRISP-DM 2000)

As Figure 2.2 illustrates through the outer circle with clockwise arrows, the

CRISP-DM model is iterative. The model is described at four levels of abstraction, namely:

1. Phase

2. Generic tasks

3. Specialised tasks

4. Process instance

Generic tasks are tasks that should be done for any data mining circumstance. The

specialised tasks illustrate how the generic tasks should be done in a particular

situation, while the process instance level records what occurred in a particular

deployment of a particular phase of the process model (CRISP-DM 2000).

CRISP-DM was developed in 1996, and the goal was to be “industry-, tool- and

application-neutral” (CRISP-DM 2000). A step-by-step data mining guide is

available at http://www.crisp-dm.org. The guide provides instructions for each level

of task within each phase (CRISP-DM 2000).

Cios and Moore (2002) discuss the six step DMKD process model (Figure 2.3)

which is an extension to the CRISP-DM, and is described in a later book on data

mining and knowledge discovery (Cios, Swiniarski et al. 2007) as a hybrid model,

combining facets from the CRISP-DM model and academic models.

Figure 2.3: The six-step DMKD process model (Cios and Moore 2002)

Cios et al (2007) present a comparison table comparing 5 KDD models. The

comparison demonstrates that although the number and names of steps in each model

vary, they all include understanding of the domain, preprocessing of the data, data

mining and evaluation. Data preparation is identified in all the models as the most

time consuming part of any KDD model.

KDD systems draw on research from a variety of fields, some of which are

databases, machine learning, pattern recognition, statistics, artificial intelligence and

reasoning, data visualisation and high performance computing (Fayyad, Piatetsky-

Shapiro et al. 1996). It is beyond the scope of this thesis to cover all the areas of

KDD. The main focus of this thesis is in the area of data preprocessing and data

mining, two of the steps in the overall process of KDD.

2.3 KDD and Intelligent Data Analysis

Some authors use the term Intelligent Data Analysis (IDA) when discussing the

KDD processes. According to Stacey and McGregor (2007), “KDD is primarily

concerned with learning new knowledge whereas IDA is directed toward application

of knowledge for data interpretation”. That is, IDA is utilising existing knowledge to

look for the existence of instances of that knowledge and take action for those

instances. For the research presented in this thesis, the primary interest is in

developing a framework for aiding the discovery of new knowledge, integrating the

benefits of temporal abstraction and data mining in this discovery process. However,

it is also a goal that this new knowledge can be utilised in the intelligent data analysis

of real time streaming patient data; therefore, any rules developed from hypotheses

created by the framework described in chapter 5 should feed back into the IDA

process described by Stacey et al (McGregor and Stacey 2007; Stacey, McGregor et

al. 2007), as depicted in Figure 2.4 below, showing the relationship between clinical

research and clinical management.

Figure 2.4: Relationship between clinical research and clinical management (diagram components: monitoring devices; physiological data; data validation; temporal abstraction; inference engine; alerts; stored physiological and clinical data; stored abstractions; data mining; hypotheses; domain knowledge; rule creation; temporal abstraction rules, inference rules and alert rules)

2.4 Data Mining

One of the common parts in the various KDD processes discussed in the previous

section is the area of data mining. Data mining is the activity used in KDD for the

actual discovery of patterns and relationships in data. “Data mining involves fitting

models to or determining patterns from observed data. The fitted models play the role

of inferred knowledge. Deciding whether or not the models reflect useful knowledge

is part of the overall interactive KDD process for which subjective human judgement

is usually required” (Fayyad, Piatetsky-Shapiro et al. 1996).

Data mining is used for many different problem types. The CRISP-DM data mining

guide (CRISP-DM 2000) offers the following list of problem types:

- Data description and summarisation

- Segmentation

- Concept descriptions

- Classification

- Prediction

- Dependency analysis

A variety of techniques are available to solve the different types of problems (CRISP-DM 2000):

- Clustering techniques

- Neural nets

- Visualization techniques

- Rule induction methods

- Conceptual clustering

- Discriminant analysis

- Decision tree learning

- K-nearest neighbor

- Case-based reasoning

- Genetic algorithms

- Regression analysis

- Regression trees

- Box-Jenkins methods

- Correlation analysis

- Association rules

- Bayesian networks

- Inductive logic programming

Several techniques can be suitable for a particular type of problem (e.g., discriminant

analysis, rule induction methods, decision tree learning, neural nets, k nearest

neighbor, case-based reasoning, and genetic algorithms could all be appropriate

techniques to use for a classification problem (CRISP-DM 2000)). Similarly, a

particular technique can be suitable for more than one type of data mining problem.

Neural nets, for example, are a suitable technique for segmentation, classification and

prediction problems (CRISP-DM 2000).

Data mining techniques are often described as supervised or unsupervised learning

(Bath 2004). Supervised learning involves using a training set of the data that

includes the ‘answer’ to the classification, before using a test data set without the

‘answer’ to be classified. Unsupervised learning gives no information to the data

mining tool about the classification; the data mining technique used creates the

classifications based on the data it is exposed to.

Cios and Moore (2002) consider data mining a superset of statistics, as data

mining, as well as drawing from statistics, also uses concepts from several other

disciplines, such as machine learning and database technology.

2.5 Application of Data Mining within Medicine

Heath (2006) argues that for KDD and data mining results to be accepted by

clinicians and the medical community, adaptations must be made to introduce more

rigor, in the form of a scientific-method approach, into the process, and to include

provisions for hypothesis creation and null hypothesis testing within the framework.

According to Heath (2006), clinicians are sceptical of DM results, largely because

current frameworks do not support the scientific method of Null Hypothesis testing.

The null hypothesis is usually created to be demonstrated as incorrect, in order to

support an alternative hypothesis. When used in medical experiments the null

hypothesis is typically stated as there being no significant difference between

compared groups. Null Hypothesis testing is used when conducting clinical trials,

and Heath states that “the null hypothesis driven medical research paradigm must

inform DM investigative methods in the medical domain” (Heath 2006).
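
As a purely illustrative sketch of testing a null hypothesis of no difference between two compared groups, the following Python fragment applies a permutation test to the difference in group means. The permutation test is chosen here only for its simplicity; it is not a method prescribed by Heath (2006), and the function name, its parameters and the data values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the null hypothesis that the
    two groups do not differ in their mean (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed labelling as one permutation.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements for two groups; not real patient data.
group_with_condition = [162, 171, 168, 175, 180, 166]
group_without_condition = [148, 152, 150, 155, 149, 151]
print(permutation_p_value(group_with_condition, group_without_condition))
```

A small p-value suggests that the observed difference would be unlikely if the null hypothesis held; the threshold at which the null hypothesis is rejected remains a decision for the researcher.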

Heath (2006) proposes the extended CRISP-DM model as a solution to this issue.

This model uses exploratory data mining as a tool to find unknown patterns or

relationships and to create hypotheses. Confirmatory data mining is subsequently used

for null hypothesis testing. To enable the use of the extended CRISP-DM when

discovering new trends and patterns in neonatal physiological data, a challenge exists

to further extend this model to include provisions for Temporal Abstraction (TA) and

for use on multidimensional parameters.

In the extended CRISP-DM model exploratory data mining could be performed using

a technique such as association rule mining, where the data can indicate a particular

symbolic rule in the form of IF … THEN … (Bath 2004):

IF condition(s) THEN conclusion

Or, Condition(s) ⇒ Conclusion

Heath (2006) points out that these rules must be carefully considered and

analysed by domain experts before being used for prediction, and suggests the use of

rule interestingness measures for this evaluation. Rule interestingness measures are an

active research area in the field of data mining and KDD (Ohsaki, Sato et al. 2002;

Ohsaki, Sato et al. 2004; Ohsaki, Kitaguchi et al. 2005; Ohsaki, Abe et al. 2007).

Once a hypothesis has been created using exploratory data mining, a null hypothesis

can be formulated and tested using confirmatory/predictive data mining techniques

(Heath 2006).
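
Support and confidence are two commonly used measures for rules of the IF … THEN … form, and the sketch below computes them for one candidate rule over a handful of made-up records. The item labels, the record contents and the function names are hypothetical and carry no clinical meaning; the interestingness measures evaluated in the cited work may differ.

```python
from typing import FrozenSet, List

Record = FrozenSet[str]


def support(records: List[Record], items: FrozenSet[str]) -> float:
    """Fraction of records that contain every item in `items`."""
    return sum(1 for r in records if items <= r) / len(records)


def confidence(records: List[Record],
               conditions: FrozenSet[str],
               conclusion: FrozenSet[str]) -> float:
    """Of the records matching the conditions, the fraction that also
    match the conclusion (0.0 when no record matches the conditions)."""
    matching_conditions = support(records, conditions)
    if matching_conditions == 0:
        return 0.0
    return support(records, conditions | conclusion) / matching_conditions


# Hypothetical abstracted records, one per patient episode.
records = [
    frozenset({"HR:high", "BP:low", "onset"}),
    frozenset({"HR:high", "BP:low", "onset"}),
    frozenset({"HR:high", "BP:normal"}),
    frozenset({"HR:normal", "BP:low"}),
    frozenset({"HR:high", "BP:low"}),
]

# Candidate rule: IF HR:high AND BP:low THEN onset
conditions = frozenset({"HR:high", "BP:low"})
conclusion = frozenset({"onset"})
print("support:", support(records, conditions | conclusion))
print("confidence:", confidence(records, conditions, conclusion))
```

Measures of this kind only rank candidate rules; as noted above, the rules still need to be assessed by domain experts before being used for prediction.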

Predictive data mining was used by Goodwin and Maher (2000) to test hypotheses

about premature birth; however, there is no mention of null hypothesis testing. They

compared the results from 5 different modeling techniques (neural networks, logistic

regression, CART, and the purpose built software PVRuleMiner and FactMiner) to

aid in the understanding of the causes of premature birth. The purpose is to identify

patients at risk of giving birth prematurely and provide decision support for providers

of perinatal care.

Another issue with the use of data mining within medicine is the provision of data for

data mining where it is classed as a secondary use of data, that is, the data used for

mining is often not collected for research purposes, or is collected for the purposes of

another clinical research study. This leads to issues of data ownership, protection of

patient privacy and confidentiality, and appropriate use of clinical information

(Harrison Jr 2008). “The ethical, legal, and social limitations on medical data mining

relate to privacy and security considerations, fear of lawsuits, and the need to balance

the expected benefits of research against any inconvenience or possible injury to the

patient.” (Cios and Moore 2002)

There are several other issues which make data mining of medical data unique (Cios

and Moore 2002; Harrison Jr 2008):

- voluminous heterogeneous data

- high dimensionality

- temporal patterns

- often incomplete or imprecise

- high level of noise

- missing values

- redundant, insignificant, or inconsistent data objects and/or values

- special ethical, legal, and social constraints apply to medical data

- unintuitive black box methods, like artificial neural networks, may be of less

interest (Cios and Moore 2002), as the clinicians usually would like to

understand how the result (model) was reached and would prefer

transparency (white-box) – symbolic methods

Applications of data mining in healthcare focus heavily on using predictive data

mining based on pre-defined patterns, and looking for repeating patterns in the data

stream(s) for one patient (Duchene, Garbay et al. 2007). When data mining is used

for prediction based on human-defined ‘interesting’ patterns, there is a chance of

introducing biases of the clinician/researcher working on the investigation who

determined the patterns of interest (Lavrac, Keravnou et al. 2000). Another approach

is to let the ‘data speak’, using exploratory data mining to find patterns and trends

which drive new hypotheses.

In the paper “Towards Role Based Hypothesis Evaluation for Health Data Mining”

(2006), Shillabeer and Roddick propose the use of new terminology in the area of

health data mining. They suggest that using the term rule for the results of data

mining in a health context is misleading. Instead they suggest the use of hypothesis

(Shillabeer and Roddick 2006). This thesis will use hypothesis as a description of the

outcome of the data mining performed.

2.6 Temporal Abstraction

Time series appear in many different domains (Roddick and Spiliopoulou 2002),

such as finance, meteorology and medicine. The properties of time series vary in

terms of noise, volume and dimensionality. When analysing time series data, the

individual data values by themselves often provide little meaning; however, when

considered over time and context the values can become meaningful. This is

especially important in NICU settings, where it is usually trends or changes over

time, sometimes across multiple parameters, which are significant when analysing

and predicting patient conditions. When analysing time series using data mining

techniques in domains such as medicine, it is important to preserve the concept of

time and context. Preserving the time and context can be achieved by applying

temporal abstraction to the raw time series data before mining.

Temporal abstraction (TA) is a technique used to summarise time series data to a

higher level while preserving context and time. TA converts a time series into time

interval series (Azulay, Moskovitch et al. 2007), and depending on the method used,

can add qualitative information such as states and trends to a particular abstraction.

TA can be simple or complex, and complex abstractions can be done across multiple

parameters (see example on p.115, Figure 6.7). Temporal abstraction can be used to

convert a time series into a range of symbols for further analysis.

In medicine temporal abstraction is used to convert low level raw numeric time series

data into a higher level qualitative description which better matches the language

used by medical professionals (Stacey and McGregor 2007). Time series data from a

monitoring device attached to a baby in the NICU can be converted from a stream of

numbers to intervals labelled for example high/low/normal (state) or

increasing/decreasing/steady (trend), based on parameters set by domain experts.
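
A minimal sketch of such a simple state abstraction is given below, assuming a single parameter sampled at regular time steps and two fixed thresholds. The thresholds, the sample values and the names `Interval`, `state_of` and `abstract_states` are illustrative assumptions only; they are not clinical reference ranges, nor the implementation used by any of the systems cited in this chapter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Interval:
    start: int   # time of the first sample in the interval
    end: int     # time of the last sample in the interval
    label: str   # qualitative state: "low", "normal" or "high"


def state_of(value: float, low: float, high: float) -> str:
    """Map one raw value to a qualitative state using two thresholds."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def abstract_states(samples: List[Tuple[int, float]],
                    low: float, high: float) -> List[Interval]:
    """Merge consecutive samples that share a state into one interval."""
    intervals: List[Interval] = []
    for time, value in samples:
        label = state_of(value, low, high)
        if intervals and intervals[-1].label == label:
            intervals[-1].end = time      # extend the current interval
        else:
            intervals.append(Interval(time, time, label))
    return intervals


# Hypothetical readings as (time step, value); thresholds are made up.
readings = [(0, 140), (1, 150), (2, 185), (3, 190), (4, 150), (5, 95)]
for interval in abstract_states(readings, low=100, high=180):
    print(interval)
```

A trend abstraction (increasing/decreasing/steady) could be sketched in the same way by labelling the difference between successive values rather than the values themselves.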

Complex temporal abstractions can be created from simple abstractions from one or

more time series, relating these using Allen’s temporal relations. In his paper

“Maintaining Knowledge about Temporal Intervals”, Allen (1983) introduces temporal

relations as a way of representing temporal patterns in time intervals. He presents

thirteen relations (Figure 2.5) that can be used to depict the relationship between time

intervals. The relations are equal, before, meets, overlaps, during, starts, finishes and

the inverse of these (except the equal relationship). Allen describes these as “a basic

set of mutually exclusive primitive relations that can hold between temporal

intervals” (Allen 1984).

Figure 2.5: The Thirteen Possible Relationships (Allen 1983; Allen 1984)
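
The thirteen relations can be determined by comparing the end points of two intervals, as the following sketch shows. It assumes each interval is a (start, end) pair with start strictly less than end; the function name and the labels used for the inverse relations (for example "met-by") are naming choices made for this illustration.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end), assuming start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the name of the Allen relation that holds from a to b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_start == b_start and a_end == b_end:
        return "equal"
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    # Only the inverses of before, meets and overlaps remain, so
    # classify the swapped pair and translate the result.
    inverse = {"before": "after", "meets": "met-by",
               "overlaps": "overlapped-by"}
    return inverse[allen_relation(b, a)]


# Hypothetical abstracted intervals from two parameters.
high_heart_rate = (10, 20)
low_blood_pressure = (15, 30)
print(allen_relation(high_heart_rate, low_blood_pressure))
```

Because the relations are mutually exclusive, exactly one name is returned for any pair of valid intervals, which is what allows a complex abstraction to be expressed as a named relation between two simple abstractions.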

Shahar’s pivotal work presented a framework for Knowledge Based Temporal

Abstraction (KBTA) which infers abstractions based on domain-specific knowledge

stored in a formal knowledge base (Shahar 1997).

Some healthcare systems use temporal abstraction to abstract to the level of

descriptions or guidelines. Abstracting to this level enables the matching of these

abstractions for guideline execution in clinical management. An example is

the system developed by Seyfang et al (2001) for optimising oxygen supply for

newborn infants. Their system, which is part of the Asgaard framework (Shahar,

Miksch et al. 1998; Seyfang and Miksch 2004) and uses the Asbru language (Seyfang

and Miksch 2004; Fuchsberger, Hunter et al. 2005), abstracts raw monitoring data

collected by NICU monitoring devices to the abstract concepts that are used in

therapeutic plans. The data enters the system as a stream and the high-level

abstractions derived from the raw data are compared to predefined conditions

described in the therapeutic plans.

RÉSUMÉ is a system which provides a “framework for deep knowledge

representation to perform temporal abstraction of patient data” (Stacey and

McGregor 2007). It uses CAPSUL, a temporal pattern representation language

(Chakravarty and Shahar 2000; Antunes and Oliveira 2001). RÉSUMÉ is used on

stored database data in which the abstracted parameters have a low frequency. The Tzolkin

architecture uses RÉSUMÉ to create abstractions (Boaz and Shahar 2003). RASTA

(A System for Temporal Abstraction) (O'Connor, Grosso et al. 2001) adds

distributed capabilities to RÉSUMÉ to allow the system to be used for more complex

settings (Augusto 2005).

In the article “PROTEMPA: A Method for Specifying and Identifying Temporal

Sequences in Retrospective Data for Patient Selection” (Post and Harrison 2007), the

authors describe the PROTEMPA method and how it has been implemented. It has

a system for implementing temporal abstractions in stored time series data, both

lower level (simple) and higher level (complex) abstractions. These abstractions are

used for identifying pre-defined patterns in the time-series data. The system has the

potential to be used in patient monitoring and decision support where the patterns

being looked for are predefined; however, it is not used for discovering new patterns

and relationships in the abstracted data. The temporal abstraction part of this method

could be used in the pre-processing stage of data mining for clinical research;

however, the method does not include tools for mining the resulting abstractions for

new patterns and relationships. As described in Post’s doctoral thesis (2006),

"PROTEMPA is a hypothesis-testing system that scans time series data for pre-

defined mathematical and temporal patterns of interest. This strategy is in contrast to

a data mining tool that seeks to identify all meaningful patterns in a data set" (Post

2006).

Boaz and Shahar (2003) discuss the need for a temporal-abstraction database

mediator to provide a useful method for “querying not only raw data, but also its

abstractions”. In the research presented in this thesis we need to data mine the

abstractions, rather than just query them. In their paper “Idan: A Distributed

Temporal-Abstraction Mediator for Medical Databases” (Boaz and Shahar 2003) the

temporal abstraction mediator IDAN is described. IDAN uses the generic temporal

abstraction system ALMA (Boaz and Shahar 2005) for its temporal reasoning task,

and ALMA uses KBTA/Temporal Abstraction Rule (TAR) language (Balaban, Boaz

et al. 2003; Boaz, Balaban et al. 2003) and CAPSUL. IDAN is used by multiple

applications; KNAVE-II (Boaz and Shahar 2005) and DeGeL are examples.

Most of the current temporal abstraction frameworks used in healthcare create

abstractions based on expert-defined rules. Verduijn et al. published a comparative

case study performed on intensive care monitoring data, extracting meta features to

be used for prediction (Verduijn, Sacchi et al. 2007). The study compares

abstractions created using domain knowledge and data driven abstractions, and found

that the data driven abstractions created more informative meta features. Open

research areas involve the exploration of data driven abstractions, as well as the

creation of frameworks integrating processes for temporal abstraction and data

mining that can be applied to any temporal abstraction and data mining task on time-

series data.
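The contrast between knowledge-based and data-driven abstraction can be illustrated with a minimal sketch. It does not reproduce the method of Verduijn et al.; equal-frequency binning is swapped in here as a simple stand-in for a data-driven approach, and the sample values, labels and expert cut points are all hypothetical.

```python
import bisect
import statistics

LABELS = ["LOW", "NORMAL", "HIGH"]

def discretize(values, cut_points):
    """Map each value to the label of the bin it falls into."""
    # bisect_right counts the cut points <= value, which is the bin index
    return [LABELS[bisect.bisect_right(cut_points, v)] for v in values]

# Hypothetical measurements for one parameter (illustrative only)
values = [62.0, 64.0, 65.0, 66.0, 70.0, 71.0, 73.0, 90.0, 95.0]

# Knowledge-based: cut points supplied by a domain expert (assumed values)
expert_cuts = [65.0, 85.0]
expert_symbols = discretize(values, expert_cuts)

# Data-driven: cut points derived from the data as tertiles (equal-frequency bins)
data_cuts = statistics.quantiles(values, n=3)
data_symbols = discretize(values, data_cuts)

print(expert_symbols)
print(data_symbols)
```

With the expert cut points the bins are unevenly populated, whereas the data-derived cut points split this sample into three equally sized groups. Which of the two yields more informative features for a given prediction task is an empirical question.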

Recent research abstracts multidimensional time series data to produce alerts when

certain trends are detected (McGregor and Stacey 2007; Stacey, McGregor et al.

2007). Currently, the rules for detecting these trends are human-defined; however,

there may be as yet undiscovered trends and patterns that could indicate the onset of

some condition, found by analysing historical data. Opportunities exist to apply data

mining to temporally-abstracted cross correlated historical time series data of

previous NICU patients, to identify new patterns and trends that may be of

significance in the early identification of the onset of medical conditions in new

NICU patients. These trends and patterns can be used to create rules for clinical alert

systems within NICU monitoring equipment.
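As a rough illustration of how a candidate pattern found in historical abstractions might be assessed before it is turned into an alert rule, the sketch below computes the support and confidence of a rule of the form "pattern implies condition onset" across patients. The patient records, abstraction names and the function `rule_stats` are hypothetical, and the sketch ignores the temporal ordering between abstractions and onset, which a real temporal data mining method would need to take into account.

```python
def rule_stats(patients, pattern):
    """Return (support, confidence) of the rule: pattern -> condition onset.

    patients is a list of (abstractions, onset) pairs, where abstractions is
    the set of temporal abstractions observed for a patient and onset is True
    if the patient went on to develop the condition.
    """
    # Outcomes of the patients whose abstractions contain the whole pattern
    with_pattern = [onset for abstractions, onset in patients
                    if pattern <= abstractions]
    with_both = sum(with_pattern)
    support = with_both / len(patients)
    confidence = with_both / len(with_pattern) if with_pattern else 0.0
    return support, confidence

# Hypothetical historical records (illustrative only)
patients = [
    ({"HR_DECREASING", "SPO2_LOW"}, True),
    ({"HR_DECREASING"}, False),
    ({"HR_DECREASING", "SPO2_LOW"}, True),
    ({"SPO2_LOW"}, False),
    ({"HR_DECREASING", "SPO2_LOW", "BP_STEADY"}, False),
]

support, confidence = rule_stats(patients, {"HR_DECREASING", "SPO2_LOW"})
print(support, confidence)
```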

2.7 Data Mining and Temporal Abstraction

When dealing with time series data or temporal data, data mining is rarely

straightforward. Some pre-processing of the data usually needs to take place.

Antunes & Oliveira (2001) discuss some pre-processing approaches in their article

“Temporal Data Mining: an overview”, and stress that “the representation problem

is especially important when dealing with time series, since direct manipulation of

continuous, high-dimensional data in an efficient way is extremely difficult”. Of the

several approaches they present for dealing with time series, one is to use a

“transformation that maps the data to a more manageable space”, applied as part of

the pre-processing of the data before data mining occurs. The approach used in this

research is to use temporal abstraction to pre-process the data before mining the

abstractions created. CAPSUL and SDL are two languages mentioned in the article as

suitable for performing abstraction on time-series data.
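To make the idea of temporal abstraction as a pre-processing step concrete, the following minimal sketch maps a numeric time series to state intervals. It is an illustration only and is not written in CAPSUL or SDL; the function names, the sample values and the thresholds are hypothetical and carry no clinical meaning.

```python
def to_state(value, low, high):
    """Classify a single sample against two assumed thresholds."""
    if value < low:
        return "LOW"
    if value > high:
        return "HIGH"
    return "NORMAL"

def abstract_states(samples, low, high):
    """Merge consecutive samples sharing a state into (start, end, state) intervals."""
    intervals = []
    for t, v in samples:
        state = to_state(v, low, high)
        if intervals and intervals[-1][2] == state:
            # Same state as the previous sample: extend the open interval
            start, _, _ = intervals[-1]
            intervals[-1] = (start, t, state)
        else:
            # State changed: start a new interval at this timestamp
            intervals.append((t, t, state))
    return intervals

# Hypothetical (time, value) samples for one parameter
heart_rate = [(0, 120.0), (1, 125.0), (2, 95.0), (3, 90.0), (4, 185.0)]
intervals = abstract_states(heart_rate, low=100.0, high=180.0)
print(intervals)
```

The resulting intervals form a much smaller, symbolic representation of the series, which is the kind of "more manageable space" on which a mining step can then operate.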

Temporal data mining is an important extension to data mining and is discussed in

the paper “A survey of temporal knowledge discovery paradigms and methods”

(Roddick and Spiliopoulou 2002). Many approaches to temporal data mining are

covered; however the problem of providing a flexible environment to support many

temporal data mining studies on multidimensional data streams is not discussed. The

paper discusses some interesting systems that appear to partially address the areas of

interest to the research presented in this thesis. For example, the RX project uses

temporal data to discover causal relationships. Also of interest in this paper for our

research is the discussion on sequence mining and SDL (Shape Definition

Language).

Duchene et al. (Duchene, Garbay et al. 2007) have developed a prototype system to be

used in the area of home health telemedicine. Although this is a different

environment from the intensive care unit setting in terms of data rates and types of

data, the system they developed is of interest to this research due to the way the data

is pre-processed using temporal abstraction before being mined. The prototype

system mines heterogeneous multivariate time-series data for a patient to

discover and learn usual patterns for that particular patient. The purpose of the

system is to be able to detect changes in the pattern profile, which can indicate a

problem for the patient at home. In this system the focus is on one patient at a time.

The research presented in this thesis will extend this concept to mining across

multiple parameters for multiple patients to discover trends that can be indicative of

the onset of some condition.

To further assess the current state of knowledge in the area of temporal abstraction

preprocessing for data mining, a review of the literature was completed.

2.7.1 Review Method

The literature review phase focused on reviewing papers related to temporal

abstraction in IDA, with particular emphasis on abstraction of

multivariate/multidimensional data; and particularly those papers combining research

in both temporal abstraction and data mining as applied to clinical data. The

integration of DM and TA is a relatively new research area; therefore the majority of

papers reviewed were published in the past decade, with many written in the past

several years.

During this review, searches were conducted in health informatics, medical/clinical,

computer science and engineering databases. Research papers were sourced from

Science Direct, PubMed, ACM, IEEE, and the Web of Knowledge, as well as through

Google Scholar and citation searching and chaining. Papers were located through a

number of search keywords such as

“temporal abstraction”, “temporal data mining”, “time series data mining”, “temporal

abstraction + data mining”, “integration of data mining and temporal abstraction”,

“discovering temporal association rules”, “time series analysis”, as well as

combinations of these.

2.7.2 Review Results

Table 2.1, Clinical Context, contains characteristics of the clinical environment

where the systems in each paper have been applied, as well as the characteristics of

the data discussed. For each paper the table lists the frequency of the data, whether it

covers a single or several parameters, and whether it is real-time and/or distributed.

Some clinical environments such as ICU deal with high frequency, high volume

clinical and physiological data from monitoring equipment (Silvent, Dojat et al.

2004; Azulay, Moskovitch et al. 2007; Moskovitch, Stopel et al. 2007; Tusch G.

2007; Verduijn, Sacchi et al. 2007), whereas others may deal with low frequency

data such as test results over time (Ho, Kawasaki et al. 2004; Abe and Yamaguchi

2005; Post and Harrison 2007).

Two papers reviewed considered both high and low frequency data in a multi-stream

environment (Azulay, Moskovitch et al. 2007; Verduijn, Sacchi et al. 2007), and one

of these (Verduijn, Sacchi et al. 2007) also considered real-time data. Three of the

remaining papers dealt only with high frequency multi-stream data (Silvent, Dojat et

al. 2004; Charbonnier and Gentil 2007; Moskovitch, Stopel et al. 2007), and the

remaining papers utilised low frequency data (Abe and Yamaguchi 2005; Abe 2006;

Post and Harrison 2007; Sacchi, Larizza et al. 2007; Tusch G. 2007). Only Bellazzi

et al. (Bellazzi, Larizza et al. 2005) work with distributed data. Each of the papers

reviewed had a particular study as the motivation for the data collection, and hence

none created an environment for flexible exploration to support different clinical

research problems. The creation of a flexible environment to support different clinical

research problems is an open research area.

Table 2.2, Data Mining and Temporal Abstraction Technique, records the types of

abstractions that each study/system is using, such as qualitative abstractions (states,

level shifts), or quantitative abstractions using various discretization methods as a

preprocessing step to data mining (Moskovitch, Stopel et al. 2007). The knowledge

base for the abstractions is listed as well as whether or not the system supports

complex abstractions. The last column lists the data mining technique that has been

used for the particular study/system. A variety of techniques were used for the

temporal abstractions. A data driven approach to temporal abstraction was used by

Azulay and Moskovitch (Azulay, Moskovitch et al. 2007; Moskovitch, Stopel et al.

2007). Verduijn et al. (Verduijn, Sacchi et al. 2007) utilised qualitative temporal

abstractions to create state and trend abstractions. Sacchi et al. (Sacchi, Larizza et al.

2007) use Shahar’s (1997) knowledge-based approach, KBTA. Four of the papers

discussed creation of complex abstractions (Silvent, Dojat et al. 2004; Bellazzi,

Larizza et al. 2005; Charbonnier and Gentil 2007; Post and Harrison 2007). As has

been shown, many of the temporal abstraction approaches are not data driven. The

discussion of the data driven temporal abstraction approaches is limited to the

related clinical research problem; general frameworks suitable for use for a variety of

conditions are not discussed. Alignment for condition onset prediction is also not

discussed and hence these represent open research areas.
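A trend abstraction of the kind referred to above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any of the reviewed papers: the window size, the slope tolerance, the sample values and the function names are hypothetical, and fixed non-overlapping windows are used for simplicity.

```python
def slope(points):
    """Least-squares slope of value against time for (time, value) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den  # assumes at least two distinct timestamps

def trend_label(points, tolerance):
    """Turn a fitted slope into a qualitative trend symbol."""
    s = slope(points)
    if s > tolerance:
        return "INCREASING"
    if s < -tolerance:
        return "DECREASING"
    return "STEADY"

def abstract_trends(samples, window, tolerance):
    """Label each non-overlapping window as (start_time, end_time, trend)."""
    trends = []
    for i in range(0, len(samples) - window + 1, window):
        chunk = samples[i:i + window]
        trends.append((chunk[0][0], chunk[-1][0], trend_label(chunk, tolerance)))
    return trends

# Hypothetical (time, value) samples for one parameter
spo2 = [(0, 97), (1, 97), (2, 97), (3, 96), (4, 93), (5, 90),
        (6, 91), (7, 94), (8, 97)]
trends = abstract_trends(spo2, window=3, tolerance=0.5)
print(trends)
```

In a knowledge-based setting the window size and tolerance would be supplied by an expert for each parameter; in a data driven setting they would be derived from the data themselves.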

Table 2.3, Clinical knowledge and Null Hypothesis testing, records for each system

whether it has support for co-mining, indicating whether the system has approaches

for integrating data mining and clinical reasoning; whether there is support for null

hypothesis testing as discussed earlier; and whether physicians are involved in the

temporal abstraction or data mining process. None of the papers reviewed explicitly

discusses null hypothesis testing; however, one paper does discuss creating

hypotheses to be evaluated (Abe and Yamaguchi 2005). As a result the incorporation

of null hypothesis testing within the data mining framework is an open research area.

Clinical Context:

1. (Verduijn, Sacchi et al. 2007) Verduijn et al. 2007
Clinical Environment: ICU data, 924 patients
Frequency: High frequency variables measured once a minute. Low frequency variables measured several times a day.
Multiple streams?: High freq: mean arterial BP, central venous pressure, heart rate, TMP, FiO2, PEEP. Low freq: cardiac output, base excess, creatinine kinase B and glucose value.
Real-time data?: Yes
Distributed?: No

2. (Tusch G. 2007) SPOT 2007
Clinical Environment: Liver transplantation followed by ICU and clinical monitoring. Note: this is presented as a medical example, but in this paper the data is not used for processing.
Frequency: Clinical data comes in different granularities: hourly, daily, monthly, yearly
Multiple streams?: Relationship of intervals rather than single parameter values establishes the clinical concept.
Real-time data?: Processing is not performed.
Distributed?: No

3. (Azulay R. 2007) Azulay et al. 2007
Clinical Environment: ICU dataset of cardiac surgery patients (664 patients)
Frequency: High frequency variables measured once a minute. Low frequency variables measured once a day.
Multiple streams?: 2 types of data. 1. static data (age, gender, surgery type, >24 hrs ventilation). 2. temporal data with high frequency variables (6 streams) and low frequency variables (4 streams).
Real-time data?: Processing is not performed on real-time data. Real-time data is collected for evaluating different discretization techniques.
Distributed?: No

4. (Moskovitch 2007) Moskovitch et al.
Clinical Environment: ICU monitoring data (664 patients)
Frequency: Temporal data was measured each minute along the first 12 hours of hospitalization
Multiple streams?: 2 types of data: 1. static data (age, gender, surgery type, >24 hrs ventilation). 2. temporal data with high frequency variables (6 streams) and low frequency variables (4 streams).
Real-time data?: Processing is not performed on real-time data. Real-time data is collected for evaluating Morchen’s method as compared to Allen’s.
Distributed?: No

5. (Ho, Kawasaki et al. 2004) Ho et al.
Clinical Environment: Hepatitis laboratory data, 771 patients
Frequency: Not given, states that data is gathered at irregular intervals.
Multiple streams?: States that there are multiple variables – does not say how many
Real-time data?: No – historical laboratory data. Considered ‘irregular temporal data’ since gathered from many lab tests over different periods spanning 20 years.
Distributed?: Before applying methods all related data is joined by the combined key of the patient ID and test date.

6. Abe and Yamaguchi (Abe 2005) (Abe 2006)
Clinical Environment: Chronic hepatitis
Frequency: Daily, weekly, monthly (+ randomized intervals based on patient’s statements)
Multiple streams?: Includes clinical blood and urine tests on chronic hepatitis B and C
Real-time data?: No
Distributed?: No

7. (Sacchi, Larizza et al. 2007) Sacchi et al. 2007
Clinical Environment: Predicting renal flares (acute episodes of illness) in lupus nephritis (a chronic autoimmune disease). Based on 228 patients (only 172 chosen)
Frequency: Low frequency, based on parameters obtained from periodic (not regular) clinical monitoring.
Multiple streams?: Attempting to extract temporal rules to relate 4 parameters used for disease monitoring + one parameter for renal disease status. Average # of 9 measurements per patient with variable length of time series. Looked at variations in one or more variables.
Real-time data?: No
Distributed?: No

8. (Silvent, Dojat et al. 2005) Silvent et al. 2004
Clinical Environment: Weaning from mechanical ventilation – 8 patients
Frequency: ECG, systemic arterial pressure and SpO2. Airway flow and pressure signals. Signal acquisition at a sampling rate of 100Hz and resampled at 1Hz for temporal processing.
Multiple streams?: Yes, at each of the 3 stages of weaning (over a 4 hour period) physiological data is recorded and a medical assistant annotates all alarming situations. Data included physiological parameters and device settings.
Real-time data?: Yes
Distributed?: No

9. (Post and Harrison 2007) Post & Harrison 2007, PROTEMPA
Clinical Environment: HELLP syndrome (pregnancy complication in 3rd trimester) from clinical laboratory data. Data set had 761 eligible cases
Frequency: Low frequency, irregular data.
Multiple streams?: Diagnosis based on 3 clinical laboratory tests
Real-time data?: No, based on lab test results (retrospective data).
Distributed?: No

10. (Bellazzi 2005) Bellazzi R et al.
Clinical Environment: Assessment of the clinical performance of haemodialysis service. Data comes from 5800 dialysis sessions from 43 patients monitored over 19 months.
Frequency: Not stated. Based on haemodialysis data automatically collected during dialysis sessions. Data collected for each patient 3x week for 4 hours.
Multiple streams?: Data are sequences of multidimensional time series. Based on automatic measurement of 13 variables
Real-time data?: Yes
Distributed?: Yes

11. (Charbonnier S 2007) Charbonnier & Gentil 2007
Clinical Environment: Intensive care unit data collected from 8 patients collected at time of weaning from mechanical ventilation & ending of sedative drug administration.
Frequency: Data recorded every second.
Multiple streams?: Yes. Article implies data is collected from range of variables monitored (obtained from monitoring devices), but does not list these devices.
Real-time data?: Yes. Data acquisition was carried out in real-time without interference from daily care.
Distributed?: No.

Table 2.1: Clinical Context

Data Mining and Temporal Abstraction Technique:

System/authors/year TA Technique (algorithm) or: Temporal processing

Knowledge Base for TA defined by

Supports complex abstractions? (TA)

DM Technique (algorithm)

1 (Verduijn, Sacchi et al. 2007) Verduijn et al. 2007

Qualitative TA: concepts of state and trend are used for the abstraction (high-level description in terms of state and trend categories was derived for each time series over time intervals). Quantitative TA: derived from searching in a large space of numerical meta features.

In one case, by the experts.

Based only on basic states and trends. Not based on more complex abstractions, such as rate and acceleration.

Class probability trees (CART) used as supervised learning algorithm

2 (Tusch G. 2007) SPOT 2007

TA: using Allen’s rules (goal is to make KBTA more readily available for clinical research through the ontology)

Learned abstractions are submitted to the original database.

Not discussed.

Supports R statistical DM package (open source implementation of S)

3 (Azulay R. 2007) Azulay et al. 2007

Temporal discretization (pre-processing step to DM)

Data driven

Not discussed

No DM performed, focus on pre-processing

4 (Moskovitch 2007) Moskovitch et al.

SAX: discretization method designed for time series data (uses an approximate distance function that lower bounds the Euclidean distance). Order of values of data taken into account only in preprocessing stage. Persist: univariate discretization method for KD in time series, explicitly considers the order of the variables in the time series. Assumes that any time series comes from uniform sampling (not the case in slow domains that are sampled infrequently and manually)

Time series are discretized based on categories provided by an expert (this approach is compared to categories chosen from a data driven computational source)

Unsupervised TSKM mining method which results in a set of phrases

5 (Ho, Kawasaki et al. 2004) Ho et al.

Goal: determine small number of typical abstraction patterns that can be used to characterize most real sequences. Approach: combination of TA primitive computing with human inspection using visual tools and expert opinions

Create set of abstraction patterns viewed as a combination of TA primitives (so that each test sequence can be assigned to an abstraction pattern) Found 8 abstraction patterns for short-term changed tests and 22 for long term tests

Applied various DM methods to extract knowledge: • C5.0 and association rules in system Clementine (SPSS) • Their rule induction method LUPC and decision tree induction CARBO implemented in their system D2MS (Data Mining with Model Selection)

6 Abe and Yamaguchi (Abe 2005)

Pattern extraction involves first extracting sub-sequences and then clustering (they call this time series pattern extraction – do not use term TA)

Unclear

Not discussed

Developed a tool based on constructive meta-learning called CAMLET. Used PART as implemented in Weka to evaluate improvement of the pattern extraction algorithm.

7 (Sacchi, Larizza et al. 2007) Sacchi et al. 2007

Shahar’s KBTA

User defined

Starting from TA representation, they run an algorithm for the extraction of temporal rules expressing temporal relationships between the detected temporal patterns

Algorithm implements a search strategy based on apriori technique where the quality of a rule is assessed in terms of confidence and support (described in (Bellazzi 2005) as temporal data mining)

8 (Silvent, Dojat et al. 2005) Silvent et al. 2004

Symbolic trends were computed for every parameter, with associated characteristic periods. Trend was computed using linear regression on a window whose size was determined according to the dynamics of the parameter under study and called ‘characteristic span’.

Corresponds to the identification of the prior knowledge and/or extracted information necessary to drive the abstraction;

Yes. Identified two patterns that differ according to 2 thresholds on time delay and variation means, which when applied to SpO2 can be used to recognize a desaturation.

Searched for relations between complex abstractions. Identified 3 association rules (1 was evaluated by clinicians as correct). Seems like the association rule was based on only 1 parameter value

9 (Post and Harrison 2007) Post & Harrison. 2007

FW contains an algorithm source which describes how the temporal data should be processed. Low level mechanisms based on a sliding window. TAs accomplished via execution of R for Java (algorithms are written as functions in the R language)

FW contains knowledge source which defines interval relationships that define abstractions of interest. Relationships are specified as min and max temporal distances between the endpoints of participating inte
