
  • University of Western Sydney School of Computing and Mathematics

    A Framework for Temporal

    Abstractive Multidimensional Data

    Mining

    Heidi Bjering Stratti

    A dissertation submitted in fulfilment of the requirements of

    Master of Science (Hons)

    2008

  • Acknowledgements

    There are many people who deserve acknowledgement for their role in this thesis

    becoming a reality. Juggling research while working full time and being there for my

    two children has been quite a difficult balancing act which I have not always

    balanced very well. This last year has been a very difficult year for my family and

    me, and I had many moments where I was unable to see how I would manage to

    finish this research work.

    First and foremost, the person who has had the most impact in this thesis becoming a

    reality, despite being located in Canada, is my principal supervisor Dr Carolyn

    McGregor. I could not have done this without her fantastic guidance, encouragement

    and support. She got me through to the finish line during what has been one of my

    toughest years. When I was feeling at my lowest and was completely discouraged,

    she managed within an hour on Skype to get me going again. Never once a negative

    word (even when I thoroughly deserved it). I would like to thank Carolyn for

    providing me with academic guidance as well as friendship through these last couple

    of years.

    I also wish to thank my co-supervisor, Dr Mark Tracy, who came on board when

    Carolyn relocated to Canada.

    My fantastic sons, Robbie and Daniel, keep on impressing me with their take on life.

    I wish to thank them for their support and putting up with me while I have been

    doing this work.

    I also would like to thank Darren for being by my side through this very challenging

    year, for putting up with all my stress, and for the many dinners he has cooked for

    me.

    My friend Alphia has always been there ready to listen when I needed to talk. I thank

    Alphia for her friendship, support and encouragement – and the many refreshing

    walks we have had to clear our heads.

  • Statement of Authentication

    The work presented in this thesis is, to the best of my knowledge and belief, original

    except as acknowledged in the text. I hereby declare that I have not submitted this

    material, either in full or in part, for a degree at this or any other institution.

    …………………………………………………………

    Heidi Bjering Stratti

  • Table of Contents

    Acknowledgements
    Statement of Authentication
    Table of Contents
    List of Tables
    Abstract
    1. Chapter 1 – Introduction
        1.1 Why is temporal abstractive data mining important to health and medicine
        1.2 Research motivation
        1.3 Research aims and objectives
        1.4 Contribution to knowledge
        1.5 Research method
        1.6 Thesis overview
    2. Chapter 2 – Literature Review
        2.1 Introduction
        2.2 Knowledge Discovery in Data
        2.3 KDD and Intelligent Data Analysis
        2.4 Data Mining
        2.5 Application of Data Mining within Medicine
        2.6 Temporal Abstraction
        2.7 Data Mining and Temporal Abstraction
            2.7.1 Review Method
            2.7.2 Review Results
        2.8 Conclusions and impact on future research
    3. Chapter 3 – The NICU environment
        3.1 The Australian Context
        3.2 Admittance to NICU
        3.3 Medical Devices and NICU monitoring
        3.4 Alerts
        3.5 Known Physiological Onset Predictors
        3.6 Similarities between NICU and other environments
    4. Chapter 4 – Solution Manager Service and e-Baby Architecture
        4.1 e-Baby Architecture
        4.2 Solution Manager Service
        4.3 Conclusion and Implication on this Research
    5. Chapter 5 – Design of methodology for multidimensional TA DM framework (TAMDDM)
        5.1 Multi Agent System
            5.1.1 Database Access Server
            5.1.2 Processing Agent
            5.1.3 Temporal Agent
            5.1.4 Relative Agent
            5.1.5 Functional Agent
            5.1.6 Rules Generating Agent
        5.2 Extended CRISP-DM Model
            5.2.1 Data Understanding
            5.2.2 Data Preparation
            5.2.3 Modeling
            5.2.4 Modelling: DM Rule-Set Generation and Select Significant Rule-Set
            5.2.5 Modelling: Formulate Null Hypothesis
            5.2.6 Modelling: Run Statistical Processes to test Null Hypothesis
            5.2.7 Evaluation: Load Accepted Rule-Sets into IDSS
        5.3 TAMDDM Framework Tasks
            5.3.1 Local Collection and Cleanup
            5.3.2 Temporal Abstractions, Simple & Complex, Multi-Stream
            5.3.3 Relative Alignment
            5.3.4 Exploratory Data Mining across multiple streams for multiple patients
            5.3.5 Confirmatory Data Mining with Null Hypothesis
            5.3.6 Hypothesis/Rule generation
        5.4 Data storage
            5.4.1 Clinical Data
            5.4.2 Diagnosis
            5.4.3 Patient Diagnoses
            5.4.4 Physiological Data
            5.4.5 Physiological Parameter
            5.4.6 Temporal Rules
            5.4.7 Temporal Data
            5.4.8 Relative Rule
            5.4.9 Relative Temporal Abstractions
            5.4.10 RuleBase Data
        5.5 Conclusion
    6. Chapter 6 – Demonstration within the NICU context
        6.1 Processing Agent
            6.1.1 Mapping external source data to DBAS data structure
                6.1.1.1 Mapping external source table Baby to DBAS table Baby
                6.1.1.2 Mapping external source table Diagnosis to DBAS table BabyConditions
                6.1.1.3 Mapping external source table Condition to DBAS table Conditions
                6.1.1.4 Mapping external source flat file physiologicalData to DBAS table PhysData
            6.1.2 Mapping from DBAS data structure to internal TAMDDM data stores
                6.1.2.1 Mapping DBAS table Baby to TAMDDM table Patient
                6.1.2.2 Mapping DBAS table PhysData to TAMDDM table PatientPhysiologicalParameter
                6.1.2.3 Mapping DBAS table MFC to TAMDDM table PhysiologicalParameter
                6.1.2.4 Mapping DBAS table BabyCondition to TAMDDM table PatientDiagnosis
                6.1.2.5 Mapping DBAS table Conditions to TAMDDM table Diagnosis
        6.2 Temporal Agent
        6.3 Relative Agent
        6.4 Functional Agent
        6.5 Rules Generating Agent
        6.6 Future Research Application
        6.7 Issues associated with Case Study
        6.8 Impact of Demonstration
        6.9 Conclusion
    7. Chapter 7 – Conclusion
        7.1 Summary
        7.2 Contributions
        7.3 Limitations and further research
        7.4 Impact
        7.5 Conclusion
    References

  • List of Tables

    Table 2.1: Clinical Context
    Table 2.2: Data Mining and Temporal Abstraction Technique
    Table 2.3: Clinical knowledge and Null Hypothesis testing
    Table 3.1: NSW/ACT Hospital Classification of NICU Services
    Table 3.2: Apgar score
    Table 3.3: Component Monitoring System (CMS) data types
    Table 6.1: Mapping of Baby table (source) to Baby table (DBAS)
    Table 6.2: Mapping of Diagnosis table (source) to BabyConditions table (DBAS)
    Table 6.3: Mapping of Condition table (source) to Conditions table (DBAS)
    Table 6.4: Mapping of physiological data in flat file (source) to PhysData table (DBAS)
    Table 6.5: Mapping of Baby table (DBAS) to Baby table (TAMDDM)
    Table 6.6: Mapping of physiological data PhysData table (DBAS) to PatientPhysiologicalParameter table (TAMDDM)
    Table 6.7: Mapping of MFC table (DBAS) to PhysiologicalParameter table (TAMDDM)
    Table 6.8: Mapping of BabyCondition (DBAS) to PatientDiagnosis (TAMDDM)
    Table 6.9: Mapping of Conditions table (DBAS) to Diagnosis table (TAMDDM)
    Table 6.10: Raw SaO2 readings
    Table 6.11: Abstractions created from all SaO2 readings in Table 6.10
    Table 6.12: Blood pressure values
    Table 6.13: Abstractions created from all blood pressure readings in Table 6.12
    Table 6.14: Complex Abstractions
    Table 6.15: Temporal Abstractions using absolute start and end times
    Table 6.16: relative temporal abstractions of abstractions from Table 6.11, using diagnosis time of 20061116_13:29:02.001 as the event of interest for realignment of the patients data

  • List of Figures

    Figure 1.1: Elements of constructive research
    Figure 2.1: Overview of the steps constituting the KDD process (Fayyad, Piatetsky-Shapiro et al. 1996)
    Figure 2.2: The phases of the CRISP-DM reference model (CRISP-DM 2000)
    Figure 2.3: The six-step DMKD process model (Cios and Moore 2002)
    Figure 2.4: Relationship between clinical research and clinical management
    Figure 2.5: The Thirteen Possible Relationships (Allen 1983; Allen 1984)
    Figure 3.1: NICU physiological monitor (Neonatology on the Web)
    Figure 4.1: The e-Baby Architecture (McGregor, Heath et al. 2005)
    Figure 4.2: Solution Manager Service
    Figure 5.1: The TAMDDM Framework
    Figure 5.2: Solution Manager Service (SMS)
    Figure 5.3: Multi-agent framework (Foster 2008)
    Figure 5.4: The extended multi agent system.
    Figure 5.5: The multi agent layer in the TAMDDM framework
    Figure 5.6: Processing Agent
    Figure 5.7: Temporal Agent
    Figure 5.8: Relative Agent
    Figure 5.9: Functional Agent
    Figure 5.10: Rules Generating Agent
    Figure 5.11: Parallelism between CRISP-DM and the Scientific Method (Heath 2006)
    Figure 5.12: CRISP-DM extended for Clinical Practice and Research as proposed by Heath's thesis (2006)
    Figure 5.13: Extended CRISP-DM model in the TAMDDM framework
    Figure 5.14: Data Understanding
    Figure 5.15: Data Preparation
    Figure 5.16: Modelling: DM Rule-set Generation and Select Significant Rule-set phases
    Figure 5.17: Formulate Null Hypothesis phase
    Figure 5.18: Run Statistical Processes to test Null Hypothesis
    Figure 5.19: Load Accepted Rule-sets into IDSS phase
    Figure 5.20: Local Collection and Cleanup
    Figure 5.21: Temporal Abstractions
    Figure 5.22: Relative Alignment
    Figure 5.23: Data before relative alignment
    Figure 5.24: Data after relative alignment
    Figure 5.25: Exploratory Data Mining
    Figure 5.26: Confirmatory Data Mining with Null Hypothesis
    Figure 5.27: Hypothesis/Rule generation
    Figure 5.28: TAMDDM data store
    Figure 5.29: Patient Table
    Figure 5.30: Diagnosis Table
    Figure 5.31: PatientDiagnosis Table
    Figure 5.32: PatientPhysiologicalParameter Table
    Figure 5.33: PhysiologicalParameter Table
    Figure 5.34: TA_Rule Table
    Figure 5.35: TemporalAbstraction Table
    Figure 5.36: Study Table
    Figure 5.37: TA_RelativeTime Table
    Figure 6.1: Result database structure passed from DBAS to agents in the multi-agent system (Foster 2008)
    Figure 6.2: Structure of the source data regarding babies
    Figure 6.3: Structure of the source physiological data
    Figure 6.4: modified BabyCondtions table
    Figure 6.5: Graphing of SaO2 values against a threshold of 90%
    Figure 6.6: Graphing of blood pressure values against a threshold of 24mm/Hg
    Figure 6.7: Complex Abstraction
    Figure 6.8: realignment of abstracted parameters relative to diagnosis

  • Abstract

    In the industrialised world, premature birth has been recognised as one of the most

    significant perinatal health issues (Kramer, Platt et al. 1998). In Australia 8.1% of

    babies are born before 37 weeks gestation (Laws, Abeywardana et al. 2007).

    Premature babies often have prolonged stays in Neonatal Intensive Care Units

    (NICUs) and can suffer from a number of different conditions during their stay.

    Some of these conditions have been shown to produce variations in a patient's

    physiological parameters that can indicate onset before the condition can

    be detected by other means. Medical monitoring equipment produces large volumes of

    data, which makes manual analysis of this data impossible. Adding to the

    complexity of the large datasets is the nature of physiological monitoring data – the

    data is multidimensional, where it is not only changes in individual dimensions that

    are significant, but sometimes simultaneous changes in several dimensions. As the

    time-series produced by the monitoring equipment is temporal, there is a need for

    clinical research frameworks that enable both the dimensionality and temporal

    behaviour to be preserved during data mining.

    The aim of this research is to extend previous research that proposed a framework to

    support analysis and trend detection in historical data from Neonatal Intensive Care

    Unit (NICU) patients. The extensions contribute to fundamental data mining

    framework research through the integration of temporal abstraction and support of

    null hypothesis testing within the data mining processes. The application of this new

    data mining approach is the analysis of level shifts and trends in historical temporal

    data and the cross-correlation of data mining findings across multiple data streams for

    multiple neonatal intensive care patients in an attempt to discover new hypotheses

    indicative of the onset of some condition. These hypotheses can then be evaluated

    and defined as rules to be applied in the monitoring of neonates in real-time to enable

    early detection of possible onset of conditions. This can assist in faster decision

    making which in turn may avoid conditions developing into serious problems where

    treatment may be futile.

    This research employs a constructive research method. In this research, the problem

    is the inability of current data mining frameworks to completely support clinical

    research in multidimensional temporal data. This research has resulted in the design

    of a temporal abstraction multidimensional data mining (TAMDDM) framework

    suitable for clinical research in multidimensional temporal time series data. The

    framework is demonstrated through a case study with neonatal intensive care

    monitoring data.

  • 1. Chapter 1 – Introduction

    This thesis presents a framework to enable multi-dimensional data mining on time

    series data that exhibits temporal behaviours. This research is demonstrated through

    a case study utilising physiological time series data streams, together with other

    clinical data streams collected from patients in Neonatal Intensive Care Units

    (NICUs) for the detection of trends and patterns in multi-dimensional stored

    physiological data. The purpose of detecting these trends and patterns is to recognise

    indicators for the onset of some condition in the neonate, to enable the creation of

    hypotheses that can be transformed into rules suitable for use in intelligent

    monitoring systems.

    The large volumes of data generated in medicine render traditional manual data

    analysis inadequate (Lavrac 1999). Data mining is the process of analysing large

    amounts of data to find new patterns and relationships within the data, by using

    techniques such as statistics, machine learning and pattern recognition.

    Multi-dimensional data is data that consists of more than one variable. In physiological data

    streams, multi-dimensional data means that for one patient there are values for

    several different data streams, for example arterial oxygen saturation AND blood

    pressure. Each of these data streams separately represents a single dimension. When

    combined together the data becomes multi-dimensional. Temporal abstraction (TA)

    is a technique used to summarise time series data to a higher level while preserving

    context and time, usually adding qualitative information such as states and trends to a

    particular abstraction. Sections of data that satisfy a given criterion, such as

    high (or low), can be summarised with a start time and end time for when this criterion

    holds, as well as a label (high or low) to describe the abstraction. This is particularly

    relevant for a NICU setting, where it is usually trends or changes in the physiological

    data over time, sometimes across multiple parameters, which are significant when

    analysing and predicting patient conditions.
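    The state abstraction described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the names Abstraction and abstract_states, the sample readings and the threshold value are assumptions chosen purely for illustration. Consecutive readings that share the same label are collapsed into one interval carrying a label, a start time and an end time.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Abstraction:
    """One temporal abstraction: a label holding from start to end."""
    label: str
    start: float
    end: float


def abstract_states(samples: Sequence[Tuple[float, float]],
                    threshold: float) -> List[Abstraction]:
    """Collapse (time, value) samples into labelled intervals.

    A reading below the threshold is labelled "low", otherwise "normal".
    Both labels and the threshold are hypothetical.
    """
    intervals: List[Abstraction] = []
    for time, value in samples:
        label = "low" if value < threshold else "normal"
        if intervals and intervals[-1].label == label:
            # Same state as the previous reading: extend the open interval.
            last = intervals[-1]
            intervals[-1] = Abstraction(label, last.start, time)
        else:
            # State changed (or first reading): open a new interval.
            intervals.append(Abstraction(label, time, time))
    return intervals


# Invented oxygen saturation readings as (time, percent) pairs.
readings = [(0, 95), (1, 93), (2, 88), (3, 87), (4, 91)]
result = abstract_states(readings, threshold=90)
for interval in result:
    print(interval)
```

    Five raw readings are reduced here to three intervals, which is the sense in which the abstraction summarises the series while keeping its timing.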

    The remainder of this chapter is structured as follows. First, temporal abstractive data

    mining and its importance to health and medicine are discussed, before the motivation

    for the research presented in this thesis is described. The discussion on motivation

    for this research leads to the section outlining the research aims and objectives by

    listing the research hypotheses for this thesis, before listing the contributions to

    knowledge that have resulted from this work. The research method utilised for this

    research is presented, before the overview of the content of this thesis concludes the

    chapter.

    1.1 Why is temporal abstractive data mining important to health and medicine

    The medical industry generates large amounts of medical data, of which very little is

    used for extracting useful information (Hanson 2006). For example, modern

    equipment in intensive care units (ICUs) produces vast amounts of data from monitors

    connected to patients (Horn 2001). The situation is the same in NICUs where the

    focus of this research lies. Babies in a NICU usually have a range of monitoring and

    life support devices attached to them. These devices produce large amounts of data

    which can be essential in deciding on treatment options. However, the amount of

    data makes it very hard for the neonatologist to extract useful information. The data

    is displayed on monitors, often in waveforms, and a static ‘picture’ is recorded by

    staff at regular intervals. Between each manual recording there can be small changes

    to the data that are never noted. These changes may be important in being able to

    predict the onset of some condition. Sepsis, a common illness in neonates, has been

    shown to exhibit changes in physiological data before the condition can be diagnosed

    through blood cultures (Griffin and Moorman 2001; Griffin, O'Shea et al. 2003;

    Griffin, O'Shea et al. 2004; Griffin, Lake et al. 2005; Griffin, Lake et al. 2007). This

    indicates that subtle changes that may not be apparent through the current practice of

    manual recording of the physiological data at regular intervals can be important in

    detecting the onset of conditions in neonates. There could also be indicators that

    manifest across multiple parameters and together indicate the onset of some condition.

    This situation has created the need for systems aimed at clinical management to help

    analyse the data produced by the monitoring and life support devices connected to

    the babies. These systems look for certain trends or patterns that have previously

    been defined, often by experts in the field. However, “human-defined rules risk

    capturing the biases of one expert” (Lavrac, Keravnou et al. 2000); therefore, clinical

    research on historical data is needed to find new, previously undiscovered trends and

    patterns. Modern technology allows this data to be used as input to processes that

    attempt to derive information and new knowledge, known as knowledge discovery in

    data (KDD).

    “The gap between data generation and data comprehension is widening in all

    fields of human activity. In medicine, overcoming this gap is particularly

    crucial since medical decision making needs to be supported by arguments

    based on basic medical knowledge as well as knowledge, regularities and

    trends extracted from data.” (Lavrac, Keravnou et al. 2000)

    There is a need for frameworks to facilitate clinical research on stored historical

    physiological patient monitoring data to enable the discovery of previously unknown

    trends and patterns that may be indicative of the onset of some condition. Research in

    this kind of data brings challenges: the volume of the data is massive; the data is

    multidimensional, as for each patient there are several parameters which cannot be

    considered in isolation from each other; and the data is temporal, which means that

    individual data values do not necessarily provide much meaning on their own but

    can provide meaning when considered in relation to their neighbouring values.

    Therefore each individual value must be considered in both time and

    context. Opportunities exist for the exploration of data to determine the existence of

    pre-onset behaviours for multiple conditions and diagnoses.
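    The multidimensional challenge noted above, where parameters cannot be considered in isolation, can be made concrete with a small sketch. It is an illustration under stated assumptions and not the method of this thesis: the function name overlaps, the two streams and their interval values are all hypothetical. Given intervals during which each of two streams was abstracted as "low", it returns the periods in which both states held at the same time.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end), inclusive


def overlaps(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the periods where an interval in a and an interval in b coincide.

    Assumes each list is sorted by start time and its intervals do not
    overlap one another.
    """
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            # Both states hold from start to end.
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Invented "low" intervals for two streams of one patient.
low_oxygen_saturation = [(2, 5), (9, 12)]
low_blood_pressure = [(4, 10)]
both_low = overlaps(low_oxygen_saturation, low_blood_pressure)
print(both_low)
```

    Neither stream on its own shows when the two low states coincide; that information only appears once the abstractions of both dimensions are compared on a common time axis.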

    1.2 Research motivation

    As will be shown in the literature review in chapter two, there is an absence of

    flexible multidimensional approaches to data mining of time series data. Chapter two

    also discusses the need for representation of temporal behaviour in this type of data

    to enable this temporal behaviour to be preserved in the mining process, as often

    individual data values by themselves do not have much meaning; however, when

    considered over time and in context, meaning can be derived.

    Current monitoring systems used in NICUs are not capable of monitoring cross

    correlated temporal rules in multiple data streams. Current research by Stacey and

    McGregor (McGregor and Stacey 2007; Stacey, McGregor et al. 2007) in this area is

    showing the possibilities of such systems becoming a reality. For these systems to be

    effective, cross correlated rules for multiple parameters must be created for use by

    the alarming component of the system. Currently such rules are created by domain

    experts; however, there is potential for as yet undiscovered rules to be hidden in the vast

    amounts of data produced by the monitoring equipment connected to the patient. To

    enable creation of hypotheses that can be turned into such rules, there is need for a

    framework that enables clinical research in this type of multi-dimensional temporal

    data, where the rigour necessary for medical research, including null-hypothesis

    testing, can be accommodated.

    With a reduction in storage cost and a corresponding increase in the ability to collect

    and store temporal data through real-time clinical monitoring, there comes the

    opportunity to analyse collected data along time (Moskovitch 2007). This is

    especially significant in clinical environments, where individual data elements in

    clinical records may not be meaningful outside of a particular temporal context.

    However, when considered over time and context, the values and their inter-

    relationships can become significant. This is particularly true in acute care settings,

    such as neonatal care, where it is usually trends or changes over time, sometimes

    across multiple parameters, which are significant when predicting the onset of future

    patient conditions. As an added complexity, patients who have the same condition

    may have substantially different types and timing of observations, unlike retail data

    which generally have comparable complements of data elements obtained at similar

    times (Harrison Jr 2008).

    1.3 Research aims and objectives

    The issues of clinical research in physiological data together with a review of the

    current functionality of data mining introduced in the two previous sections led to the

    creation of 5 research hypotheses. The research hypotheses of this thesis are that:

    1. A multidimensional data mining (MDDM) framework can be defined for

    clinical research to discover trends and patterns indicative of the onset of

    some condition.

    2. The abovementioned framework will include methods for applying temporal

    abstraction (TA) across multiple parameters for multiple patients to enable

    mining of multi-dimensional temporal data.

    3. The Temporal Abstractive Multidimensional Data Mining (TAMDDM) framework can be applied in a neonatal context.

    4. The TAMDDM framework can support null hypothesis testing.

    5. The hypotheses generated by the framework can be used by a real-time event

    stream processor analysing the current condition of babies in a Neonatal

    Intensive Care Unit (NICU).

    1.4 Contribution to knowledge

    The areas of research contribution to knowledge resulting from this thesis are:

    • Extensions to a multi agent framework previously designed for analysing time series data, to facilitate temporal abstraction and realignment of

    these abstractions (as presented in chapter 5)

    • Incorporation of the extended CRISP-DM model within the multi agent framework (as presented in chapter 5).

    • Design of a framework to enable temporally abstractive multi dimensional data mining (as presented in chapter 5)

    • Enhancement of the interaction between clinical research and clinical management by generating a framework for clinical research which can

    produce hypotheses that will feed into intelligent monitoring systems used

    in clinical management. The clinical research framework uses as input

    data from the various monitoring equipment used in clinical management

    processes (as presented in chapter 5 and chapter 6)

    1.5 Research method

    For this research a constructive research method is used. This is a research method

    widely used in computer science disciplines, information systems, management

    accounting and medical domains (Kasanen 1993; Curry 2000; Shaw 2001; Shapiro

    2003). The key in constructive research is the development of a new construct in

    response to an explicit problem (Lassenius, Soininen et al. 2001). The construct,

    which can be a new model, software, theory, framework or algorithm, is tested for

    usability and enables theoretical conclusions to be made. The aim is to take a real

    world practical problem and produce a real world solution.

    Lassenius et al (2001) describe 6 phases of the constructive research method. These

    phases can be iterative and recursive:

    1. Find a practically relevant problem

    2. Obtain an understanding of the topic and the problem

    3. Innovate, i.e., construct a solution idea

    4. Demonstrate that the solution works

    5. Show theoretical connections and research contribution

    6. Examine the scope of applicability

    For this research the practically relevant problem is the need to be able to use stored

    clinical data across multiple parameters to identify trends and level shifts that can be

    relevant in early recognition of problems for patients, in this case neonates.

    Obtaining an understanding of the topic and the problem is achieved by reviewing

    literature in the area of knowledge discovery in databases, data mining in medicine

    and temporal abstraction. Designing a framework for the temporal abstraction and

    data mining of multidimensional parameters completes the phase of constructing a

    solution idea. Applying the framework tasks to examples from a neonatal intensive

    care unit (NICU) demonstrates that the solution can be applied in a NICU setting.

    Finally the thesis will show the theoretical connections and research contributions

    made, as well as examine the scope of applicability in other areas.

    [Diagram labels: construction (problem solving); practical relevance; theory connection; practical functioning; theoretical contribution.]

    Figure 1.1: Elements of constructive research

    The goal of constructive research, as described by Lassenius et al (2001), is to

    produce “Innovative constructs, intended to solve problems faced in the real world

    and, by that means, to make a contribution to the theory of the discipline in which

    it is applied” (Lassenius, Soininen et al. 2001).

    1.6 Thesis overview

    Chapter 2 presents a literature review of the areas of influence for this thesis, mainly

    knowledge discovery in databases (KDD), data mining and temporal abstraction. The

    chapter explores these areas in their application to medical systems in particular, to

    expose the open health informatics research areas that led to the formation of the

    research hypotheses addressed by the techniques proposed in this research. In chapter

    3, further background is provided by describing the neonatal intensive care unit

    (NICU) environment which provides the setting for the Temporal Abstractive

    MultiDimensional Data Mining (TAMDDM) framework designed in this thesis.

    Before describing the design of the TAMDDM framework, chapter 4 gives an

    introduction to the previous research that this research builds on, the e-Baby project

    and the solution manager service (SMS), which contains the analytical processor

    where the TAMDDM framework will reside. Chapter 5 describes the TAMDDM framework

    components, including the multi-agent system and its extensions, the extended

    CRISP-DM model and the TAMDDM tasks. Chapter 6 demonstrates

    how the TAMDDM framework can be used for conducting clinical research within

    the NICU domain. Chapter 7 provides the conclusion to the thesis, summarising the

    contributions of this work and provides direction for future research.

    2. Chapter 2 – Literature Review

    2.1 Introduction

    The main motivation for the research in this thesis is the identified gap that exists

    between clinical management and clinical research (Foster, McGregor et al. 2005).

    This research is particularly interested in the intensive care unit (ICU) environment

    where observation of the patient’s condition is supported through the provision of

    several physiological time series data streams via medical monitors such as heart rate

    and mean blood pressure. There exists the possibility of discovering new knowledge

    that may exist in patient data, which can indicate the onset of some condition such as

    sepsis (a blood poisoning condition).

    Patient data, in particular monitoring data, is inherently temporal. Based on the

    research hypotheses presented in Chapter 1, this literature review chapter will mainly

    focus on temporal abstraction and data mining, which are part of the overall

    knowledge discovery in data (KDD) process. Data mining is the chosen discovery

    technique, and as patient data, in particular monitoring data, is naturally temporal, a

    technique is needed to preserve the temporal aspect of the data when it is mined.

    Therefore temporal abstraction forms part of the preprocessing step.

    The remainder of this chapter is structured as follows. First the area of Knowledge

    Discovery in Data is introduced. The relationship between Knowledge Discovery in

    Data and Intelligent Data Analysis is presented. The data mining component within

    Knowledge Discovery in Data is defined and the issues exposed by other researchers

    relating to the application of Data Mining within the domain of medicine are

    discussed. The concept of temporal abstraction is introduced. Previous researchers’

    use of temporal abstraction to support the preprocessing within data mining is

    presented and analysed to uncover open research areas that led to the creation of the

    research hypotheses for this thesis.

    2.2 Knowledge Discovery in Data

    Although the area of medical informatics is a relatively new research area,

    application of analytical techniques in medicine has a long history. In the mid

    eighteen hundreds researchers collected data and looked for patterns in an attempt to

    provide some causal understanding of the pandemic in London at the time (Brown

    2008).

    At present, the ability to gather and store data is steadily increasing, resulting in

    “data rich times” (Brown 2008). Efficient techniques are needed to enable the

    analysing and understanding of these resources. The possibility of the discovery of

    hidden knowledge in this massive amount of data is driving the development of the

    field of Knowledge Discovery in Databases (KDD) and data mining. Fayyad and

    Stolorz (1997) define data mining as “the application of specific algorithms for

    extracting patterns from data”. To enable knowledge to be derived from the data, a

    larger process needs to envelop the data mining step. This process includes data

    preparation and selection, cleaning and preprocessing of the data, incorporating any prior

    knowledge about the data into the process, as well as interpretation of the mining

    results. KDD refers to this overall process of discovering knowledge in data (Figure

    2.1) where data mining is one of the steps, using specialised tools for this task

    (Holmes and Peek 2007). Holmes and Peek (2007) state that KDD is a “process

    where the data are used for hypothesis generation”, and that it provides a framework to

    avoid “fishing expeditions”, also called data dredging.

    Figure 2.1: Overview of the steps constituting the KDD process (Fayyad, Piatetsky-Shapiro et al.

    1996)

    With the large amounts of data collected by various applications and processes,

    manual data analysis and interpretation is not feasible, can be very subjective and is

    impractical as data volumes grow exponentially (Fayyad, Piatetsky-Shapiro et al.

    1996). As Fayyad et al (1996) point out, “the true value of such data lies in the

    users’ ability to extract useful reports, spot interesting events and trends, support

    decisions and policy based on statistical analysis and inference, and exploit the data

    to achieve business, operational, or scientific goals”. These user goals were a driver

    behind the development of the KDD research area in the late eighties and early

    nineties. The term KDD was conceived in 1989 to “emphasize that ‘knowledge’ is

    the end product of a data-driven discovery” (Fayyad and Stolorz 1997). Knowledge

    Discovery in Databases (KDD) is sometimes also referred to as Knowledge

    Discovery in Data (Heath 2006).

    Fayyad et al (1996) define the KDD process as: “The nontrivial process of

    identifying valid, novel, potentially useful, and ultimately understandable patterns in

    data.” Here, pattern is referring to a subset of the data, or a model that is relevant to

    that subset and should be valid for new data, and process is indicating several

    iterative and interactive steps: data preparation, search for patterns, knowledge

    evaluation, refinement.

    Fayyad et al (1996) define 9 steps in the KDD process:

    1. Learning the application domain

    2. Creating a target dataset

    3. Data cleaning and preprocessing

    4. Data reduction and projection

    5. Choosing the function of data mining

    6. Choosing the data mining algorithm(s)

    7. Data mining

    8. Interpretation

    9. Using discovered knowledge

    In their book Data mining: A knowledge discovery approach (Cios, Swiniarski et al.

    2007), the authors stress that knowledge discovery focuses on the whole process,

    including before and after modeling, rather than just the data mining modeling part.

    They call this the Knowledge Discovery Process (KDP), “a process that seeks new

    knowledge about an application domain” (Cios, Swiniarski et al. 2007). They cite

    Fayyad’s 9 step KDD model above as the leading research model for KDD, and the

    CRISP-DM (Cross-Industry Standard Process for Data Mining) as the leading

    industrial model. The CRISP-DM model consists of six phases (Figure 2.2):

    1. Business Understanding

    2. Data understanding

    3. Data preparation

    4. Modeling

    5. Evaluation

    6. Deployment

    Figure 2.2: The phases of the CRISP-DM reference model (CRISP-DM 2000)

    As Figure 2.2 illustrates through the outer circle with clockwise arrows, the CRISP-DM

    model is iterative. The model is described at four levels of abstraction, namely:

    1. Phase

    2. Generic tasks

    3. Specialised tasks

    4. Process instance

    Generic tasks are tasks that should be done for any data mining circumstance. The

    specialised tasks illustrate how the generic tasks should be done in a particular

    situation, while the process instance level records what occurred in a particular

    deployment of a particular phase of the process model (CRISP-DM 2000).

    CRISP-DM was developed in 1996, and the goal was to be “industry-, tool- and

    application-neutral” (CRISP-DM 2000). A step-by-step data mining guide is

    available at http://www.crisp-dm.org. The guide provides instructions for each level

    of task within each phase (CRISP-DM 2000).

    Cios and Moore (2002) discuss the six step DMKD process model (Figure 2.3)

    which is an extension to the CRISP-DM, and is described in a later book on data

    mining and knowledge discovery (Cios, Swiniarski et al. 2007) as a hybrid model,

    combining facets from the CRISP-DM model and academic models.

    Figure 2.3: The six-step DMKD process model (Cios and Moore 2002)

    Cios et al (2007) present a comparison table comparing 5 KDD models. The

    comparison demonstrates that although the number and names of steps in each model

    varies, they all include understanding of the domain, preprocessing of the data, data

    mining and evaluation. Data preparation is identified in all the models as the most

    time consuming part of any KDD model.

    KDD systems draw on research from a variety of fields, some of which are

    databases, machine learning, pattern recognition, statistics, artificial intelligence and

    reasoning, data visualisation and high performance computing (Fayyad, Piatetsky-

    Shapiro et al. 1996). It is beyond the scope of this thesis to cover all the areas of

    KDD. The main focus of this thesis is in the area of data preprocessing and data

    mining, two of the steps in the overall process of KDD.

    2.3 KDD and Intelligent Data Analysis

    Some authors use the term Intelligent Data Analysis (IDA) when discussing the

    KDD processes. According to Stacey and McGregor (2007), “KDD is primarily

    concerned with learning new knowledge whereas IDA is directed toward application

    of knowledge for data interpretation”. That is, IDA is utilising existing knowledge to

    look for the existence of instances of that knowledge and take action for those

    instances. For the research presented in this thesis, the primary interest is in

    developing a framework for aiding the discovery of new knowledge, integrating the

    benefits of temporal abstraction and data mining in this discovery process. However

    it is also a goal that this new knowledge can be utilised in the intelligent data analysis

    of real time streaming patient data, therefore any rules developed from hypotheses

    created by the framework described in chapter 5 should feed back into the IDA

    process described by Stacey et al (McGregor and Stacey 2007; Stacey, McGregor et

    al. 2007), as depicted in Figure 2.4 below, showing the relationship between clinical

    research and clinical management.

    [Diagram with two panels, Clinical Research and Clinical Management. Labels: monitoring devices; physiological data; data validation; temporal abstraction; inference engine; alerts; stored physiological & clinical data; stored abstractions; data mining; hypotheses; domain knowledge; rule creation; temporal abstraction rules; inference rules; alert rules.]

    Figure 2.4: Relationship between clinical research and clinical management

    2.4 Data Mining

    One of the common parts in the various KDD processes discussed in the previous

    section is the area of data mining. Data mining is the activity used in KDD for the

    actual discovery of patterns and relationships in data. “Data mining involves fitting

    models to or determining patterns from observed data. The fitted models play the role

    of inferred knowledge. Deciding whether or not the models reflect useful knowledge

    is part of the overall interactive KDD process for which subjective human judgement

    is usually required” (Fayyad, Piatetsky-Shapiro et al. 1996).

    Data mining is used for many different problem types. The CRISP-DM data mining

    guide (CRISP-DM 2000) offers the following list of problem types:

    - Data description and summarisation

    - Segmentation

    - Concept descriptions

    - Classification

    - Prediction

    - Dependency analysis

    A variety of techniques are available to solve the different types of problems (CRISP-DM 2000):

    - Clustering techniques

    - Neural nets

    - Visualization techniques

    - Rule induction methods

    - Conceptual clustering

    - Discriminant analysis

    - Decision tree learning

    - K-nearest neighbor

    - Case-based reasoning

    - Genetic algorithms

    - Regression analysis

    - Regression trees

    - Box-Jenkins methods

    - Correlation analysis

    - Association rules

    - Bayesian networks

    - Inductive logic programming

    Several techniques can be suitable for a particular type of problem (e.g., discriminant

    analysis, rule induction methods, decision tree learning, neural nets, k nearest

    neighbor, case-based reasoning, and genetic algorithms could all be appropriate

    techniques to use for a classification problem (CRISP-DM 2000)). Similarly, a

    particular technique can be suitable for more than one type of data mining problem.

    Neural nets, for example, are a suitable technique for segmentation, classification and

    prediction problems (CRISP-DM 2000).

    Data mining techniques are often described as supervised or unsupervised learning

    (Bath 2004). Supervised learning involves using a training set of the data that

    includes the ‘answer’ to the classification, before using a test data set without the

    ‘answer’ to be classified. Unsupervised learning gives no information to the data

    mining tool about the classification; the data mining technique used creates the

    classifications based on the data it is exposed to.
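    The distinction can be illustrated with a minimal sketch. It is not taken from the cited sources: the function names, the heart-rate readings and the labels are hypothetical, and a one-nearest-neighbour rule and a two-group k-means are used only because they are among the simplest supervised and unsupervised techniques.

```python
from statistics import mean


def nearest_neighbour_predict(training_set, value):
    """Supervised: label a new value with the label of the closest training example."""
    _, label = min(training_set, key=lambda pair: abs(pair[0] - value))
    return label


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two groups with a simple k-means."""
    low, high = min(values), max(values)  # initial group centres
    if low == high:
        return sorted(values), []  # nothing to separate
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)


# Hypothetical heart-rate readings (beats per minute), invented for illustration.
# The training set carries the 'answer' (label) for each reading.
training_set = [(95, "low"), (100, "low"), (150, "normal"), (155, "normal")]
print(nearest_neighbour_predict(training_set, 148))

# The same kind of readings without labels: the groups emerge from the data alone.
print(two_means([95, 100, 150, 155, 148, 98]))
```

    In the supervised case the labels come from the training set; in the unsupervised case the technique only returns groups, and naming them is left to the analyst.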

    Cios and Moore (2002) consider data mining a superset of statistics, because data

    mining, as well as drawing from statistics, also uses concepts from several other

    disciplines, such as machine learning and database technology.

    2.5 Application of Data Mining within Medicine

    Heath (2006) argues that for KDD and data mining results to be accepted by

    clinicians and the medical community, adaptations must be made to introduce more

    rigor, in the form of a scientific-method approach, into the process, and to include

    provisions for hypothesis creation and null hypothesis testing within the framework.

    According to Heath (2006), clinicians are sceptical of DM results, largely because

    current frameworks do not support the scientific method of Null Hypothesis testing.

    The null hypothesis is usually created to be demonstrated as incorrect, in order to

    support an alternative hypothesis. When used in medical experiments the null

    hypothesis is typically stated as there being no significant difference between

    compared groups. Null Hypothesis testing is used when conducting clinical trials,

    and Heath states that “the null hypothesis driven medical research paradigm must

    inform DM investigative methods in the medical domain” (Heath 2006).
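    As a hedged illustration of what testing a null hypothesis of "no significant difference between compared groups" can look like, the sketch below uses a permutation test on the difference of group means. This technique is chosen here only for its simplicity; it is not the method prescribed by Heath (2006), and the function name, the sample values and the number of trials are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, trials=2000, seed=0):
    """Estimate the p-value for the null hypothesis of no difference in means.

    Group labels are shuffled repeatedly; the p-value is the fraction of
    shuffles whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials


# Hypothetical measurements for two patient groups, invented for illustration.
group_with_condition = [140, 142, 145, 147, 150]
group_without_condition = [120, 122, 125, 127, 130]
p_value = permutation_p_value(group_with_condition, group_without_condition)
print("reject null hypothesis at the 0.05 level:", p_value < 0.05)
```

    A small p-value suggests that the observed difference would be unlikely if the null hypothesis were true, which supports the alternative hypothesis.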

    Heath (2006) proposes the extended CRISP-DM model as a solution to this issue.

    This model uses exploratory data mining as a tool to find unknown patterns or

    relationships and to create hypotheses. Confirmatory data mining is subsequently used

    for null hypothesis testing. To enable the use of the extended CRISP-DM when

    discovering new trends and patterns in neonatal physiological data, a challenge exists

    to further extend this model to include provisions for Temporal Abstraction (TA) and

    for use on multidimensional parameters.

    In the extended CRISP-DM model exploratory data mining could be performed using

    a technique such as association rule mining, where the data can indicate a particular

    symbolic rule in the form of IF … THEN …(Bath 2004):

    IF condition(s) THEN conclusion

    Or, Condition(s) ⇒ Conclusion

    Heath (2006) points out that these rules must be carefully considered and

    analysed by domain experts before being used for prediction, and suggests the use of

    rule interestingness measures for this evaluation. Rule interestingness measures are an

    active research area in the field of data mining and KDD (Ohsaki, Sato et al. 2002;

    Ohsaki, Sato et al. 2004; Ohsaki, Kitaguchi et al. 2005; Ohsaki, Abe et al. 2007).

    Once a hypothesis has been created using exploratory data mining, a null hypothesis

    can be formulated and tested using confirmatory/predictive data mining techniques

    (Heath 2006).
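    To make the IF … THEN rule form and the idea of an interestingness measure concrete, the following sketch computes support and confidence, two simple and widely used objective measures for association rules. It is an illustration only: the function name, the record contents and the item labels are hypothetical, and the cited studies evaluate many other measures.

```python
def rule_measures(records, conditions, conclusion):
    """Return (support, confidence) for the rule IF conditions THEN conclusion.

    support    = fraction of all records containing the conditions and the conclusion
    confidence = fraction of the records containing the conditions that also
                 contain the conclusion
    """
    conditions = set(conditions)
    with_conditions = [r for r in records if conditions <= r]
    with_rule = [r for r in with_conditions if conclusion in r]
    support = len(with_rule) / len(records)
    confidence = len(with_rule) / len(with_conditions) if with_conditions else 0.0
    return support, confidence


# Hypothetical abstracted patient records, invented for illustration.
records = [
    {"HR high", "BP low", "condition X"},
    {"HR high", "BP low", "condition X"},
    {"HR high", "BP low"},
    {"HR high"},
    {"BP low"},
]

# Rule under evaluation: IF HR high AND BP low THEN condition X
support, confidence = rule_measures(records, {"HR high", "BP low"}, "condition X")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

    A rule with high confidence but very low support may rest on only a handful of records, which is one reason such measures assist, rather than replace, evaluation by domain experts.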

    Predictive data mining was used by Goodwin and Maher (2000) to test hypotheses

    about premature birth; however, there is no mention of null hypothesis testing. They

    compared the results from 5 different modeling techniques (neural networks, logistic

    regression, CART and the purpose-built software PVRuleMiner and FactMiner) to

    aid in the understanding of the causes of premature birth. The purpose is to identify

    patients at risk of giving birth prematurely and provide decision support for providers

    of perinatal care.

    Another issue with the use of data mining within medicine is the provision of data for

    data mining where it is classed as a secondary use of data, that is, the data used for

    mining is often not collected for research purposes, or collected for the purposes of

    another clinical research study. This leads to issues of data ownership, protection of

    patient privacy and confidentiality, and appropriate use of clinical information

    (Harrison Jr 2008). “The ethical, legal, and social limitations on medical data mining

    relate to privacy and security considerations, fear of lawsuits, and the need to balance

    the expected benefits of research against any inconvenience or possible injury to the

    patient.” (Cios and Moore 2002)

    There are several other issues which make data mining of medical data unique (Cios

    and Moore 2002; Harrison Jr 2008):

    - voluminous heterogeneous data

    - high dimensionality

    - temporal patterns

    - often incomplete or imprecise

    - high level of noise

    - missing values

    - redundant, insignificant, or inconsistent data objects and/or values

    - special ethical, legal, and social constraints apply to medical data

    - unintuitive black box methods, like artificial neural networks, may be of less

    interest (Cios and Moore 2002), as the clinicians usually would like to

    understand how the result (model) was reached and would prefer

    transparency (white-box) – symbolic methods

    Applications of data mining in healthcare focus heavily on using predictive data

    mining based on pre-defined patterns, and looking for repeating patterns in the data

    stream(s) for one patient (Duchene, Garbay et al. 2007). When data mining is used

    for prediction based on human-defined ‘interesting’ patterns, there is a chance of

    introducing biases of the clinician/researcher working on the investigation who

    determined the patterns of interest (Lavrac, Keravnou et al. 2000). Another approach

    is to let the ‘data speak’, using exploratory data mining to find patterns and trends

    which drive new hypotheses.

    In the paper “Towards Role Based Hypothesis Evaluation for Health Data Mining”

    (2006), Shillabeer and Roddick propose the use of new terminology in the area of

    health data mining. They suggest that using the term rule for the results of data

    mining in a health context is misleading. Instead they suggest the use of hypothesis

    (Shillabeer and Roddick 2006). This thesis will use hypothesis as a description of the

    outcome of the data mining performed.

    2.6 Temporal Abstraction

    Time series appear in many different domains (Roddick and Spiliopoulou 2002),

    such as finance, meteorology and medicine. The properties of time series vary in

    terms of noise, volume and dimensionality. When analysing time series data, the

    individual data values by themselves often provide little meaning; however, when

    considered over time and context the values can become meaningful. This is

    especially important in NICU settings, where it is usually trends or changes over

    time, sometimes across multiple parameters, which are significant when analysing

    and predicting patient conditions. When analysing time series using data mining

    techniques in domains such as medicine, it is important to preserve the concept of

    time and context. Preserving the time and context can be achieved by applying

    temporal abstraction to the raw time series data before mining.

    Temporal abstraction (TA) is a technique used to summarise time series data to a

    higher level while preserving context and time. TA converts a time series into time

    interval series (Azulay, Moskovitch et al. 2007), and depending on the method used,

    can add qualitative information such as states and trends to a particular abstraction.

    TA can be simple or complex, and complex abstractions can be done across multiple

    parameters (see example on p.115, Figure 6.7). Temporal abstraction can be used to

    convert a time series into a range of symbols for further analysis.

    In medicine temporal abstraction is used to convert low level raw numeric time series

    data into a higher level qualitative description which better matches the language

    used by medical professionals (Stacey and McGregor 2007). Time series data from a

    monitoring device attached to a baby in the NICU can be converted from a stream of

    numbers to intervals labelled for example high/low/normal (state) or

    increasing/decreasing/steady (trend), based on parameters set by domain experts.
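    A minimal sketch of such simple state and trend abstractions is given below. It is not the implementation used in any of the cited systems: the function names, the thresholds, the tolerance and the sample readings are hypothetical stand-ins for parameters that would be set by domain experts.

```python
from itertools import groupby


def state_of(value, low, high):
    """Map one numeric value to a qualitative state."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def abstract_states(samples, low, high):
    """Turn (time, value) samples into (start, end, state) intervals."""
    labelled = [(t, state_of(v, low, high)) for t, v in samples]
    intervals = []
    for state, run in groupby(labelled, key=lambda pair: pair[1]):
        times = [t for t, _ in run]
        intervals.append((times[0], times[-1], state))
    return intervals


def abstract_trends(samples, tolerance):
    """Label each step between consecutive samples, then merge equal neighbours."""
    steps = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        change = v1 - v0
        if change > tolerance:
            trend = "increasing"
        elif change < -tolerance:
            trend = "decreasing"
        else:
            trend = "steady"
        steps.append((t0, t1, trend))
    intervals = []
    for trend, run in groupby(steps, key=lambda step: step[2]):
        run = list(run)
        intervals.append((run[0][0], run[-1][1], trend))
    return intervals


# Hypothetical heart-rate stream as (minute, beats per minute) pairs.
heart_rate = [(0, 150), (1, 152), (2, 171), (3, 175), (4, 174), (5, 158)]
print(abstract_states(heart_rate, low=100, high=160))
print(abstract_trends(heart_rate, tolerance=2))
```

    Each resulting interval carries a start time, an end time and a qualitative label, so the temporal context of the original readings is preserved in a form that can be mined.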

    Complex temporal abstractions can be created from simple abstractions from one or

    more time series, relating these using Allen’s temporal relations. In his paper

    Maintaining knowledge about Temporal Intervals, Allen (1983) introduces temporal

    relations as a way of representing temporal patterns in time intervals. He presents

    thirteen relations (Figure 2.5) that can be used to depict the relationship between time

    intervals. The relations are equal, before, meets, overlaps, during, starts, finishes and

    the inverse of these (except the equal relationship). Allen describes these as “a basic

    set of mutually exclusive primitive relations that can hold between temporal

    intervals” (Allen 1984).

    Figure 2.5: The Thirteen Possible Relationships (Allen 1983; Allen 1984)
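    The seven base relations named above can be checked with a few comparisons of interval end points. The sketch below is an illustrative assumption rather than Allen's own formulation: the function name and the example intervals are hypothetical, intervals are taken as (start, end) pairs with start before end, and the six inverse relations are reported generically as "inverse of …".

```python
def allen_relation(a, b):
    """Return the Allen relation that holds from interval a to interval b."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("each interval must satisfy start < end")
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equal"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_end == b_end and a_start > b_start:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    # None of the seven base relations holds, so one must hold from b to a.
    return "inverse of " + allen_relation(b, a)


# Hypothetical abstraction intervals (in minutes), invented for illustration.
heart_rate_high = (2, 6)
blood_pressure_low = (4, 9)
print(allen_relation(heart_rate_high, blood_pressure_low))
```

    Because the relations are mutually exclusive, exactly one label is returned for any pair of valid intervals, which is what allows complex abstractions to be built from simple ones.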

    Shahar’s pivotal work presented a framework for Knowledge Based Temporal

    Abstraction (KBTA) which infers abstractions based on domain-specific knowledge

    stored in a formal knowledge base (Shahar 1997).

    Some healthcare systems use temporal abstraction to abstract to the level of

    descriptions or guidelines. Abstracting to this level enables the matching of these

    abstractions for guideline execution in clinical management. One example is

    the system developed by Seyfang et al (2001) for optimising oxygen supply for

    newborn infants. Their system, which is part of the Asgaard framework (Shahar,

    Miksch et al. 1998; Seyfang and Miksch 2004) and uses the Asbru language (Seyfang

    and Miksch 2004; Fuchsberger, Hunter et al. 2005), abstracts raw monitoring data

    collected by NICU monitoring devices to the abstract concepts that are used in

    therapeutic plans. The data enters the system as a stream and the high-level

    abstractions derived from the raw data are compared to predefined conditions

    described in the therapeutic plans.

    RÉSUMÉ is a system which provides a “framework for deep knowledge

    representation to perform temporal abstraction of patient data” (Stacey and

    McGregor 2007). It uses CAPSUL, a temporal pattern representation language

    (Chakravarty and Shahar 2000; Antunes and Oliveira 2001). RÉSUMÉ operates on

    stored database data in which the abstracted parameters are sampled at low frequency. The Tzolkin

    architecture uses RÉSUMÉ to create abstractions (Boaz and Shahar 2003). RASTA

    (A System for Temporal Abstraction) (O'Connor, Grosso et al. 2001) adds

    distributed capabilities to RÉSUMÉ to allow the system to be used for more complex

    settings (Augusto 2005).

    In the article “PROTEMPA: A Method for Specifying and Identifying Temporal

    Sequences in Retrospective Data for Patient Selection” (Post and Harrison 2007), the

    authors describe the PROTEMPA method and how it has been implemented. It includes

    a system for creating temporal abstractions from stored time series data, both

    lower level (simple) and higher level (complex) abstractions. These abstractions are

    used for identifying predefined patterns in the time series data. The system has the

    potential to be used in patient monitoring and decision support where the patterns

    being looked for are predefined; however, it is not used for discovering new patterns

    and relationships in the abstracted data. The temporal abstraction part of this method

    could be used in the pre-processing stage of data mining for clinical research,

    but the method does not include tools for mining the resulting abstractions for

    new patterns and relationships. As described in Post’s doctoral thesis (2006),

    "PROTEMPA is a hypothesis-testing system that scans time series data for pre-

    defined mathematical and temporal patterns of interest. This strategy is in contrast to

    a data mining tool that seeks to identify all meaningful patterns in a data set" (Post

    2006).

    Boaz and Shahar (2003) discuss the need for a temporal-abstraction database

    mediator to provide a useful method for “querying not only raw data, but also its

    abstractions”. In the research presented in this thesis we need to data mine the

    abstractions, rather than just query them. Their paper “Idan: A Distributed

    Temporal-Abstraction Mediator for Medical Databases” (Boaz and Shahar 2003) describes the

    temporal abstraction mediator IDAN. IDAN uses the generic temporal

    abstraction system ALMA (Boaz and Shahar 2005) for its temporal reasoning task,

    and ALMA uses KBTA/Temporal Abstraction Rule (TAR) language (Balaban, Boaz

    et al. 2003; Boaz, Balaban et al. 2003) and CAPSUL. IDAN is used by multiple

    applications, including KNAVE-II (Boaz and Shahar 2005) and DeGeL.

    Most of the current temporal abstraction frameworks used in healthcare create

    abstractions based on expert-defined rules. Verduijn et al. published a comparative

    case study performed on intensive care monitoring data, extracting meta features to

    be used for prediction (Verduijn, Sacchi et al. 2007). The study compared

    abstractions created using domain knowledge with data-driven abstractions, and found

    that the data-driven abstractions produced more informative meta features. Open

    research areas involve the exploration of data driven abstractions, as well as the

    creation of frameworks integrating processes for temporal abstraction and data

    mining that can be applied to any temporal abstraction and data mining task on time-

    series data.

    Recent research abstracts multidimensional time series data to produce alerts when

    certain trends are detected (McGregor and Stacey 2007; Stacey, McGregor et al.

    2007). Currently, the rules for detecting these trends are human-defined; however,

    there may be as yet undiscovered trends and patterns that could indicate the onset of

    some condition, found by analysing historical data. Opportunities exist to apply data

    mining to temporally-abstracted cross correlated historical time series data of

    previous NICU patients, to identify new patterns and trends that may be of

    significance in the early identification of the onset of medical conditions in new

    NICU patients. These trends and patterns can be used to create rules for clinical alert

    systems within NICU monitoring equipment.

    2.7 Data Mining and Temporal Abstraction

    When dealing with time series data or temporal data, data mining is rarely

    straightforward. Some pre-processing of the data usually needs to take place.

    Antunes & Oliveira (2001) discuss some pre-processing approaches in their article

    “Temporal Data Mining: an overview”, and stress that “the representation problem

    is especially important when dealing with time series, since direct manipulation of

    continuous, high-dimensional data in an efficient way is extremely difficult”. Several

    approaches for dealing with time-series are presented. One of the possible solutions

    suggested is to use a “transformation that maps the data to a more manageable

    space”. The transformation would be included in the pre-processing of the data

    before data mining occurs. The paper presents several possible ways of pre-

    processing the data. The approach used in this research is to use temporal abstraction

    to pre-process the data before mining the abstractions created. CAPSUL and SDL are

    two languages mentioned in this article as suitable to perform abstraction on time-

    series data.
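
    To make the idea of mapping a raw series to a more manageable space concrete, the sketch below abstracts a numeric series into a short list of trend intervals. It is an assumption-laden illustration, not CAPSUL, SDL or the method of any reviewed paper: the series is cut into non-overlapping windows, each window is labelled by the sign of its least-squares slope, and adjacent windows with the same label are merged. The function names, the window size and the tolerance are hypothetical choices.

```python
from typing import List, Sequence, Tuple


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their indices 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to estimate a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def trend_abstraction(series: Sequence[float], window: int,
                      tolerance: float) -> List[Tuple[str, int, int]]:
    """Return (label, first_index, last_index) trend intervals.

    Trailing samples that do not fill a complete window are ignored.
    """
    intervals: List[Tuple[str, int, int]] = []
    for start in range(0, len(series) - window + 1, window):
        s = slope(series[start:start + window])
        if s > tolerance:
            label = "increasing"
        elif s < -tolerance:
            label = "decreasing"
        else:
            label = "steady"
        end = start + window - 1
        if intervals and intervals[-1][0] == label:
            # Same trend as the previous window: extend that interval.
            intervals[-1] = (label, intervals[-1][1], end)
        else:
            intervals.append((label, start, end))
    return intervals


if __name__ == "__main__":
    raw = [1, 2, 3, 4, 5, 6, 6, 6, 6, 5, 4, 3]
    print(trend_abstraction(raw, window=3, tolerance=0.1))
```

    In this example twelve raw values reduce to three labelled intervals, and it is this smaller symbolic representation that would be passed on to the data mining step.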

    Temporal data mining is an important extension to data mining and is discussed in

    the paper “A survey of temporal knowledge discovery paradigms and methods”

    (Roddick and Spiliopoulou 2002). Many approaches to temporal data mining are

    covered; however, the problem of providing a flexible environment to support many

    temporal data mining studies on multidimensional data streams is not discussed. The

    paper discusses some interesting systems that appear to partially address the areas of

    interest to the research presented in this thesis. For example, the RX project uses

    temporal data to discover causal relationships. Also of interest in this paper for our

    research is the discussion on sequence mining and SDL (Shape Definition

    Language).

    Duchene et al (Duchene, Garbay et al. 2007) have developed a prototype system to be

    used in the area of home health telemedicine. Although this is a different

    environment from the intensive care unit setting in terms of data rates and types of

    data, the system they developed is of interest to this research due to the way the data

    is pre-processed using temporal abstraction before being mined. The prototype

    system mines heterogeneous multivariate time series data for a patient to

    discover and learn usual patterns for that particular patient. The purpose of the

    system is to be able to detect changes in the pattern profile, which can indicate a

    problem for the patient at home. In this system the focus is on one patient at a time.

    The research presented in this thesis will extend this concept to mining across

    multiple parameters for multiple patients to discover trends that can be indicative of

    the onset of some condition.

    To further assess the current state of knowledge in the area of temporal abstraction

    preprocessing for data mining, a review of the literature was completed.

    2.7.1 Review Method

    The literature review phase focused on reviewing papers related to temporal

    abstraction in IDA, with particular emphasis on abstraction of

    multivariate/multidimensional data; and particularly those papers combining research

    in both temporal abstraction and data mining as applied to clinical data. The

    integration of DM and TA is a relatively new research area; therefore, the majority of

    papers reviewed were published in the past decade, with many written in the past

    several years.

    During this review, searches were conducted in health informatics, medical/clinical,

    computer science and engineering databases. Research papers were

    sourced from Science Direct, PubMed, ACM, IEEE and the

    Web of Knowledge, supplemented by Google Scholar, citation searching and

    chaining. Papers were located using search keywords such as

    “temporal abstraction”, “temporal data mining”, “time series data mining”, “temporal

    abstraction + data mining”, “integration of data mining and temporal abstraction”,

    “discovering temporal association rules”, “time series analysis”, as well as

    combinations of these.

    2.7.2 Review Results

    Table 2.1, Clinical Context, contains characteristics of the clinical environment

    where the systems in each paper have been applied, as well as the characteristics of

    the data discussed. For each paper the table lists the frequency of the data, whether one

    or several parameters are involved, and whether the data is real-time and/or distributed.

    Some clinical environments such as ICU deal with high frequency, high volume

    clinical and physiological data from monitoring equipment (Silvent, Dojat et al.

    2004; Azulay, Moskovitch et al. 2007; Moskovitch, Stopel et al. 2007; Tusch G.

    2007; Verduijn, Sacchi et al. 2007), whereas others may deal with low frequency

    data such as test results over time (Ho, Kawasaki et al. 2004; Abe and Yamaguchi

    2005; Post and Harrison 2007).

    Two papers reviewed considered both high and low frequency data in a multi-stream

    environment (Azulay, Moskovitch et al. 2007; Verduijn, Sacchi et al. 2007), and one

    of these (Verduijn, Sacchi et al. 2007) also considered real-time data. Three of the

    remaining papers dealt only with high frequency multi-stream data (Silvent, Dojat et

    al. 2004; Charbonnier and Gentil 2007; Moskovitch, Stopel et al. 2007), and the

    remaining papers utilised low frequency data (Abe and Yamaguchi 2005; Abe 2006;

    Post and Harrison 2007; Sacchi, Larizza et al. 2007; Tusch G. 2007). Only Bellazzi

    et al. (Bellazzi, Larizza et al. 2005) work with distributed data. Each of the

    papers reviewed had a particular study as the motivation for the

    data collection; hence, none of the papers created an environment for flexible

    exploration to support different clinical research problems. The creation of a flexible

    environment to support different clinical research problems is an open research area.

    Table 2.2, Data Mining and Temporal Abstraction Technique, records the types of

    abstractions that each study/system is using, such as qualitative abstractions (states,

    level shifts), or quantitative abstractions using various discretization methods as a

    preprocessing step to data mining (Moskovitch, Stopel et al. 2007). The knowledge

    base for the abstractions is listed as well as whether or not the system supports

    complex abstractions. The last column lists the data mining technique that has been

    used for the particular study/system. A variety of techniques were used for the

    temporal abstractions. A data driven approach to temporal abstraction was used by

    Azulay and Moskovitch (Azulay, Moskovitch et al. 2007; Moskovitch, Stopel et al.

    2007). Verduijn et al (Verduijn, Sacchi et al. 2007) utilised qualitative temporal

    abstractions to create state and trend abstractions. Sacchi et al (Sacchi, Larizza et al.

    2007) use Shahar’s (1997) knowledge-based approach, KBTA. Four of the papers

    discussed creation of complex abstractions (Silvent, Dojat et al. 2004; Bellazzi,

    Larizza et al. 2005; Charbonnier and Gentil 2007; Post and Harrison 2007). As has

    been shown, many of the temporal abstraction approaches are not data-driven. The

    discussion of the data-driven temporal abstraction approaches is limited to the

    related clinical research problem; general frameworks suitable for a variety of

    conditions are not discussed. Alignment for condition onset prediction is also not

    discussed; hence, these represent open research areas.
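
    The distinction between expert-defined and data-driven abstraction can be illustrated with a simple discretization sketch. Equal-frequency binning is used here purely as a stand-in for a data-driven method; it is not SAX, Persist or any other technique from the reviewed papers. The function names, the sample values and the "expert" cut-points are hypothetical and have no clinical meaning.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_frequency_cutpoints(values: Sequence[float], bins: int) -> List[float]:
    """Derive bins-1 cut-points so that each bin holds roughly the same number of values."""
    if bins < 2 or len(values) < bins:
        raise ValueError("need at least 2 bins and at least one value per bin")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // bins] for k in range(1, bins)]


def discretize(values: Sequence[float], cutpoints: Sequence[float],
               labels: Sequence[str]) -> List[str]:
    """Map each value to a state label; cutpoints must be sorted ascending.

    A value equal to a cut-point falls into the upper of the two bins.
    """
    if len(labels) != len(cutpoints) + 1:
        raise ValueError("need exactly one more label than cut-points")
    return [labels[bisect_right(cutpoints, v)] for v in values]


if __name__ == "__main__":
    values = [60, 62, 65, 70, 72, 75, 80, 90, 110]   # illustrative numbers only
    labels = ["low", "medium", "high"]

    data_driven = equal_frequency_cutpoints(values, bins=3)
    expert_defined = [100, 160]                        # hypothetical expert cut-points

    print(data_driven, discretize(values, data_driven, labels))
    print(expert_defined, discretize(values, expert_defined, labels))
```

    With these numbers the data-driven cut-points spread the values evenly across the three states, whereas the fixed cut-points place almost every value in a single state. This shows how the source of the knowledge base changes the abstractions that are later mined.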

    Table 2.3, Clinical knowledge and Null Hypothesis testing, indicates for each

    system whether it supports co-mining, whether it has approaches for

    integrating data mining and clinical reasoning, whether it supports null

    hypothesis testing as discussed earlier, and whether physicians are

    involved in the temporal abstraction or data mining process. None of the papers

    reviewed explicitly discusses null hypothesis testing; however,

    one paper does discuss creating hypotheses to be evaluated (Abe and Yamaguchi

    2005). As a result, the incorporation of null hypothesis testing within the data mining

    framework is an open research area.

    Clinical Context:

    1. Verduijn et al. 2007 (Verduijn, Sacchi et al. 2007)
       Clinical environment: ICU data, 924 patients.
       Frequency: High frequency variables measured once a minute. Low frequency variables measured several times a day.
       Multiple streams?: High freq: mean arterial BP, central venous pressure, heart rate, TMP, FiO2, PEEP. Low freq: cardiac output, base excess, creatinine kinase B and glucose value.
       Real-time data?: Yes
       Distributed?: No

    2. SPOT 2007 (Tusch G. 2007)
       Clinical environment: Liver transplantation followed by ICU and clinical monitoring. Note: this is presented as a medical example, but in this paper the data is not used for processing.
       Frequency: Clinical data comes in different granularities: hourly, daily, monthly, yearly.
       Multiple streams?: Relationship of intervals rather than single parameter values establishes the clinical concept.
       Real-time data?: Processing is not performed.
       Distributed?: No

    3. Azulay et al. 2007 (Azulay R. 2007)
       Clinical environment: ICU dataset of cardiac surgery patients (664 patients).
       Frequency: High frequency variables measured once a minute. Low frequency variables measured once a day.
       Multiple streams?: 2 types of data. 1. static data (age, gender, surgery type, >24 hrs ventilation). 2. temporal data with high frequency variables (6 streams) and low frequency variables (4 streams).
       Real-time data?: Processing is not performed on real-time data. Real-time data is collected for evaluating different discretization techniques.
       Distributed?: No

    4. Moskovitch et al. (Moskovitch 2007)
       Clinical environment: ICU monitoring data (664 patients).
       Frequency: Temporal data was measured each minute along the first 12 hours of hospitalization.
       Multiple streams?: 2 types of data: 1. static data (age, gender, surgery type, >24 hrs ventilation). 2. temporal data with high frequency variables (6 streams) and low frequency variables (4 streams).
       Real-time data?: Processing is not performed on real-time data. Real-time data is collected for evaluating Morchen’s method as compared to Allen’s.
       Distributed?: No

    5. Ho et al. (Ho, Kawasaki et al. 2004)
       Clinical environment: Hepatitis laboratory data, 771 patients.
       Frequency: Not given, states that data is gathered at irregular intervals.
       Multiple streams?: States that there are multiple variables – does not say how many.
       Real-time data?: No – historical laboratory data. Considered ‘irregular temporal data’ since gathered from many lab tests over different periods spanning 20 years.
       Distributed?: Before applying methods all related data is joined by the combined key of the patient ID and test date.

    6. Abe and Yamaguchi (Abe 2005) (Abe 2006)
       Clinical environment: Chronic hepatitis.
       Frequency: Daily, weekly, monthly (+ randomized intervals based on patient’s statements).
       Multiple streams?: Includes clinical blood and urine tests on chronic hepatitis B and C.
       Real-time data?: No
       Distributed?: No

    7. Sacchi et al. 2007 (Sacchi, Larizza et al. 2007)
       Clinical environment: Predicting renal flares (acute episodes of illness) in lupus nephritis (a chronic autoimmune disease). Based on 228 patients (only 172 chosen).
       Frequency: Low frequency, based on parameters obtained from periodic (not regular) clinical monitoring.
       Multiple streams?: Attempting to extract temporal rules to relate 4 parameters used for disease monitoring + one parameter for renal disease status. Average # of 9 measurements per patient with variable length of time series. Looked at variations in one or more variables.
       Real-time data?: No
       Distributed?: No

    8. Silvent et al. 2004 (Silvent, Dojat et al. 2005)
       Clinical environment: Weaning from mechanical ventilation – 8 patients.
       Frequency: ECG, systemic arterial pressure and SpO2. Airway flow and pressure signals. Signal acquisition at a sampling rate of 100Hz and resampled at 1Hz for temporal processing.
       Multiple streams?: Yes, at each of the 3 stages of weaning (over a 4 hour period) physiological data is recorded and a medical assistant annotates all alarming situations. Data included physiological parameters and device settings.
       Real-time data?: Yes
       Distributed?: No

    9. Post & Harrison 2007, PROTEMPA (Post and Harrison 2007)
       Clinical environment: HELLP syndrome (pregnancy complication in 3rd trimester) from clinical laboratory data. Data set had 761 eligible cases.
       Frequency: Low frequency, irregular data.
       Multiple streams?: Diagnosis based on 3 clinical laboratory tests.
       Real-time data?: No, based on lab test results (retrospective data).
       Distributed?: No

    10. Bellazzi R et al. (Bellazzi 2005)
       Clinical environment: Assessment of the clinical performance of haemodialysis service. Data comes from 5800 dialysis sessions from 43 patients monitored over 19 months.
       Frequency: Not stated. Based on haemodialysis data automatically collected during dialysis sessions. Data collected for each patient 3x week for 4 hours.
       Multiple streams?: Data are sequences of multidimensional time series. Based on automatic measurement of 13 variables.
       Real-time data?: Yes
       Distributed?: Yes

    11. Charbonnier & Gentil 2007 (Charbonnier S 2007)
       Clinical environment: Intensive care unit data collected from 8 patients at time of weaning from mechanical ventilation & ending of sedative drug administration.
       Frequency: Data recorded every second.
       Multiple streams?: Yes. Article implies data is collected from range of variables monitored (obtained from monitoring devices), but does not list these devices.
       Real-time data?: Yes. Data acquisition was carried out in real-time without interference from daily care.
       Distributed?: No

    Table 2.1: Clinical Context

    Data Mining and Temporal Abstraction Technique:

    System/authors/year TA Technique (algorithm) or: Temporal processing

    Knowledge Base for TA defined by

    Supports complex abstractions? (TA)

    DM Technique (algorithm)

    1 (Verduijn, Sacchi et al. 2007) Verduijn et al. 2007

    Qualitative TA: concepts of state and trend are used for the abstraction (high-level description in terms of state and trend categories was derived for each time series over time intervals) Quantitative TA: derived from searching in a large space of numerical meta features.

    In one case, by the experts.

    Based only on basic states and trends. Not based on more complex abstractions, such as rate and acceleration.

    Class probability trees (CART) used as supervised learning algorithm

    2 (Tusch G. 2007) SPOT 2007

    TA: using Allen’s rules (goal is to make KBTA more readily available for clinical research through the ontology)

    Learned abstractions are submitted to the original database.

    Not discussed. Supports R statistical DM package (open source implementation of S)

    3 (Azulay R. 2007) Azulay et al. 2007

    Temporal discretization (pre-processing step to DM)

    Data driven

    Not discussed No DM performed, focus on pre-processing

    4 (Moskovitch 2007) Moskovitch et al.

    SAX: discretization method designed for time series data (uses an approximate distance function that lower bounds the Euclidean distance). Order of values of data taken into account only in preprocessing stage. Persist: univariate discretization method for KD in time series, explicitly considers the order of the variables in the time series. Assumes that any time series comes from uniform sampling (not the case in slow domains that are sampled infrequently and manually)

    Time series are discretized based on categories provided by an expert (this approach is compared to categories chosen from a data driven computational source)

    Unsupervised TSKM mining method which results in a set of phrases

    5 (Ho, Kawasaki et al. 2004) Ho et al.

    Goal: determine small number of typical abstraction patterns that can be used to characterize most real sequences. Approach: combination of TA primitive computing with human inspection using visual tools and expert opinions

    Create set of abstraction patterns viewed as a combination of TA primitives (so that each test sequence can be assigned to an abstraction pattern) Found 8 abstraction patterns for short-term changed tests and 22 for long term tests

    Applied various DM methods to extract knowledge: • C5.0 and association rules in system Clementine (SPSS) • Their rule induction method LUPC and decision tree induction CARBO implemented in their system D2MS (Data Mining with Model Selection)

    6 Abe and Yamaguchi (Abe 2005)

    Pattern extraction involves first extracting sub-sequences and then clustering (they call this time series pattern extraction – do not use term TA)

    Unclear Not discussed Developed a tool based on constructive meta-learning called CAMLET Taken PART implemented in Weka to evaluate improvement of pattern extraction algorithm.

    7 (Sacchi, Larizza et al. 2007) Sacchi et al. 2007

    Shahar’s KBTA

    User defined Starting from TA representation, they run an algorithm for the extraction of temporal rules expressing temporal relationships between the detected temporal patterns

    Algorithm implements a search strategy based on apriori technique where the quality of a rule is assessed in terms of confidence and support (described in (Bellazzi 2005) as temporal data mining)

    8 (Silvent, Dojat et al. 2005) Silvent et al. 2004

    Symbolic trends were computed for every parameter, with associated characteristic periods. Trend was computed using linear regression on a window whose size was determined according to the dynamics of the parameter under study and called ‘characteristic span’.

    Corresponds to the identification of the prior knowledge and/or extracted information necessary to drive the abstraction;

    Yes. Identified two patterns that differ according to 2 thresholds on time delay and variation means, which when applied to SpO2 can be used to recognize a desaturation.

    Searched for relations between complex abstractions. Identified 3 association rules (1 was evaluated by clinicians as correct). Seems like the association rule was based on only 1 parameter value

    9 (Post and Harrison 2007) Post & Harrison. 2007

    FW contains an algorithm source which describes how the temporal data should be processed. Low level mechanisms based on a sliding window. TAs accomplished via execution of R for Java (algorithms are written as functions in the R language)

    FW contains knowledge source which defines interval relationships that define abstractions of interest. Relationships are specified as min and max temporal distances between the endpoints of participating inte