University of Western Sydney School of Computing and Mathematics
A Framework for Temporal
Abstractive Multidimensional Data
Mining
Heidi Bjering Stratti
A dissertation submitted in fulfilment of the requirements of
Master of Science (Hons)
2008
-
Acknowledgements
There are many people who deserve acknowledgement for their role in this thesis
becoming a reality. Juggling research while working full time and being there for my
two children has been quite a difficult balancing act which I have not always
balanced very well. This last year has been a very difficult year for my family and
me, and I had many moments where I was unable to see how I would manage to
finish this research work.
First and foremost, the person who has had the most impact in this thesis becoming a
reality, despite being located in Canada, is my principal supervisor Dr Carolyn
McGregor. I could not have done this without her fantastic guidance, encouragement
and support. She got me through to the finish line during what has been one of my
toughest years. When I was feeling at my lowest and was completely discouraged,
she managed within an hour on Skype to get me going again. Never once a negative
word (even when I thoroughly deserved it). I would like to thank Carolyn for
providing me with academic guidance as well as friendship through these last couple
of years.
I also wish to thank my co-supervisor, Dr Mark Tracy, who came on board when
Carolyn relocated to Canada.
My fantastic sons, Robbie and Daniel, keep on impressing me with their take on life.
I wish to thank them for their support and putting up with me while I have been
doing this work.
I also would like to thank Darren for being by my side through this very challenging
year, for putting up with all my stress, and for the many dinners he has cooked for
me.
My friend Alphia has always been there ready to listen when I needed to talk. I thank
Alphia for her friendship, support and encouragement – and the many refreshing
walks we have had to clear our heads.
-
Statement of Authentication
The work presented in this thesis is, to the best of my knowledge and belief, original
except as acknowledged in the text. I hereby declare that I have not submitted this
material, either in full or in part, for a degree at this or any other institution.
…………………………………………………………
Heidi Bjering Stratti
-
Table of Contents
Acknowledgements
Statement of Authentication
Table of Contents
List of Tables
Abstract
1. Chapter 1 – Introduction
    1.1 Why is temporal abstractive data mining important to health and medicine
    1.2 Research motivation
    1.3 Research aims and objectives
    1.4 Contribution to knowledge
    1.5 Research method
    1.6 Thesis overview
2. Chapter 2 – Literature Review
    2.1 Introduction
    2.2 Knowledge Discovery in Data
    2.3 KDD and Intelligent Data Analysis
    2.4 Data Mining
    2.5 Application of Data Mining within Medicine
    2.6 Temporal Abstraction
    2.7 Data Mining and Temporal Abstraction
        2.7.1 Review Method
        2.7.2 Review Results
    2.8 Conclusions and impact on future research
3. Chapter 3 – The NICU environment
    3.1 The Australian Context
    3.2 Admittance to NICU
    3.3 Medical Devices and NICU monitoring
    3.4 Alerts
    3.5 Known Physiological Onset Predictors
    3.6 Similarities between NICU and other environments
4. Chapter 4 – Solution Manager Service and e-Baby Architecture
    4.1 e-Baby Architecture
    4.2 Solution Manager Service
    4.3 Conclusion and Implication on this Research
5. Chapter 5 – Design of methodology for multidimensional TA DM framework (TAMDDM)
    5.1 Multi Agent System
        5.1.1 Database Access Server
        5.1.2 Processing Agent
        5.1.3 Temporal Agent
        5.1.4 Relative Agent
        5.1.5 Functional Agent
        5.1.6 Rules Generating Agent
    5.2 Extended CRISP-DM Model
        5.2.1 Data Understanding
        5.2.2 Data Preparation
        5.2.3 Modeling
        5.2.4 Modelling: DM Rule-Set Generation and Select Significant Rule-Set
        5.2.5 Modelling: Formulate Null Hypothesis
        5.2.6 Modelling: Run Statistical Processes to test Null Hypothesis
        5.2.7 Evaluation: Load Accepted Rule-Sets into IDSS
    5.3 TAMDDM Framework Tasks
        5.3.1 Local Collection and Cleanup
        5.3.2 Temporal Abstractions, Simple & Complex, Multi-Stream
        5.3.3 Relative Alignment
        5.3.4 Exploratory Data Mining across multiple streams for multiple patients
        5.3.5 Confirmatory Data Mining with Null Hypothesis
        5.3.6 Hypothesis/Rule generation
    5.4 Data storage
        5.4.1 Clinical Data
        5.4.2 Diagnosis
        5.4.3 Patient Diagnoses
        5.4.4 Physiological Data
        5.4.5 Physiological Parameter
        5.4.6 Temporal Rules
        5.4.7 Temporal Data
        5.4.8 Relative Rule
        5.4.9 Relative Temporal Abstractions
        5.4.10 RuleBase Data
    5.5 Conclusion
6. Chapter 6 – Demonstration within the NICU context
    6.1 Processing Agent
        6.1.1 Mapping external source data to DBAS data structure
            6.1.1.1 Mapping external source table Baby to DBAS table Baby
            6.1.1.2 Mapping external source table Diagnosis to DBAS table BabyConditions
            6.1.1.3 Mapping external source table Condition to DBAS table Conditions
            6.1.1.4 Mapping external source flat file physiologicalData to DBAS table PhysData
        6.1.2 Mapping from DBAS data structure to internal TAMDDM data stores
            6.1.2.1 Mapping DBAS table Baby to TAMDDM table Patient
            6.1.2.2 Mapping DBAS table PhysData to TAMDDM table PatientPhysiologicalParameter
            6.1.2.3 Mapping DBAS table MFC to TAMDDM table PhysiologicalParameter
            6.1.2.4 Mapping DBAS table BabyCondition to TAMDDM table PatientDiagnosis
            6.1.2.5 Mapping DBAS table Conditions to TAMDDM table Diagnosis
    6.2 Temporal Agent
    6.3 Relative Agent
    6.4 Functional Agent
    6.5 Rules Generating Agent
    6.6 Future Research Application
    6.7 Issues associated with Case Study
    6.8 Impact of Demonstration
    6.9 Conclusion
7. Chapter 7 – Conclusion
    7.1 Summary
    7.2 Contributions
    7.3 Limitations and further research
    7.4 Impact
    7.5 Conclusion
References
-
List of Tables
Table 2.1: Clinical Context
Table 2.2: Data Mining and Temporal Abstraction Technique
Table 2.3: Clinical knowledge and Null Hypothesis testing
Table 3.1: NSW/ACT Hospital Classification of NICU Services
Table 3.2: Apgar score
Table 3.3: Component Monitoring System (CMS) data types
Table 6.1: Mapping of Baby table (source) to Baby table (DBAS)
Table 6.2: Mapping of Diagnosis table (source) to BabyConditions table (DBAS)
Table 6.3: Mapping of Condition table (source) to Conditions table (DBAS)
Table 6.4: Mapping of physiological data in flat file (source) to PhysData table (DBAS)
Table 6.5: Mapping of Baby table (DBAS) to Baby table (TAMDDM)
Table 6.6: Mapping of physiological data PhysData table (DBAS) to PatientPhysiologicalParameter table (TAMDDM)
Table 6.7: Mapping of MFC table (DBAS) to PhysiologicalParameter table (TAMDDM)
Table 6.8: Mapping of BabyCondition (DBAS) to PatientDiagnosis (TAMDDM)
Table 6.9: Mapping of Conditions table (DBAS) to Diagnosis table (TAMDDM)
Table 6.10: Raw SaO2 readings
Table 6.11: Abstractions created from all SaO2 readings in Table 6.10
Table 6.12: Blood pressure values
Table 6.13: Abstractions created from all blood pressure readings in Table 6.12
Table 6.14: Complex Abstractions
Table 6.15: Temporal Abstractions using absolute start and end times
Table 6.16: relative temporal abstractions of abstractions from Table 6.11, using diagnosis time of 20061116_13:29:02.001 as the event of interest for realignment of the patients data
-
List of Figures
Figure 1.1: Elements of constructive research
Figure 2.1: Overview of the steps constituting the KDD process (Fayyad, Piatetsky-Shapiro et al. 1996)
Figure 2.2: The phases of the CRISP-DM reference model (CRISP-DM 2000)
Figure 2.3: The six-step DMKD process model (Cios and Moore 2002)
Figure 2.4: Relationship between clinical research and clinical management
Figure 2.5: The Thirteen Possible Relationships (Allen 1983; Allen 1984)
Figure 3.1: NICU physiological monitor (Neonatology on the Web)
Figure 4.1: The e-Baby Architecture (McGregor, Heath et al. 2005)
Figure 4.2: Solution Manager Service
Figure 5.1: The TAMDDM Framework
Figure 5.2: Solution Manager Service (SMS)
Figure 5.3: Multi-agent framework (Foster 2008)
Figure 5.4: The extended multi agent system.
Figure 5.5: The multi agent layer in the TAMDDM framework
Figure 5.6: Processing Agent
Figure 5.7: Temporal Agent
Figure 5.8: Relative Agent
Figure 5.9: Functional Agent
Figure 5.10: Rules Generating Agent
Figure 5.11: Parallelism between CRISP-DM and the Scientific Method (Heath 2006)
Figure 5.12: CRISP-DM extended for Clinical Practice and Research as proposed by Heath's thesis (2006)
Figure 5.13: Extended CRISP-DM model in the TAMDDM framework
Figure 5.14: Data Understanding
Figure 5.15: Data Preparation
Figure 5.16: Modelling: DM Rule-set Generation and Select Significant Rule-set phases
Figure 5.17: Formulate Null Hypothesis phase
Figure 5.18: Run Statistical Processes to test Null Hypothesis
Figure 5.19: Load Accepted Rule-sets into IDSS phase
Figure 5.20: Local Collection and Cleanup
Figure 5.21: Temporal Abstractions
Figure 5.22: Relative Alignment
Figure 5.23: Data before relative alignment
Figure 5.24: Data after relative alignment
Figure 5.25: Exploratory Data Mining
Figure 5.26: Confirmatory Data Mining with Null Hypothesis
Figure 5.27: Hypothesis/Rule generation
Figure 5.28: TAMDDM data store
Figure 5.29: Patient Table
Figure 5.30: Diagnosis Table
Figure 5.31: PatientDiagnosis Table
Figure 5.32: PatientPhysiologicalParameter Table
Figure 5.33: PhysiologicalParameter Table
Figure 5.34: TA_Rule Table
Figure 5.35: TemporalAbstraction Table
Figure 5.36: Study Table
Figure 5.37: TA_RelativeTime Table
Figure 6.1: Result database structure passed from DBAS to agents in the multi-agent system (Foster 2008)
Figure 6.2: Structure of the source data regarding babies
Figure 6.3: Structure of the source physiological data
Figure 6.4: modified BabyCondtions table
Figure 6.5: Graphing of SaO2 values against a threshold of 90%
Figure 6.6: Graphing of blood pressure values against a threshold of 24mm/Hg
Figure 6.7: Complex Abstraction
Figure 6.8: realignment of abstracted parameters relative to diagnosis
-
Abstract
In the industrialised world, premature birth has been recognised as one of the most
significant perinatal health issues (Kramer, Platt et al. 1998). In Australia 8.1% of
babies are born before 37 weeks gestation (Laws, Abeywardana et al. 2007).
Premature babies often have prolonged stays in Neonatal Intensive Care Units
(NICUs) and can suffer from a number of different conditions during their stay.
Some of these conditions have been shown to exhibit certain variations in their
physiological parameters that can indicate the onset of such conditions before they can
be detected by other means. Medical monitoring equipment produces large volumes of
data, which makes manual analysis of this data impossible. Adding to the
complexity of the large datasets is the nature of physiological monitoring data – the
data is multidimensional, where it is not only changes in individual dimensions that
are significant, but sometimes simultaneous changes in several dimensions. As the
time-series produced by the monitoring equipment is temporal, there is a need for
clinical research frameworks that enable both the dimensionality and temporal
behaviour to be preserved during data mining.
The aim of this research is to extend previous research that proposed a framework to
support analysis and trend detection in historical data from Neonatal Intensive Care
Unit (NICU) patients. The extensions contribute to fundamental data mining
framework research through the integration of temporal abstraction and support of
null hypothesis testing within the data mining processes. The application of this new
data mining approach is the analysis of level shifts and trends in historical temporal
data and the cross-correlation of data mining findings across multiple data streams for
multiple neonatal intensive care patients in an attempt to discover new hypotheses
indicative of the onset of some condition. These hypotheses can then be evaluated
and defined as rules to be applied in the monitoring of neonates in real-time to enable
early detection of possible onset of conditions. This can assist in faster decision
making which in turn may avoid conditions developing into serious problems where
treatment may be futile.
This research employs a constructive research method. In this research, the problem
is the inability of current data mining frameworks to completely support clinical
research in multidimensional temporal data. This research has resulted in the design
of a temporal abstraction multidimensional data mining (TAMDDM) framework
suitable for clinical research in multidimensional temporal time series data. The
framework is demonstrated through a case study with neonatal intensive care
monitoring data.
-
1. Chapter 1 – Introduction
This thesis presents a framework to enable multi-dimensional data mining on time
series data that exhibits temporal behaviours. This research is demonstrated through
a case study utilising physiological time series data streams, together with other
clinical data streams collected from patients in Neonatal Intensive Care Units
(NICUs) for the detection of trends and patterns in multi-dimensional stored
physiological data. The purpose of detecting these trends and patterns is to recognise
indicators for the onset of some condition in the neonate, to enable the creation of
hypotheses that can be transformed into rules suitable for use in intelligent
monitoring systems.
The large volumes of medical data render traditional manual data
analysis inadequate (Lavrac 1999). Data mining is the process of analysing large
amounts of data to find new patterns and relationships within the data, by using
techniques such as statistics, machine learning and pattern recognition. Multi-dimensional
data is data that consists of more than one variable. In physiological data
streams, multi-dimensional data means that for one patient there are values for
several different data streams, for example arterial oxygen saturation AND blood
pressure. Each of these data streams separately represents a single dimension. When
combined together the data becomes multi-dimensional. Temporal abstraction (TA)
is a technique used to summarise time series data to a higher level while preserving
context and time, usually adding qualitative information such as states and trends to a
particular abstraction. Sections of data that satisfy a criterion, such as
high (or low), can be summarised with a start time and end time for when the criterion
holds, as well as a label (high or low) to describe the abstraction. This is particularly
relevant for a NICU setting, where it is usually trends or changes in the physiological
data over time, sometimes across multiple parameters, which are significant when
analysing and predicting patient conditions.
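The idea of a simple state abstraction can be illustrated with a minimal sketch. This is not the implementation used in this research: the names `Abstraction` and `abstract_states`, the integer timestamps, the sample readings and the two labels are assumptions made purely for illustration, and the threshold of 90 follows the SaO2 example graphed in Chapter 6.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Abstraction:
    """One abstracted interval (hypothetical structure, for illustration only)."""
    label: str  # qualitative state, here "low" or "normal"
    start: int  # timestamp of the first reading in the interval
    end: int    # timestamp of the last reading in the interval


def abstract_states(readings: List[Tuple[int, float]], threshold: float) -> List[Abstraction]:
    """Collapse time-ordered (time, value) readings into maximal labelled runs."""
    intervals: List[Abstraction] = []
    for t, value in readings:
        label = "low" if value < threshold else "normal"
        if intervals and intervals[-1].label == label:
            # Same state as the previous reading: extend the current interval.
            intervals[-1].end = t
        else:
            # State changed (or first reading): open a new interval.
            intervals.append(Abstraction(label, t, t))
    return intervals


# Assumed sample data: (time, SaO2 %) pairs, not taken from the case study.
readings = [(0, 95), (1, 94), (2, 89), (3, 88), (4, 92)]
result = abstract_states(readings, threshold=90)
for interval in result:
    print(interval)
```

Five raw readings are reduced to three intervals, each retaining its start time, end time and label, which is the sense in which the abstraction summarises the data while preserving time and context.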
The remainder of this chapter is structured as follows: first, temporal abstractive data
mining and its importance to health and medicine are discussed, before the motivation
for the research presented in this thesis is described. The discussion on motivation
for this research leads to the section outlining the research aims and objectives by
listing the research hypotheses for this thesis, before listing the contributions to
knowledge that have resulted from this work. The research method utilised for this
research is presented, before the overview of the content of this thesis concludes the
chapter.
1.1 Why is temporal abstractive data mining important to health and medicine?
The medical industry generates large amounts of medical data, of which very little is
used for extracting useful information (Hanson 2006). For example, modern
equipment in intensive care units (ICUs) produces vast amounts of data from monitors
connected to patients (Horn 2001). The situation is the same in NICUs where the
focus of this research lies. Babies in a NICU usually have a range of monitoring and
life support devices attached to them. These devices produce large amounts of data
which can be essential in deciding on treatment options. However, the amount of
data makes it very hard for the neonatologist to extract useful information. The data
is displayed on monitors, often in waveforms, and a static ‘picture’ is recorded by
staff at regular intervals. Between manual recordings there can be small changes
to the data which are never noted. These changes may be important in being able to
predict the onset of some condition. Sepsis, a common illness in neonates, has been
shown to exhibit changes in physiological data before the condition can be diagnosed
through blood cultures (Griffin and Moorman 2001; Griffin, O'Shea et al. 2003;
Griffin, O'Shea et al. 2004; Griffin, Lake et al. 2005; Griffin, Lake et al. 2007). This
indicates that subtle changes that may not be apparent through the current practice of
manual recording of the physiological data at regular intervals can be important in
detecting the onset of conditions in neonates. There could also be indicators which
are exhibited across multiple parameters and together indicate the onset of some condition.
This situation has created the need for systems aimed at clinical management to help
analyse the data produced by the monitoring and life support devices connected to
the babies. These systems look for certain trends or patterns that have previously
been defined, often by experts in the field. However, “human-defined rules risk
capturing the biases of one expert” (Lavrac, Keravnou et al. 2000); therefore, clinical
research on historical data is needed to find new, previously undiscovered trends and
patterns. Modern technology allows this data to be used as input in processes to
attempt to derive information and new knowledge, known as knowledge discovery in
data (KDD).
“The gap between data generation and data comprehension is widening in all
fields of human activity. In medicine, overcoming this gap is particularly
crucial since medical decision making needs to be supported by arguments
based on basic medical knowledge as well as knowledge, regularities and
trends extracted from data.” (Lavrac, Keravnou et al. 2000)
There is a need for frameworks to facilitate clinical research on stored historical
physiological patient monitoring data to enable the discovery of previously unknown
trends and patterns that may be indicative of the onset of some condition. Research in
this kind of data brings challenges: the volume of the data is massive; the data is
multidimensional, as for each patient there are several parameters which cannot be
considered in isolation from each other; and the data is temporal, which means that
individual data values do not necessarily provide much meaning on their own, but when
these values are considered in relation to their neighbouring values they can provide
meaning. Therefore each individual value must be considered in both time and
context. Opportunities exist for the exploration of data to determine the existence of
pre-onset behaviours for multiple conditions and diagnoses.
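The requirement that parameters not be considered in isolation can be illustrated with a small sketch that finds the periods during which abstracted intervals from two streams hold at the same time. It is an assumption-laden illustration only, not the method defined later in this thesis: the function name `overlaps`, the inclusive integer intervals and the sample values for the two streams are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive, in arbitrary time units


def overlaps(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the periods during which an interval from `a` and one from `b` both hold."""
    out: List[Interval] = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            # The shared period runs from the later start to the earlier end.
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start <= end:
                out.append((start, end))
    return sorted(out)


# Assumed abstractions for one patient: periods of low oxygen saturation
# and periods of low blood pressure (values invented for illustration).
low_sao2 = [(2, 6), (10, 12)]
low_bp = [(4, 8), (12, 15)]
both_low = overlaps(low_sao2, low_bp)
print(both_low)
```

Neither stream alone reveals when both parameters are low together; only by combining the two dimensions do the shared periods become visible.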
1.2 Research motivation
As will be shown in the literature review in chapter two, there is an absence of
flexible multidimensional approaches to data mining of time series data. Chapter two
also discusses the need for representation of temporal behaviour in this type of data
to enable this temporal behaviour to be preserved in the mining process, as
individual data values by themselves often do not have much meaning; however, when
considered over time and in context, meaning can be derived.
Current monitoring systems used in NICUs are not capable of monitoring cross-correlated
temporal rules in multiple data streams. Current research by Stacey and
McGregor (McGregor and Stacey 2007; Stacey, McGregor et al. 2007) in this area is
showing the possibilities of such systems becoming a reality. For these systems to be
effective, cross correlated rules for multiple parameters must be created for use by
the alarming component of the system. Currently such rules are created by domain
experts, however there is potential for yet undiscovered rules to be hiding in the vast
amounts of data produced by the monitoring equipment connected to the patient. To
enable creation of hypotheses that can be turned into such rules, there is need for a
framework that enables clinical research in this type of multi-dimensional temporal
data, where the rigour necessary for medical research, including null-hypothesis
testing, can be accommodated.
With a reduction in storage cost and a corresponding increase in the ability to collect
and store temporal data through real-time clinical monitoring, there comes the
opportunity to analyse collected data along time (Moskovitch 2007). This is
especially significant in clinical environments, where individual data elements in
clinical records may not be meaningful outside of a particular temporal context.
However, when considered over time and context, the values and their inter-
relationships can become significant. This is particularly true in acute care settings,
such as neonatal care, where it is usually trends or changes over time, sometimes
across multiple parameters, which are significant when predicting the onset of future
patient conditions. As an added complexity, patients who have the same condition
may have substantially different types and timing of observations, unlike retail data
which generally have comparable complements of data elements obtained at similar
times (Harrison Jr 2008).
1.3 Research aims and objectives
The issues of clinical research in physiological data together with a review of the
current functionality of data mining introduced in the two previous sections led to the
creation of 5 research hypotheses. The research hypotheses of this thesis are that:
1. A multidimensional data mining (MDDM) framework can be defined for
clinical research to discover trends and patterns indicative of the onset of
some condition.
2. The abovementioned framework will include methods for applying temporal
abstraction (TA) across multiple parameters for multiple patients to enable
mining of multi-dimensional temporal data.
3. The resulting temporal abstractive multidimensional data mining (TAMDDM)
framework can be applied in a neonatal context.
4. The TAMDDM framework can support null hypothesis testing.
5. The hypotheses generated by the framework can be used by a real-time event
stream processor analysing the current condition of babies in a Neonatal
Intensive Care Unit (NICU).
1.4 Contribution to knowledge
The areas of research contribution to knowledge resulting from this thesis are:
• Extensions to a multi agent framework previously designed for analysing time series data, to facilitate temporal abstraction and realignment of
these abstractions (as presented in chapter 5)
• Incorporation of the extended CRISP-DM model within the multi agent framework (as presented in chapter 5)
• Design of a framework to enable temporally abstractive multi dimensional data mining (as presented in chapter 5)
• Enhancement of the interaction between clinical research and clinical management by generating a framework for clinical research which can
produce hypotheses that will feed into intelligent monitoring systems used
in clinical management. The clinical research framework uses as input
data from the various monitoring equipment used in clinical management
processes (as presented in chapter 5 and chapter 6)
1.5 Research method
For this research a constructive research method is used. This is a research method
widely used in computer science disciplines, information systems, management
accounting and medical domains (Kasanen 1993; Curry 2000; Shaw 2001; Shapiro
2003). The key in constructive research is the development of a new construct in
response to an explicit problem (Lassenius, Soininen et al. 2001). The construct,
which can be a new model, software, theory, framework or algorithm, is tested for
usability and enables theoretical conclusions to be made. The aim is to take a real
world practical problem and produce a real world solution.
Lassenius et al (2001) describe 6 phases of the constructive research method. These
phases can be iterative and recursive:
1. Find a practically relevant problem
2. Obtain an understanding of the topic and the problem
3. Innovate, i.e., construct a solution idea
4. Demonstrate that the solution works
5. Show theoretical connections and research contribution
6. Examine the scope of applicability
For this research the practically relevant problem is the need to be able to use stored
clinical data across multiple parameters to identify trends and level shifts that can be
relevant in early recognition of problems for patients, in this case neonates.
Obtaining an understanding of the topic and the problem is achieved by reviewing
literature in the area of knowledge discovery in databases, data mining in medicine
and temporal abstraction. Designing a framework for the temporal abstraction and
data mining of multidimensional parameters completes the phase of constructing a
solution idea. Applying the framework tasks to examples from a neonatal intensive
care unit (NICU) demonstrates that the solution can be applied in a NICU setting.
Finally the thesis will show the theoretical connections and research contributions
made, as well as examine the scope of applicability in other areas.
[Figure 1.1 diagram, with the labels: CONSTRUCTION, problem solving; Practical relevance; Theory connection; Practical functioning; Theoretical contribution]
Figure 1.1: Elements of constructive research
The goal of constructive research, as described by Lassenius et al (2001), is to
produce “Innovative constructs, intended to solve
problems faced in the real world and, by that means, to make a contribution to the
theory of the discipline in which it is applied” (Lassenius, Soininen et al. 2001).
1.6 Thesis overview
Chapter 2 presents a literature review of the areas of influence for this thesis, mainly
knowledge discovery in databases (KDD), data mining and temporal abstraction. The
chapter explores these areas in their application to medical systems in particular, to
expose the open health informatics research areas that led to the formation of the
research hypotheses addressed by the techniques proposed in this research. In chapter
3, further background is provided by describing the neonatal intensive care unit
(NICU) environment which provides the setting for the Temporal Abstractive
MultiDimensional Data Mining (TAMDDM) framework designed in this thesis.
Before describing the design of the TAMDDM framework, chapter 4 gives an
introduction to the previous research that this research builds on: the e-Baby project
and the solution manager service (SMS), which contains the analytical processor
where the TAMDDM framework will reside. Chapter 5 describes the TAMDDM framework
components, including the multi-agent system and its extensions, the extended
CRISP-DM model and the TAMDDM tasks. Chapter 6 demonstrates
how the TAMDDM framework can be used for conducting clinical research within
the NICU domain. Chapter 7 provides the conclusion to the thesis, summarising the
contributions of this work and provides direction for future research.
2. Chapter 2 – Literature Review
2.1 Introduction
The main motivation for the research in this thesis is the identified gap that exists
between clinical management and clinical research (Foster, McGregor et al. 2005).
This research is particularly interested in the intensive care unit (ICU) environment
where observation of the patient’s condition is supported through the provision of
several physiological time series data streams via medical monitors such as heart rate
and mean blood pressure. There exists the possibility of discovering new knowledge
that may exist in patient data, which can indicate the onset of some condition such as
sepsis (a blood poisoning condition).
Patient data, in particular monitoring data, is inherently temporal. Based on the
research hypotheses presented in Chapter 1, this literature review chapter will mainly
focus on temporal abstraction and data mining, which are part of the overall
knowledge discovery in data (KDD) process. Data mining is the chosen discovery
technique and, because the data is temporal, a
technique is needed to preserve the temporal aspect of the data when it is mined.
Therefore temporal abstraction forms part of the preprocessing step.
The remainder of this chapter is structured as follows. First the area of Knowledge
Discovery in Data is introduced. The relationship between Knowledge Discovery in
Data and Intelligent Data Analysis is presented. The data mining component within
Knowledge Discovery in Data is defined and the issues exposed by other researchers
relating to the application of Data Mining within the domain of medicine are
discussed. The concept of temporal abstraction is introduced. Previous researchers’
use of temporal abstraction to support the preprocessing within data mining is
presented and analysed to uncover open research areas that led to the creation of the
research hypotheses for this thesis.
2.2 Knowledge Discovery in Data
Although the area of medical informatics is a relatively new research area,
the application of analytical techniques in medicine has a long history. In the
mid-eighteen hundreds, researchers collected data and looked for patterns in an attempt to
provide some causal understanding of the pandemic in London at the time (Brown
2008).
At present, the ability to gather and store data is steadily increasing, resulting in
“data rich times” (Brown 2008). Efficient techniques are needed to enable the
analysing and understanding of these resources. The possibility of the discovery of
hidden knowledge in this massive amount of data is driving the development of the
field of Knowledge Discovery in Databases (KDD) and data mining. Fayyad and
Stolorz (1997) define data mining as “the application of specific algorithms for
extracting patterns from data”. To enable knowledge to be derived from the data, a
larger process needs to envelop the data mining step. This process includes data
preparation and selection, cleaning and preprocessing of the data, incorporation of
any prior knowledge about the data, and interpretation of the mining
results. KDD refers to this overall process of discovering knowledge in data (Figure
2.1), in which data mining is one of the steps, using specialised tools for this task
(Holmes and Peek 2007). Holmes and Peek (2007) state that KDD is a “process
where the data are used for hypothesis generation”, and that it provides a framework to
avoid “fishing expeditions”, also called data dredging.
Figure 2.1: Overview of the steps constituting the KDD process (Fayyad, Piatetsky-Shapiro et al.
1996)
With the large amounts of data collected by various applications and processes,
manual data analysis and interpretation is not feasible, can be very subjective and is
impractical as data volumes grow exponentially (Fayyad, Piatetsky-Shapiro et al.
1996). As Fayyad et al (1996) point out, “the true value of such data lies in the
users’ ability to extract useful reports, spot interesting events and trends, support
decisions and policy based on statistical analysis and inference, and exploit the data
to achieve business, operational, or scientific goals”. These user goals were a driver
behind the development of the KDD research area in the late eighties and early
nineties. The term KDD was conceived in 1989 to “emphasize that ‘knowledge’ is
the end product of a data-driven discovery” (Fayyad and Stolorz 1997). Knowledge
Discovery in Databases (KDD) is sometimes also referred to as Knowledge
Discovery in Data (Heath 2006).
Fayyad et al (1996) define the KDD process as: “The nontrivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in
data.” Here, pattern refers to a subset of the data, or a model that is relevant to
that subset and should be valid for new data, and process indicates several
iterative and interactive steps: data preparation, search for patterns, knowledge
evaluation and refinement.
Fayyad et al (1996) define 9 steps in the KDD process:
1. Learning the application domain
2. Creating a target dataset
3. Data cleaning and preprocessing
4. Data reduction and projection
5. Choosing the function of data mining
6. Choosing the data mining algorithm(s)
7. Data mining
8. Interpretation
9. Using discovered knowledge
In their book Data mining: A knowledge discovery approach (Cios, Swiniarski et al.
2007), the authors stress that knowledge discovery focuses on the whole process,
including before and after modeling, rather than just the data mining modeling part.
They call this the Knowledge Discovery Process (KDP), “a process that seeks new
knowledge about an application domain” (Cios, Swiniarski et al. 2007). They cite
Fayyad’s 9 step KDD model above as the leading research model for KDD, and the
CRISP-DM (Cross-Industry Standard Process for Data Mining) as the leading
industrial model. The CRISP-DM model consists of six phases (Figure 2.2):
1. Business Understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment
Figure 2.2: The phases of the CRISP-DM reference model (CRISP-DM 2000)
As Figure 2.2 illustrates through the outer circle with clockwise arrows, the
CRISP-DM model is iterative. The model is described at four levels of abstraction, namely:
1. Phase
2. Generic tasks
3. Specialised tasks
4. Process instance
Generic tasks are tasks that should be done for any data mining circumstance. The
specialised tasks illustrate how the generic tasks should be done in a particular
situation, while the process instance level records what occurred in a particular
deployment of a particular phase of the process model (CRISP-DM 2000).
CRISP-DM was developed in 1996, and the goal was to be “industry-, tool- and
application-neutral” (CRISP-DM 2000). A step-by-step data mining guide is
available at http://www.crisp-dm.org. The guide provides instructions for each level
of task within each phase (CRISP-DM 2000).
Cios and Moore (2002) discuss the six step DMKD process model (Figure 2.3)
which is an extension to the CRISP-DM, and is described in a later book on data
mining and knowledge discovery (Cios, Swiniarski et al. 2007) as a hybrid model,
combining facets from the CRISP-DM model and academic models.
Figure 2.3: The six-step DMKD process model (Cios and Moore 2002)
Cios et al (2007) present a table comparing 5 KDD models. The
comparison demonstrates that although the number and names of steps in each model
vary, they all include understanding of the domain, preprocessing of the data, data
mining and evaluation. Data preparation is identified in all the models as the most
time consuming part of any KDD model.
KDD systems draw on research from a variety of fields, some of which are
databases, machine learning, pattern recognition, statistics, artificial intelligence and
reasoning, data visualisation and high performance computing (Fayyad, Piatetsky-
Shapiro et al. 1996). It is beyond the scope of this thesis to cover all the areas of
KDD. The main focus of this thesis is in the area of data preprocessing and data
mining, two of the steps in the overall process of KDD.
2.3 KDD and Intelligent Data Analysis
Some authors use the term Intelligent Data Analysis (IDA) when discussing the
KDD processes. According to Stacey and McGregor (2007), “KDD is primarily
concerned with learning new knowledge whereas IDA is directed toward application
of knowledge for data interpretation”. That is, IDA is utilising existing knowledge to
look for the existence of instances of that knowledge and take action for those
instances. For the research presented in this thesis, the primary interest is in
developing a framework for aiding the discovery of new knowledge, integrating the
benefits of temporal abstraction and data mining in this discovery process. However,
it is also a goal that this new knowledge can be utilised in the intelligent data analysis
of real-time streaming patient data. Any rules developed from hypotheses
created by the framework described in chapter 5 should therefore feed back into the IDA
process described by Stacey et al (McGregor and Stacey 2007; Stacey, McGregor et
al. 2007). Figure 2.4 depicts this relationship between clinical
research and clinical management.
[Figure 2.4 diagram, arranged under the headings Clinical Research and Clinical Management, with the labels: Monitoring devices; Physiological data; Data validation; Temporal abstraction; Inference engine; Alerts; Stored physiological & clinical data; Stored abstractions; Data mining; Hypotheses; Rule creation; Domain knowledge; Temporal Abstraction Rules; Inference Rules; Alert Rules]
Figure 2.4: Relationship between clinical research and clinical management
2.4 Data Mining
One of the common parts in the various KDD processes discussed in the previous
section is the area of data mining. Data mining is the activity used in KDD for the
actual discovery of patterns and relationships in data. “Data mining involves fitting
models to or determining patterns from observed data. The fitted models play the role
of inferred knowledge. Deciding whether or not the models reflect useful knowledge
is part of the overall interactive KDD process for which subjective human judgement
is usually required” (Fayyad, Piatetsky-Shapiro et al. 1996).
Data mining is used for many different problem types. The CRISP-DM data mining
guide (CRISP-DM 2000) offers the following list of problem types:
- Data description and summarisation
- Segmentation
- Concept descriptions
- Classification
- Prediction
- Dependency analysis
A variety of techniques are available to solve the different types of problems (CRISP-DM 2000):
- Clustering techniques
- Neural nets
- Visualization techniques
- Rule induction methods
- Conceptual clustering
- Discriminant analysis
- Decision tree learning
- K-nearest neighbor
- Case-based reasoning
- Genetic algorithms
- Regression analysis
- Regression trees
- Box-Jenkins methods
- Correlation analysis
- Association rules
- Bayesian networks
- Inductive logic programming
Several techniques can be suitable for a particular type of problem. For example,
discriminant analysis, rule induction methods, decision tree learning, neural nets,
k-nearest neighbor, case-based reasoning, and genetic algorithms could all be appropriate
techniques to use for a classification problem (CRISP-DM 2000). Similarly, a
particular technique can be suitable for more than one type of data mining problem.
Neural nets, for example, are a suitable technique for segmentation, classification and
prediction problems (CRISP-DM 2000).
Data mining techniques are often described as supervised or unsupervised learning
(Bath 2004). Supervised learning involves using a training set of the data that
includes the ‘answer’ to the classification, before using a test data set without the
‘answer’ to be classified. Unsupervised learning gives no information to the data
mining tool about the classification; the data mining technique used creates the
classifications based on the data it is exposed to.
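The distinction can be illustrated with a minimal Python sketch. It is an illustration only: the function names, the nearest-centroid and two-means techniques, and the heart rate values are assumptions chosen for brevity, not techniques or data used in this thesis. The supervised function learns from values that carry a known label, while the unsupervised function receives the same values without labels and forms its own two groups.

```python
from statistics import mean


def train_nearest_centroid(values, labels):
    """Supervised: learn one centroid per known label from labelled data."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(members) for label, members in groups.items()}


def classify(centroids, value):
    """Assign the label whose centroid lies closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled values into two clusters (1-D k-means, k=2).

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        a = [v for v in values if abs(v - low) <= abs(v - high)]
        b = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(a), mean(b)
    return sorted(a), sorted(b)


# Hypothetical heart rate values (beats per minute), invented for this example.
heart_rates = [110, 120, 125, 180, 190, 200]
answers = ["normal", "normal", "normal", "high", "high", "high"]

centroids = train_nearest_centroid(heart_rates, answers)
print(classify(centroids, 130))   # label predicted for an unseen value
print(two_means(heart_rates))     # groups found without any labels
```

In the supervised case the labels 'normal' and 'high' are supplied with the training data; in the unsupervised case the two groups are unnamed and their interpretation is left to the analyst.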
Cios and Moore (2002) consider data mining to be a superset of statistics, as data
mining draws not only on statistics but also on concepts from several other
disciplines, such as machine learning and database technology.
2.5 Application of Data Mining within Medicine
Heath (2006) argues that for KDD and data mining results to be accepted by
clinicians and the medical community, adaptations must be made to introduce more
rigour in the form of a scientific-method approach into the process, and to include
provisions for hypothesis creation and null hypothesis testing within the framework.
According to Heath (2006), clinicians are sceptical of DM results, largely because
current frameworks do not support the scientific method of null hypothesis testing.
The null hypothesis is usually formulated so that it can be shown to be incorrect, in
order to support an alternative hypothesis. In medical experiments the null
hypothesis typically states that there is no significant difference between
the compared groups. Null hypothesis testing is used when conducting clinical trials,
and Heath states that “the null hypothesis driven medical research paradigm must
inform DM investigative methods in the medical domain” (Heath 2006).
Heath (2006) proposes the extended CRISP-DM model as a solution to this issue.
This model uses exploratory data mining as a tool to find unknown patterns or
relationships and to create hypotheses. Confirmatory data mining is subsequently used
for null hypothesis testing. To enable the use of the extended CRISP-DM when
discovering new trends and patterns in neonatal physiological data, a challenge exists
to further extend this model to include provisions for Temporal Abstraction (TA) and
for use on multidimensional parameters.
In the extended CRISP-DM model exploratory data mining could be performed using
a technique such as association rule mining, where the data can indicate a particular
symbolic rule in the form of IF … THEN …(Bath 2004):
IF condition(s) THEN conclusion
Or, Condition(s) ⇒ Conclusion
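How strongly a data set supports a rule of this form is commonly summarised by its support (the fraction of all records in which both the conditions and the conclusion hold) and its confidence (the fraction of records satisfying the conditions in which the conclusion also holds). The Python sketch below computes both. The function name, the item labels and the five patient records are hypothetical and invented purely for illustration; they are not clinical findings and are not taken from the cited work.

```python
def rule_metrics(records, conditions, conclusion):
    """Return (support, confidence) for the rule: conditions => conclusion.

    Each record is a set of symbolic items, for example abstraction labels.
    """
    conditions = set(conditions)
    n_conditions = sum(1 for r in records if conditions <= r)
    n_both = sum(1 for r in records if conditions <= r and conclusion in r)
    support = n_both / len(records)
    confidence = n_both / n_conditions if n_conditions else 0.0
    return support, confidence


# Hypothetical records, one set of symbolic items per patient (invented data).
records = [
    {"HR_high", "BP_low", "condition_X"},
    {"HR_high", "BP_low", "condition_X"},
    {"HR_high", "BP_low"},
    {"HR_high"},
    {"BP_low"},
]

# IF HR_high AND BP_low THEN condition_X
support, confidence = rule_metrics(records, {"HR_high", "BP_low"}, "condition_X")
print(support, confidence)
```

For these invented records the rule holds in 2 of 5 records (support 0.4) and in 2 of the 3 records that satisfy the conditions (confidence of about 0.67).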
Heath (2006) points out that these rules must be carefully considered and
analysed by domain experts before being used for prediction, and suggests the use of
rule interestingness measures for this evaluation. Rule interestingness measures are an
active research area in the field of data mining and KDD (Ohsaki, Sato et al. 2002;
Ohsaki, Sato et al. 2004; Ohsaki, Kitaguchi et al. 2005; Ohsaki, Abe et al. 2007).
Once a hypothesis has been created using exploratory data mining, a null hypothesis
can be formulated and tested using confirmatory/predictive data mining techniques
(Heath 2006).
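As a generic illustration of null hypothesis testing, and not of the confirmatory data mining techniques of the extended CRISP-DM model, the Python sketch below runs an exact permutation test. The null hypothesis is that there is no difference between two groups, so that the group labels are interchangeable. The function name, the choice of a permutation test and the two groups of values are assumptions made for this example only.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Returns the fraction of all relabellings of the pooled values whose
    absolute difference in means is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    for indices in combinations(range(len(pooled)), len(group_a)):
        chosen = set(indices)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical measurements for two patient groups (invented data).
without_condition = [150, 155, 160, 158, 152]
with_condition = [170, 175, 172, 178, 174]

p_value = permutation_test(without_condition, with_condition)
print(p_value)
```

A small p-value indicates that a difference as large as the observed one would rarely arise if the null hypothesis were true, which is the basis for rejecting it in favour of the alternative hypothesis. Exhaustive enumeration is only practical for small groups.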
Predictive data mining was used by Goodwin and Maher (2000) to test hypotheses
about premature birth; however, there is no mention of testing a null hypothesis. To
aid in the understanding of the causes of premature birth, they compared the results
from 5 different modeling techniques: neural networks, logistic regression, CART,
and the purpose-built software PVRuleMiner and FactMiner. The purpose was to identify
patients at risk of giving birth prematurely and provide decision support for providers
of perinatal care.
Another issue with the use of data mining within medicine is that the mining is often
classed as a secondary use of the data; that is, the data was either not collected for
research purposes at all, or was collected for the purposes of
another clinical research study. This leads to issues of data ownership, protection of
patient privacy and confidentiality, and appropriate use of clinical information
(Harrison Jr 2008). “The ethical, legal, and social limitations on medical data mining
relate to privacy and security considerations, fear of lawsuits, and the need to balance
the expected benefits of research against any inconvenience or possible injury to the
patient.” (Cios and Moore 2002)
There are several other issues which make data mining of medical data unique (Cios
and Moore 2002; Harrison Jr 2008):
- voluminous heterogeneous data
- high dimensionality
- temporal patterns
- often incomplete or imprecise
- high level of noise
- missing values
- redundant, insignificant, or inconsistent data objects and/or values
- special ethical, legal, and social constraints apply to medical data
- unintuitive black box methods, like artificial neural networks, may be of less
interest (Cios and Moore 2002), as clinicians usually want to
understand how the result (model) was reached and prefer
transparent (white-box) symbolic methods
Applications of data mining in healthcare focus heavily on using predictive data
mining based on pre-defined patterns, and looking for repeating patterns in the data
stream(s) for one patient (Duchene, Garbay et al. 2007). When data mining is used
for prediction based on human-defined ‘interesting’ patterns, there is a chance of
introducing biases of the clinician/researcher working on the investigation who
determined the patterns of interest (Lavrac, Keravnou et al. 2000). Another approach
is to let the ‘data speak’, using exploratory data mining to find patterns and trends
which drive new hypotheses.
In the paper “Towards Role Based Hypothesis Evaluation for Health Data Mining”
(2006), Shillabeer and Roddick propose the use of new terminology in the area of
health data mining. They suggest that using the term rule for the results of data
mining in a health context is misleading. Instead they suggest the use of hypothesis
(Shillabeer and Roddick 2006). This thesis will use hypothesis as a description of the
outcome of the data mining performed.
2.6 Temporal Abstraction
Time series appear in many different domains (Roddick and Spiliopoulou 2002),
such as finance, meteorology and medicine. The properties of time series vary in
terms of noise, volume and dimensionality. When analysing time series data, the
individual data values by themselves often provide little meaning; however, when
considered over time and context the values can become meaningful. This is
especially important in NICU settings, where it is usually trends or changes over
time, sometimes across multiple parameters, which are significant when analysing
and predicting patient conditions. When analysing time series using data mining
techniques in domains such as medicine, it is important to preserve the concept of
time and context. Preserving the time and context can be achieved by applying
temporal abstraction to the raw time series data before mining.
Temporal abstraction (TA) is a technique used to summarise time series data to a
higher level while preserving context and time. TA converts a time series into time
interval series (Azulay, Moskovitch et al. 2007), and depending on the method used,
can add qualitative information such as states and trends to a particular abstraction.
TA can be simple or complex, and complex abstractions can be done across multiple
parameters (see example on p.115, Figure 6.7). Temporal abstraction can be used to
convert a time series into a range of symbols for further analysis.
In medicine temporal abstraction is used to convert low level raw numeric time series
data into a higher level qualitative description which better matches the language
used by medical professionals (Stacey and McGregor 2007). Time series data from a
monitoring device attached to a baby in the NICU can be converted from a stream of
numbers to intervals labelled for example high/low/normal (state) or
increasing/decreasing/steady (trend), based on parameters set by domain experts.
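The conversion from a stream of numbers to labelled intervals can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the thresholds (100 and 160), the trend tolerance (10) and the sample values are invented for the example and are not parameters set by domain experts or used elsewhere in this thesis.

```python
def abstract_states(samples, low, high):
    """Convert (time, value) samples into (start, end, state) intervals.

    Consecutive samples with the same state are merged into one interval.
    """
    def state(value):
        if value < low:
            return "low"
        if value > high:
            return "high"
        return "normal"

    intervals = []
    for time, value in samples:
        label = state(value)
        if intervals and intervals[-1][2] == label:
            intervals[-1] = (intervals[-1][0], time, label)  # extend interval
        else:
            intervals.append((time, time, label))            # start new interval
    return intervals


def abstract_trends(samples, tolerance):
    """Convert (time, value) samples into (start, end, trend) intervals."""
    intervals = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta > tolerance:
            label = "increasing"
        elif delta < -tolerance:
            label = "decreasing"
        else:
            label = "steady"
        if intervals and intervals[-1][2] == label:
            intervals[-1] = (intervals[-1][0], t1, label)
        else:
            intervals.append((t0, t1, label))
    return intervals


# Hypothetical (minute, heart rate) samples, invented for this example.
samples = [(0, 120), (1, 125), (2, 170), (3, 175), (4, 130)]

print(abstract_states(samples, low=100, high=160))
# [(0, 1, 'normal'), (2, 3, 'high'), (4, 4, 'normal')]
print(abstract_trends(samples, tolerance=10))
# [(0, 1, 'steady'), (1, 2, 'increasing'), (2, 3, 'steady'), (3, 4, 'decreasing')]
```

Each output tuple is a simple abstraction: a time interval together with a qualitative label, which is the form in which the data can then be related to abstractions from other parameters.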
Complex temporal abstractions can be created from simple abstractions from one or
more time series, relating these using Allen’s temporal relations. In his paper
“Maintaining Knowledge about Temporal Intervals”, Allen (1983) introduces temporal
relations as a way of representing temporal patterns in time intervals. He presents
thirteen relations (Figure 2.5) that can be used to depict the relationship between time
intervals. The relations are equal, before, meets, overlaps, during, starts, finishes and
the inverse of these (except the equal relationship). Allen describes these as “a basic
set of mutually exclusive primitive relations that can hold between temporal
intervals” (Allen 1984).
Figure 2.5: The Thirteen Possible Relationships (Allen 1983; Allen 1984)
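The thirteen relations can be determined by comparing the end points of two intervals, as the Python sketch below shows. The function name, the "-inverse" suffix used to name the six inverse relations, and the example intervals are conventions assumed for this illustration; they are not notation taken from Allen's papers.

```python
def allen_relation(a, b):
    """Return the Allen relation that holds between intervals a and b.

    Intervals are (start, end) pairs with start < end. The seven base
    relations are tested directly; if none applies, the relation is the
    inverse of the base relation that holds from b to a.
    """
    for start, end in (a, b):
        if start >= end:
            raise ValueError("each interval needs start < end")
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if a1 == b1 and a2 == b2:
        return "equal"
    if a1 == b1 and a2 < b2:
        return "starts"
    if a2 == b2 and a1 > b1:
        return "finishes"
    if a1 > b1 and a2 < b2:
        return "during"
    if a1 < b1 < a2 < b2:
        return "overlaps"
    return allen_relation(b, a) + "-inverse"


# Hypothetical simple abstractions from two parameters (invented intervals).
heart_rate_high = (10, 20)
blood_pressure_low = (15, 30)

print(allen_relation(heart_rate_high, blood_pressure_low))   # overlaps
print(allen_relation(blood_pressure_low, heart_rate_high))   # overlaps-inverse
```

A complex abstraction across two parameters can then be expressed as a pair of simple abstractions plus the relation between them, for example "heart rate high overlaps blood pressure low".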
Shahar’s pivotal work presented a framework for Knowledge Based Temporal
Abstraction (KBTA) which infers abstractions based on domain-specific knowledge
stored in a formal knowledge base (Shahar 1997).
Some healthcare systems use temporal abstraction to abstract to the level of
descriptions or guidelines. Abstracting to this level enables the matching of these
abstractions for guideline execution in clinical management. An example is
the system developed by Seyfang et al (2001) for optimising oxygen supply for
newborn infants. Their system, which is part of the Asgaard framework (Shahar,
Miksch et al. 1998; Seyfang and Miksch 2004) and uses the Asbru language (Seyfang
and Miksch 2004; Fuchsberger, Hunter et al. 2005), abstracts raw monitoring data
collected by NICU monitoring devices to the abstract concepts that are used in
therapeutic plans. The data enters the system as a stream and the high-level
abstractions derived from the raw data are compared to predefined conditions
described in the therapeutic plans.
RÉSUMÉ is a system which provides a “framework for deep knowledge
representation to perform temporal abstraction of patient data” (Stacey and
McGregor 2007). It uses CAPSUL, a temporal pattern representation language
(Chakravarty and Shahar 2000; Antunes and Oliveira 2001). RÉSUMÉ is used on
stored database data in which the abstracted parameters are recorded at low frequency. The Tzolkin
architecture uses RÉSUMÉ to create abstractions (Boaz and Shahar 2003). RASTA
(A System for Temporal Abstraction) (O'Connor, Grosso et al. 2001) adds
distributed capabilities to RÉSUMÉ to allow the system to be used for more complex
settings (Augusto 2005).
In the article “PROTEMPA: A Method for Specifying and Identifying Temporal
Sequences in Retrospective Data for Patient Selection” (Post and Harrison 2007), the
authors describe the PROTEMPA method and how it has been implemented. It has
a system for implementing temporal abstractions in stored time series data, both
lower level (simple) and higher level (complex) abstractions. These abstractions are
used for identifying pre-defined patterns in the time-series data. The system has the
potential to be used in patient monitoring and decision support where the patterns
being looked for are predefined; however, it is not used for discovering new patterns
and relationships in the abstracted data. The temporal abstraction part of this method
could be used in the pre-processing stage of data mining for clinical research,
however the method does not include tools for mining the resulting abstractions for
new patterns and relationships. As described in Post’s doctoral thesis (2006),
"PROTEMPA is a hypothesis-testing system that scans time series data for pre-
defined mathematical and temporal patterns of interest. This strategy is in contrast to
a data mining tool that seeks to identify all meaningful patterns in a data set" (Post
2006).
Boaz and Shahar (2003) discuss the need for a temporal-abstraction database
mediator to provide a useful method for “querying not only raw data, but also its
abstractions”. In the research presented in this thesis we need to data mine the
abstractions, rather than just query them. In their paper “Idan: A Distributed
Temporal-Abstraction Mediator for Medical Databases” (Boaz and Shahar 2003) the
temporal abstraction mediator IDAN is described. IDAN uses the generic temporal
abstraction system ALMA (Boaz and Shahar 2005) for its temporal reasoning task,
and ALMA uses KBTA/Temporal Abstraction Rule (TAR) language (Balaban, Boaz
et al. 2003; Boaz, Balaban et al. 2003) and CAPSUL. IDAN is used by multiple
applications; KNAVE-II (Boaz and Shahar 2005) and DeGeL are examples.
Most of the current temporal abstraction frameworks used in healthcare create
abstractions based on expert-defined rules. Verduijn et al. published a comparative
case study performed on intensive care monitoring data, extracting meta features to
be used for prediction (Verduijn, Sacchi et al. 2007). The study compares
abstractions created using domain knowledge and data driven abstractions, and found
that the data driven abstractions created more informative meta features. Open
research areas involve the exploration of data driven abstractions, as well as the
creation of frameworks integrating processes for temporal abstraction and data
mining that can be applied to any temporal abstraction and data mining task on time-
series data.
Recent research abstracts multidimensional time series data to produce alerts when
certain trends are detected (McGregor and Stacey 2007; Stacey, McGregor et al.
2007). Currently, the rules for detecting these trends are human-defined; however,
there may be as yet undiscovered trends and patterns that could indicate the onset of
some condition, found by analysing historical data. Opportunities exist to apply data
mining to temporally-abstracted cross correlated historical time series data of
previous NICU patients, to identify new patterns and trends that may be of
significance in the early identification of the onset of medical conditions in new
NICU patients. These trends and patterns can be used to create rules for clinical alert
systems within NICU monitoring equipment.
2.7 Data Mining and Temporal Abstraction
When dealing with time series data or temporal data, data mining is rarely
straightforward. Some pre-processing of the data usually needs to take place.
Antunes & Oliveira (2001) discuss some pre-processing approaches in their article
“Temporal Data Mining: an overview”, and stress that “the representation problem
is especially important when dealing with time series, since direct manipulation of
continuous, high-dimensional data in an efficient way is extremely difficult”. Several
approaches for dealing with time-series are presented. One of the possible solutions
suggested is to use a “transformation that maps the data to a more manageable
space”. The transformation would be included in the pre-processing of the data
before data mining occurs. The paper presents several possible ways of pre-
processing the data. The approach used in this research is to use temporal abstraction
to pre-process the data before mining the abstractions created. CAPSUL and SDL are
two languages mentioned in this article as suitable to perform abstraction on time-
series data.
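The idea of abstracting a time series before mining it can be sketched as follows. This is a minimal illustration only, not the implementation used in this research or in any of the cited systems: the names `Interval`, `to_state` and `abstract_states`, the sample values and the thresholds are all hypothetical and carry no clinical meaning. The sketch maps each raw sample to a qualitative state and merges consecutive samples that share a state into one interval, so that a data mining step would operate on a short list of intervals rather than on the raw stream.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Interval:
    start: int   # timestamp of the first sample in the interval
    end: int     # timestamp of the last sample in the interval
    state: str   # qualitative label: "LOW", "NORMAL" or "HIGH"


def to_state(value: float, low: float, high: float) -> str:
    """Map a raw value to a qualitative state using two cut points."""
    if value < low:
        return "LOW"
    if value > high:
        return "HIGH"
    return "NORMAL"


def abstract_states(samples: Sequence[Tuple[int, float]],
                    low: float, high: float) -> List[Interval]:
    """Merge consecutive samples with the same state into intervals."""
    intervals: List[Interval] = []
    for timestamp, value in samples:
        state = to_state(value, low, high)
        if intervals and intervals[-1].state == state:
            # Same state as the previous sample: extend the open interval.
            intervals[-1].end = timestamp
        else:
            # State changed (or first sample): start a new interval.
            intervals.append(Interval(timestamp, timestamp, state))
    return intervals


# Hypothetical (timestamp, value) samples and illustrative thresholds.
heart_rate = [(0, 142.0), (1, 145.0), (2, 171.0), (3, 175.0), (4, 150.0)]
intervals = abstract_states(heart_rate, low=100.0, high=160.0)
for interval in intervals:
    print(interval)
```

Here five raw samples reduce to three state intervals. Fixed cut points are a design choice made for brevity; they correspond to the expert-defined style of abstraction rather than a data-driven one.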
Temporal data mining is an important extension to data mining and is discussed in
the paper “A survey of temporal knowledge discovery paradigms and methods”
(Roddick and Spiliopoulou 2002). Many approaches to temporal data mining are
covered; however, the problem of providing a flexible environment to support many
temporal data mining studies on multidimensional data streams is not discussed. The
paper discusses some interesting systems that appear to partially address the areas of
interest to the research presented in this thesis. For example, the RX project uses
temporal data to discover causal relationships. Also of interest in this paper for our
research is the discussion on sequence mining and SDL (Shape Definition
Language).
Duchene et al (Duchene, Garbay et al. 2007) have developed a prototype system to be
used in the area of home health telemedicine. Although this is a different
environment from the intensive care unit setting in terms of data rates and types of
data, the system they developed is of interest to this research due to the way the data
is pre-processed using temporal abstraction before being mined. The prototype
system mines heterogeneous multivariate time-series data for a patient to
discover and learn usual patterns for that particular patient. The purpose of the
system is to be able to detect changes in the pattern profile, which can indicate a
problem for the patient at home. In this system the focus is on one patient at a time.
The research presented in this thesis will extend this concept to mining across
multiple parameters for multiple patients to discover trends that can be indicative of
the onset of some condition.
To further assess the current state of knowledge in the area of temporal abstraction
preprocessing for data mining, a review of the literature was completed.
2.7.1 Review Method
The literature review phase focused on reviewing papers related to temporal
abstraction in IDA, with particular emphasis on abstraction of
multivariate/multidimensional data; and particularly those papers combining research
in both temporal abstraction and data mining as applied to clinical data. The
integration of DM and TA is a relatively new research area; therefore the majority of
papers reviewed were published in the past decade, with many written in the past
several years.
During this review, searches were conducted in health informatics, medical/clinical,
computer science and engineering databases. Research papers were sourced from
Science Direct, PubMed, ACM, IEEE and the Web of Knowledge, as well as through
Google Scholar and citation searching and chaining.
Papers were located through a number of search keywords such as
“temporal abstraction”, “temporal data mining”, “time series data mining”, “temporal
abstraction + data mining”, “integration of data mining and temporal abstraction”,
“discovering temporal association rules”, “time series analysis”, as well as
combinations of these.
2.7.2 Review Results
Table 2.1, Clinical Context, contains characteristics of the clinical environment
where the systems in each paper have been applied, as well as the characteristics of
the data discussed. For each paper the table lists the frequency of the data, whether
a single parameter or several are used, and whether the data is real-time and/or distributed.
Some clinical environments such as ICU deal with high frequency, high volume
clinical and physiological data from monitoring equipment (Silvent, Dojat et al.
2004; Azulay, Moskovitch et al. 2007; Moskovitch, Stopel et al. 2007; Tusch G.
2007; Verduijn, Sacchi et al. 2007), whereas others may deal with low frequency
data such as test results over time (Ho, Kawasaki et al. 2004; Abe and Yamaguchi
2005; Post and Harrison 2007).
Two papers reviewed considered both high and low frequency data in a multi-stream
environment (Azulay, Moskovitch et al. 2007; Verduijn, Sacchi et al. 2007), and one
of these (Verduijn, Sacchi et al. 2007) also considered real-time data. Three of the
remaining papers dealt only with high frequency multi-stream data (Silvent, Dojat et
al. 2004; Charbonnier and Gentil 2007; Moskovitch, Stopel et al. 2007), and the
remaining papers utilised low frequency data (Abe and Yamaguchi 2005; Abe 2006;
Post and Harrison 2007; Sacchi, Larizza et al. 2007; Tusch G. 2007). Only Bellazzi
et al. (Bellazzi, Larizza et al. 2005) work with distributed data. Each of the
papers reviewed had a particular study as the motivation for the
data collection, and hence none of the papers created an environment for flexible
exploration to support different clinical research problems. The creation of a flexible
environment to support different clinical research problems is an open research area.
Table 2.2, Data Mining and Temporal Abstraction Technique, records the types of
abstractions that each study/system is using, such as qualitative abstractions (states,
level shifts), or quantitative abstractions using various discretization methods as a
preprocessing step to data mining (Moskovitch, Stopel et al. 2007). The knowledge
base for the abstractions is listed as well as whether or not the system supports
complex abstractions. The last column lists the data mining technique that has been
used for the particular study/system. A variety of techniques were used for the
temporal abstractions. A data driven approach to temporal abstraction was used by
Azulay and Moskovitch (Azulay, Moskovitch et al. 2007; Moskovitch, Stopel et al.
2007). Verduijn et al (Verduijn, Sacchi et al. 2007) utilised qualitative temporal
abstractions to create state and trend abstractions. Sacchi et al (Sacchi, Larizza et al.
2007) use Shahar’s (1997) knowledge-based approach, KBTA. Four of the papers
discussed creation of complex abstractions (Silvent, Dojat et al. 2004; Bellazzi,
Larizza et al. 2005; Charbonnier and Gentil 2007; Post and Harrison 2007). As has
been shown, many of the temporal abstraction approaches are not data driven. Where
data-driven temporal abstraction approaches are discussed, the discussion is limited to
the related clinical research problem; general frameworks suitable for use across a
variety of conditions are not discussed. Alignment for condition onset prediction is also
not discussed, and hence these represent open research areas.
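The distinction between knowledge-based and data-driven abstraction can be illustrated with a small sketch. It is an assumption-laden example rather than a reproduction of any reviewed method: it uses plain equal-frequency binning as a stand-in for data-driven discretization (it is not SAX or Persist), and the function names, readings and "expert" cut points are hypothetical. The same labelling routine is applied twice; only the source of the cut points changes.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_frequency_cuts(values: Sequence[float], bins: int) -> List[float]:
    """Data-driven cut points: boundaries chosen so that each bin holds
    roughly the same number of observations."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def discretize(values: Sequence[float], cuts: Sequence[float],
               labels: Sequence[str]) -> List[str]:
    """Label each value by the bin it falls into; a value equal to a
    cut point is placed in the upper bin."""
    return [labels[bisect_right(cuts, v)] for v in values]


labels = ["LOW", "MID", "HIGH"]

# Hypothetical readings for one parameter (no clinical meaning).
readings = [72.0, 60.0, 95.0, 65.0, 110.0, 70.0, 62.0, 90.0, 75.0]

# Knowledge-based: cut points supplied by a (hypothetical) domain expert.
expert_cuts = [80.0, 100.0]
expert_states = discretize(readings, expert_cuts, labels)

# Data-driven: cut points derived from the observed distribution.
data_cuts = equal_frequency_cuts(readings, bins=3)
data_states = discretize(readings, data_cuts, labels)

print("expert cuts:", expert_cuts, expert_states)
print("data cuts:  ", data_cuts, data_states)
```

With these readings the expert cut points place six of the nine values in the lowest state, while the data-driven cut points spread the values evenly across the three states. Any downstream mining step therefore sees a different symbol sequence depending on how the cut points were obtained.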
Table 2.3, Clinical knowledge and Null Hypothesis testing, records whether the
particular system has support for co-mining, indicating whether the system has
approaches for integrating data mining and clinical reasoning; whether there is support
for null hypothesis testing, as discussed earlier; and whether physicians are involved
in the temporal abstraction or data mining process. None of the papers
reviewed explicitly discusses null hypothesis testing; however,
one paper does discuss creating hypotheses to be evaluated (Abe and Yamaguchi
2005). As a result, the incorporation of null hypothesis testing within the data mining
framework is an open research area.
Clinical Context:
# System/authors/year
Clinical Environment
Frequency Multiple streams? Real-time data? Distributed?
1 (Verduijn, Sacchi et al. 2007) Verduijn et al. 2007
ICU data 924 patients
High frequency variables measured once a minute Low frequency variables measured several times a day
High freq: mean arterial BP, central venous pressure, heart rate, TMP, FiO2, PEEP Low freq: cardiac output, base excess, creatinine kinase B and glucose value.
Yes No
2 (Tusch G. 2007) SPOT 2007
Liver transplantation followed by ICU and clinical monitoring. Note: this is presented as a medical example, but in this paper the data is not used for processing.
Clinical data comes in different granularities: hourly, daily, monthly, yearly
Relationship of intervals rather than single parameter values establishes the clinical concept.
Processing is not performed.
No
3 (Azulay R. 2007) Azulay et al. 2007
ICU dataset of cardiac surgery patients (664 patients)
High frequency variables measured once a minute Low frequency variables measured once a day
2 types of data. 1. static data (age, gender, surgery type, >24 hrs ventilation). 2. temporal data with high frequency variables (6 streams) and low frequency variables (4 streams).
Processing is not performed on real-time data. Real-time data is collected for evaluating different discretization techniques.
No
4 (Moskovitch 2007) Moskovitch et al.
ICU monitoring data (664 patients)
Temporal data was measured each minute along the first 12 hours of hospitalization
2 types of data: 1. static data (age, gender, surgery type, >24 hrs ventilation). 2. temporal data with high frequency variables (6 streams) and low frequency variables (4 streams).
Processing is not performed on real-time data. Real-time data is collected for evaluating Morchen’s method as compared to Allen’s.
No
5 (Ho, Kawasaki et al. 2004) Ho et al.
Hepatitis laboratory data 771 patients
Not given, states that data is gathered at irregular intervals.
States that there are multiple variables – does not say how many
No – historical laboratory data. Considered ‘irregular temporal data’ since gathered from many lab tests over different periods spanning 20 years.
Before applying methods all related data is joined by the combined key of the patient ID and test date.
6 Abe and Yamaguchi (Abe 2005) (Abe 2006 )
Chronic hepatitis
Daily, weekly, monthly (+ randomized intervals based on patient’s statements)
Includes clinical blood and urine tests on chronic hepatitis B and C
No No
7 (Sacchi, Larizza et al. 2007) Sacchi et al. 2007
Predicting renal flares (acute episodes of illness) in lupus nephritis (a chronic autoimmune disease). Based on 228 patients (only 172 chosen)
Low frequency, based on parameters obtained from periodic (not regular) clinical monitoring.
Attempting to extract temporal rules to relate 4 parameters used for disease monitoring + one parameter for renal disease status. Average # of 9 measurements per patient with variable length of time series. Looked at variations in one or more variables.
No No
8 (Silvent, Dojat et al. 2005) Silvent et al. 2004
Weaning from mechanical ventilation – 8 patients
ECG, systemic arterial pressure and SpO2; airway flow and pressure signals. Signal acquisition at a sampling rate of 100 Hz, resampled at 1 Hz for temporal processing.
Yes, at each of the 3 stages of weaning (over a 4 hour period) physiological data is recorded and a medical assistant annotates all alarming situations. Data included physiological parameters and device settings.
Yes No
9 (Post and Harrison 2007) Post & Harrison 2007 PROTEMPA
HELLP syndrome (pregnancy complication in 3rd trimester) from clinical laboratory data. Data set had 761 eligible cases
Low frequency, irregular data.
Diagnosis based on 3 clinical laboratory tests
No, based on lab test results. (retrospective data)
No
10 (Bellazzi 2005) Bellazzi R et al.
Assessment of the clinical performance of haemodialysis service. Data comes from 5800 dialysis sessions from 43 patients monitored over 19 months.
Not stated. Based on haemodialysis data automatically collected during dialysis sessions. Data collected for each patient 3x week for 4 hours. Data are sequences of multidimensional time series. Based on automatic measurement of 13 variables
Yes Yes
11 (Charbonnier S 2007) Charbonnier & Gentil 2007
Intensive care unit data collected from 8 patients collected at time of weaning from mechanical ventilation & ending of sedative drug administration.
Data recorded every second.
Yes. Article implies data is collected from range of variables monitored (obtained from monitoring devices), but does not list these devices.
Yes. Data acquisition was carried out in real-time without interference from daily care.
No.
Table 2.1: Clinical Context
Data Mining and Temporal Abstraction Technique:
System/authors/year TA Technique (algorithm) or: Temporal processing
Knowledge Base for TA defined by
Supports complex abstractions? (TA)
DM Technique (algorithm)
1 (Verduijn, Sacchi et al. 2007) Verduijn et al. 2007
Qualitative TA: concepts of state and trend are used for the abstraction (a high-level description in terms of state and trend categories was derived for each time series over time intervals). Quantitative TA: derived from searching in a large space of numerical meta features.
In one case, by the experts.
Based only on basic states and trends. Not based on more complex abstractions, such as rate and acceleration.
Class probability trees (CART) used as supervised learning algorithm
2 (Tusch G. 2007) SPOT 2007
TA: using Allen’s rules (goal is to make KBTA more readily available for clinical research through the ontology)
Learned abstractions are submitted to the original database.
Not discussed.
Supports R statistical DM package (open source implementation of S)
3 (Azulay R. 2007) Azulay et al. 2007
Temporal discretization (pre-processing step to DM)
Data driven
Not discussed
No DM performed; focus on pre-processing
4 (Moskovitch 2007) Moskovitch et al.
SAX: discretization method designed for time series data (uses an approximate distance function that lower bounds the Euclidean distance). Order of values of data taken into account only in preprocessing stage. Persist: univariate discretization method for KD in time series, explicitly considers the order of the variables in the time series. Assumes that any time series comes from uniform sampling (not the case in slow domains that are sampled infrequently and manually)
Time series are discretized based on categories provided by an expert (this approach is compared to categories chosen from a data driven computational source)
Unsupervised TSKM mining method which results in a set of phrases
5 (Ho, Kawasaki et al. 2004) Ho et al.
Goal: determine small number of typical abstraction patterns that can be used to characterize most real sequences. Approach: combination of TA primitive computing with human inspection using visual tools and expert opinions
Create set of abstraction patterns viewed as a combination of TA primitives (so that each test sequence can be assigned to an abstraction pattern) Found 8 abstraction patterns for short-term changed tests and 22 for long term tests
Applied various DM methods to extract knowledge: C5.0 and association rules in the system Clementine (SPSS); their rule induction method LUPC and decision tree induction CARBO, implemented in their system D2MS (Data Mining with Model Selection)
6 Abe and Yamaguchi (Abe 2005)
Pattern extraction involves first extracting sub-sequences and then clustering (they call this time series pattern extraction – do not use term TA)
Unclear
Not discussed
Developed a tool based on constructive meta-learning called CAMLET. Used PART, as implemented in Weka, to evaluate improvement of the pattern extraction algorithm.
7 (Sacchi, Larizza et al. 2007) Sacchi et al. 2007
Shahar’s KBTA
User defined
Starting from TA representation, they run an algorithm for the extraction of temporal rules expressing temporal relationships between the detected temporal patterns
Algorithm implements a search strategy based on apriori technique where the quality of a rule is assessed in terms of confidence and support (described in (Bellazzi 2005) as temporal data mining)
8 (Silvent, Dojat et al. 2005) Silvent et al. 2004
Symbolic trends were computed for every parameter, with associated characteristic periods. Trend was computed using linear regression on a window whose size was determined according to the dynamics of the parameter under study and called ‘characteristic span’.
Corresponds to the identification of the prior knowledge and/or extracted information necessary to drive the abstraction;
Yes. Identified two patterns that differ according to 2 thresholds on time delay and variation means, which when applied to SpO2 can be used to recognize a desaturation.
Searched for relations between complex abstractions. Identified 3 association rules (1 was evaluated by clinicians as correct). Seems like the association rule was based on only 1 parameter value
9 (Post and Harrison 2007) Post & Harrison. 2007
FW contains an algorithm source which describes how the temporal data should be processed. Low level mechanisms based on a sliding window. TAs accomplished via execution of R for Java (algorithms are written as functions in the R language)
FW contains knowledge source which defines interval relationships that define abstractions of interest. Relationships are specified as min and max temporal distances between the endpoints of participating inte