Predicting Emergency Department Revisits using Machine Learning and Natural Language Processing
Ruben Belo Matias Heitor Mendes
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisors: Prof. Susana Margarida da Silva Vieira
Prof. Joao Paulo Baptista de Carvalho
Examination Committee
Chairperson: Prof. Paulo Jorge Coelho Ramalho Oliveira
Supervisor: Prof. Susana Margarida da Silva Vieira
Member of the Committee: Prof. Ricardo Daniel Santos Faro Marques Ribeiro
June 2019
Acknowledgments
First, I would like to thank Professor Susana Vieira for her attentive advice in crucial situations.
To Eng. Marta Fernandes, who guided me through the dissertation by sharing her experience, by
being available to discuss ideas and clarify doubts, and by patiently listening to my complaints.
I would also like to express my gratitude to Hospital Beatriz Angelo for providing the data, making the
realization of this work possible.
Finally, a warm thank you to my family, who are the foundation of my education and values, for
supporting me unconditionally, and to my friends, for their constant support and honesty and for
motivating me in the most difficult moments.
Abstract
The rate of Emergency Department (ED) revisits is often considered a measure of the quality
of care. The aim of this work is to develop a predictive model that identifies the risk of adult patients’
ED revisits within 72 hours after discharge. The study data is from Hospital Beatriz Angelo and covers
511,301 patients from 2012 to 2016, with an average revisit rate of 5.7%. The data consists of patient
demographics, vital signs, chief main complaints, and other information available at the time of triage.
For text preprocessing, a Natural Language Processing framework is developed. For data modelling,
Logistic Regression (LR), Support Vector Machine, Multinomial Naive Bayes, and Complement Naive
Bayes techniques are considered. During model development, 10-fold cross-validation is used alongside
a cost-sensitive learning approach. The predictive power of the model is measured by the c-statistic.
Five hypotheses regarding features are made. The first hypothesis considers baseline variables,
the second hypothesis considers all numerical variables, the third hypothesis uses the chief main com-
plaints, the fourth hypothesis uses variables of the first and third hypotheses, and the fifth hypothesis
considers variables of the second and third hypotheses.
The best predictive model achieves a c-statistic of 0.842 (95% CI: 0.838–0.846) under the fifth
hypothesis using the LR technique. The proposed solution shows that favourable predictive performance
can be achieved, indicating a promising way to develop clinical decision support systems to predict
patient ED revisits within 72 hours after discharge.
Keywords: Natural Language Processing, Machine Learning, Emergency Department, Triage, Prediction,
Revisits.
Resumo
The rate of readmission to the Emergency Department (ED) is a measure of the quality of
healthcare. The aim of this work is to develop a predictive model capable of identifying the risk of
adult patients’ readmission to the ED within 72 hours after discharge. The data come from Hospital Beatriz
Angelo and cover 511,301 patients from 2012 to 2016, with an average of 5.7% readmissions. The
data consist of demographic information, vital signs, main complaints, and other information
available at the time of triage.
For text preprocessing, a natural language processing framework is developed. For data modelling,
the following strategies were considered: Logistic Regression; Support Vector Machine; Multinomial
Naive Bayes; Complement Naive Bayes. For model development, 10-fold cross-validation is used
together with a cost-sensitive learning approach. The predictive power of the model is measured using
the c-statistic.
Five hypotheses regarding the variables are created. The first hypothesis considers baseline
variables, the second hypothesis considers all numerical variables, the third hypothesis uses the main
complaints, the fourth hypothesis uses variables of the first and third hypotheses, and the fifth hypothesis
considers variables of the second and third hypotheses.
The best predictive model achieves a c-statistic of 0.842 (95% CI: 0.838–0.846), under the fifth
hypothesis, using logistic regression. The proposed solution shows that favourable predictive performance
can be achieved, indicating a promising path for developing clinical decision support systems
to predict ED readmissions within 72 hours after discharge.
Keywords: Natural Language Processing, Machine Learning, Emergency Department,
Triage, Prediction, Readmission
Contents
Acknowledgments
Abstract
Resumo
List of Tables
List of Figures
List of Acronyms
1 Introduction
1.1 Motivation
1.2 Predictive Models for Emergency Departments
1.3 Objectives and Contributions
1.4 Thesis Outline
2 Machine Learning
2.1 Classification Algorithms
2.1.1 Logistic Regression
2.1.2 Naive Bayes
2.1.3 Support Vector Machines
2.2 Learnability
2.3 Overfitting and Regularization
2.4 Model Assessment
2.4.1 Performance Metrics
2.4.2 Bootstrap Confidence Interval
2.4.3 McNemar’s Hypothesis Test
3 Natural Language Processing
3.1 Regular Expressions
3.2 Tokenization
3.3 Word Similarity
3.4 Text Normalization
3.5 N-grams
3.6 Feature Extraction
3.6.1 Bag of Words
3.6.2 Term Frequency-Inverse Document Frequency
4 Methodology
4.1 Data Preprocessing
4.1.1 Text Normalization
4.1.2 Tokenization
4.1.3 Word Correction
4.1.4 Word Removal
4.2 Data Modeling
4.3 Model Evaluation
5 Results
5.1 Database Description
5.2 Model Comparison
6 Conclusion
6.1 Summary
6.2 Final Remarks
6.3 Future Work
Bibliography
A Results
A.1 Baseline Results
A.2 All Numerical Results
A.3 Textual Results
A.4 Textual and Baseline Results
List of Tables
2.1 Dummy example for the development of a Contingency Table.
2.2 Contingency table derived from the dummy example.
3.1 Regular Expression (RE) metacharacters description.
3.2 Dummy Example for Regular Expressions.
3.3 Special Sequences description.
3.4 Bag of Words representation for the dummy example.
3.5 Term Frequency of each unigram to its corresponding sentence for the dummy example.
3.6 Inverse Document Frequency (idf) of each unigram for the dummy example.
3.7 Term Frequency - Inverse Document Frequency of each unigram to its corresponding sentence for the dummy example.
4.1 Regular Expressions (RE) used to capture different date components when date is given by a day, a month and a year.
4.2 Regular Expressions (RE) used to capture different incomplete date components when date is given by a month and a year.
4.3 Regular Expressions (RE) used to capture different incomplete date components when date is given by a day and a month.
4.4 Outputted normalized date format for each normalization step.
4.5 Regular Expressions (RE) used to match temporal references given by months, weeks, days, hours, minutes and seconds.
4.6 Partition of the day into six hours day periods.
4.7 Hyperparameters for each learning strategy.
5.1 Predictor variables and outcome used for modelling Hospital Beatriz Angelo (HBA) ED data.
5.2 The uncertainty associated with the candidate models under the All Numeric Hypothesis.
5.3 The uncertainty associated with the candidate models under the Textual Hypothesis.
5.4 The uncertainty associated with the candidate models under the Textual and Baseline Hypothesis.
5.5 Comparison of the uncertainty between candidate models with similar error proportions under the Textual and Baseline Hypothesis.
5.6 The uncertainty associated with the candidate models under the Textual and All Numeric Hypothesis.
5.7 McNemar’s Hypothesis Test results with respect to pairs of machine learning models that failed to reject the null hypothesis given a significance level of 5% for the 5 predictor sets.
5.8 Results for the machine learning models in test using the main chief complaint and all of the numerical variables with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].
6.1 Best results for each hypothesis.
A.1 Results for the machine learning models in test using baseline numerical features with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].
A.2 Results for the machine learning models in test using all of the numerical features with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].
A.3 Results for the machine learning models in test using only the main chief complaint with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].
A.4 Results for the machine learning models in test using the main chief complaint and baseline variables with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].
List of Figures
2.1 Different types of machine learning techniques.
2.2 Comparison between the hard-margin SVM and the soft-margin SVM for a toy set.
2.3 Empirical illustration of the fitting-stability tradeoff.
2.4 AUC-ROC Curve for a dummy example using a LR model.
3.1 Partition of a general language into specialized languages.
3.2 Flowchart for the RLSP Stemmer. Adapted from [93].
4.1 Overview of the Methodology steps.
4.2 Overview of the Natural Language Processing framework used to improve the quality of the main chief complaint data.
4.3 Normalization sequence steps when dealing with dates.
4.4 Representation of the used pipelines during the hyper-parameter optimization step with respect to the used predictors.
4.5 Representation of the k-fold cross validation approach.
4.6 Representation of the used bootstrapping strategies during validation and testing steps.
5.1 Flow diagram outlining the inclusion and exclusion criteria.
5.2 Univariate Analysis of age groups and gender with respect to the outcome. Age Groups according to [103–105].
5.3 Pearson Correlation between features, described in table 5.1, and the outcome.
5.4 Comparison between the top 20 most frequent N-Grams (N = 1, 2, 3), before stemming, for each label.
5.5 Comparison between the performance of the Machine Learning models according to each of the hypotheses.
List of Acronyms
AI Artificial Intelligence
AUC Area Under Curve
BoW Bag of Words
CD Coordinate Descent
CNB Complement Naive Bayes
ED Emergency Department
FN False Negative
FP False Positive
FPR False Positive Rate
LASSO Least Absolute Shrinkage and Selection Operator
LR Logistic Regression
ML Machine Learning
MNB Multinomial Naive Bayes
NB Naive Bayes
NLP Natural Language Processing
RE Regular Expression
RLM Regularized Loss Minimization
ROC Receiver Operating Characteristic
SGD Stochastic Gradient Descent
SVM Support Vector Machine
tf-idf Term Frequency-Inverse Document Frequency
TN True Negative
TP True Positive
TPR True Positive Rate
Chapter 1
Introduction
Nowadays, in the era of Big Data, the amount of digitalized data that is being produced and stored
is growing exponentially [1]. Despite this production rate, Big Data has not been used to its full
potential in the field of health informatics, mainly because privacy policies restrict the sharing of
private patient data, which makes large clinical datasets difficult for researchers to obtain [2].
Notwithstanding the privacy issues, Big Data has begun to assume a significant role in supporting
healthcare research. It can be exploited for patient care by means of risk stratification and by
assisting the decision on which medical treatment to administer, which may reduce costs for
healthcare facilities and improve patient outcomes [3].
The widespread adoption of electronic health records has generated a massive quantity of data.
These datasets are heterogeneous, since they contain quantitative (laboratory results),
qualitative (textual notes and demographics), and transactional (billing records) data. Even though this
data is very rich in information, it is estimated that 80% of the data present in electronic health
records is unstructured [4]. Despite the large proportion of text-based data, it is common for this type
of data to remain untapped after being created. This unfortunate practice is mainly due to the difficulty
of handling unstructured data, which leads most healthcare facilities to ignore or abandon it [5].
Recent advances in Machine Learning (ML) have provided a notable impetus for managing
these datasets and, contrary to traditional statistical methods, have proven useful for analyzing
unstructured data, such as the chief main complaint [6].
In the biomedical domain, textual data can be partitioned into two narratives: biomedical and clinical
text. The former appears in books, articles, abstracts, and other meticulously edited textual sources that
are written to convey research results. The latter is written by healthcare providers to document
clinical events and to ease communication among healthcare providers by describing
a patient’s status by means of concepts. One way to represent clinical knowledge from clinical textual
sources is Natural Language Processing (NLP), which encodes the clinical information
present in the textual documents. Nevertheless, clinical text presents various challenges for NLP.
These textual data sources are not meticulously edited like those of the biomedical narrative.
The lack of rigor results in short, ungrammatical documents without a standard terminology that are
overloaded with abbreviations, locally developed terms, and other shorthand, all of which
are used intentionally to ease communication among healthcare providers [7].
1.1 Motivation
Healthcare facilities are complex and stressful environments, with great demand, where decisions
need to be taken quickly. EDs are responsible for providing emergency medicine and
surgical care to patients arriving at the hospital in need of immediate care. Because patient
attendance is unplanned, with patients arriving either by their own means or by ambulance, the ED must
provide initial treatment for a wide range of illnesses and injuries, some of which require immediate
treatment. In order to differentiate patients according to severity, patients are subject to triage when they
attend the ED. Commonly, the ED operates 24 hours a day, and the staffing level of healthcare providers
may vary with the time of day or the season, in an attempt to provide the best possible care
according to patient volume.
Presently, ED services face a high and increasing demand and are subject to patient revisits.
This situation leads to overcrowding within the ED, which may increase patients’ length of stay
and result in stressful situations. Revisits are often considered a measure of healthcare quality:
they consume medical resources and may indicate patient safety issues. By predicting
the risk that a patient will visit the ED again within 72 hours after being discharged,
clinical interventions may be taken to prevent revisits, since knowing that a
patient belongs to a risk group may prompt healthcare providers to reevaluate the patient’s status
before discharge from the ED.
Several studies have focused on detecting risk factors that may lead to an ED revisit, including
diagnosis, main chief complaint discriminators, and patient demographics [8]. In [9], the abdominal pain
and fever main chief complaint discriminators were the most common causes of patient revisits. In [10]
and [11], the main reason for revisiting the ED was misdiagnosis, and [10] also concluded that
fatigued healthcare providers provided inadequate medical care. The authors of [12] focused on
diagnostic predictors and concluded that dehydration, septicemia, abdominal pain, seizure, asthma,
urinary tract infection, and pneumonia were the most useful initial diagnosis predictors of an ED revisit.
Despite the number of studies dedicated to identifying risk factors related to ED revisits, very few
studies have addressed their prediction.
Since ED revisits cause unnecessary overcrowding in the facility, and crowding within the ED has
significant negative consequences for both patients and healthcare facilities [13–15], being able
to predict ED revisits helps to decrease their rate. This results in: better management
of ED resources; prevention of overcrowding; reduced medical care costs; greater patient satisfaction
with the provided service; better fatigue management for healthcare providers (less stressed and
exhausted staff); and enhanced quality of emergency care.
Most computer-based algorithms in healthcare are rule-based expert systems that encode knowledge
of a particular field and whose goal is to provide healthcare providers with conclusions about
a specific clinical scenario. These systems are based on the principles of medicine. One example in
healthcare is the LACE index, which provides the 30-day readmission risk of a patient. The computation
of the LACE index is fairly simple: it is based on length of stay, acuity of the admission, patient
comorbidities, and ED visits within the last six months [16]. Similar tools exist, such as the Sequential
Organ Failure Assessment score, known as the SOFA score, which is computed from the partial pressure
of oxygen, fraction of inspired oxygen, platelets, Glasgow Coma Scale, bilirubin, mean arterial pressure,
and creatinine values, together with whether the patient is under mechanical ventilation. This system
predicts the clinical outcomes of critically ill patients and is used to track a patient’s status during the
stay in an intensive care unit in order to determine the patient’s rate of organ failure [17].
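As an illustration of how simple such a rule-based score is, the following minimal sketch computes a LACE-style score from the four components named above. The point values are those commonly reported for the published index, but they are an assumption here and should be checked against [16]; the function name `lace_score` and its parameter names are hypothetical.

```python
def lace_score(length_of_stay_days, acute_admission, charlson_index, ed_visits_6m):
    """Sum the four LACE components (point values assumed, verify against [16])."""
    # L: length of stay in whole days.
    if length_of_stay_days < 1:
        l_points = 0
    elif length_of_stay_days <= 3:
        l_points = length_of_stay_days
    elif length_of_stay_days <= 6:
        l_points = 4
    elif length_of_stay_days <= 13:
        l_points = 5
    else:
        l_points = 7
    # A: acuity of the admission (acute/emergent or not).
    a_points = 3 if acute_admission else 0
    # C: comorbidity burden, here a Charlson comorbidity index.
    c_points = charlson_index if charlson_index <= 3 else 5
    # E: ED visits in the previous six months, capped at 4.
    e_points = min(ed_visits_6m, 4)
    return l_points + a_points + c_points + e_points


# Hypothetical patient: 5-day acute stay, Charlson index 2, one prior ED visit.
print(lace_score(5, True, 2, 1))
```

Note that the first argument is only known once the stay has ended, so a score of this form cannot be computed at triage.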
Given the existence of such expert systems, one may argue that the development of ML models
is unnecessary. However, the LACE index was developed using a 2004 to 2008 cohort from Ontario,
Canada. To use this scoring system successfully, one should know whether the patient
population being studied closely matches the Ontario cohort. Another issue with the
LACE score is its computation requirements: LACE requires the patient’s length of stay, meaning that
the score is only applicable after patient discharge. This is where ML excels, since it learns from
patient observations the important relationships that reliably predict outcomes. ML models are
therefore tailored to the provided data, and the designer controls which data is used when developing
risk prediction models. For example, data available upon patient admission, rather than at discharge,
can be used. ML can also handle a huge number of predictor variables, allowing the use of new kinds
of data.
Recently, the Institute of Medicine conducted a study and concluded that the frequency of diagnostic
errors is alarmingly high and that interventions to reduce such errors are minimal [18]. With the growth
of electronic health records, ML algorithms are expected to suggest high-value tests and reduce the
overuse of testing, lowering costs in healthcare facilities. This will happen at a slow pace, since ML
models need to be built and validated individually for each clinical case, and the majority of the data is
unstructured, making it inaccessible to algorithms without prior preprocessing and transformation.
This is where NLP comes in, making it possible to extract knowledge from high-value
textual data in electronic health records and thereby enhance risk prediction performance.
1.2 Predictive Models for Emergency Departments
The literature contains some related work dedicated to the development of predictive models
for EDs that use information available at the time of triage. The work of [19] focused on developing
a clinical decision support system to identify discriminatory characteristics
that can predict Pediatric ED revisits within 72 hours after a patient is discharged. Initially, the
authors considered 96 factors, which included the discriminator of the main chief complaint,
the physician diagnosis, 5 factors regarding patient demographics, 8 factors regarding the mode
of patient arrival at the Pediatric ED, 35 factors related to the hospital environment, and 44 factors
related to medical treatment. The ML framework followed the wrapper approach, in which feature
selection was performed by means of particle swarm optimization. This was combined with an
optimization-based Discriminant Analysis via Mixed Integer Program model, named DAMIP, to generate
classification rules, based on small subsets of discriminatory factors, that can be used to predict
Pediatric ED revisits. Cross-validation was used while the algorithm searched through the space of
subsets of discriminatory factors. This work resulted in a model with a predictive accuracy above 80%,
and the most important factors for a Pediatric ED revisit were the diagnosis, the discriminator, and the
type of provider.
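The wrapper approach can be sketched as a loop that scores candidate feature subsets by the cross-validated performance of a classifier trained on them. The sketch below is only an illustration of that idea: a nearest-centroid classifier and an exhaustive search over small subsets stand in for DAMIP and particle swarm optimization, which are not specified here, and the toy data and all function names are hypothetical.

```python
from itertools import combinations


def centroid(rows):
    """Column-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(x, c0, c1):
    """Assign x to the class whose centroid is closer (squared distance)."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 1 if d1 < d0 else 0


def cv_accuracy(X, y, features, k=5):
    """k-fold cross-validated accuracy using only the given feature indices."""
    def project(i):
        return [X[i][f] for f in features]

    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        c0 = centroid([project(i) for i in train if y[i] == 0])
        c1 = centroid([project(i) for i in train if y[i] == 1])
        correct += sum(predict(project(i), c0, c1) == y[i] for i in test)
    return correct / len(X)


def wrapper_select(X, y, max_size=2, k=5):
    """Return the feature subset (up to max_size) with the best CV accuracy."""
    best_subset, best_score = (), -1.0
    for size in range(1, max_size + 1):
        for subset in combinations(range(len(X[0])), size):
            score = cv_accuracy(X, y, subset, k)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Toy data: feature 0 tracks the label, features 1 and 2 are uninformative.
y = [1 if i % 4 == 0 else 0 for i in range(20)]
X = [[y[i] + 0.1 * (i % 3), float(i % 3), float(i % 7)] for i in range(20)]
best, score = wrapper_select(X, y)
print(best, score)
```

Exhaustive search is only feasible for a handful of features; with 96 candidate factors the subset space is far too large, which is why a metaheuristic search is used in [19].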
The previous work paved the way for further research on ED revisits. The main difference in [20]
is that it focused on the general ED rather than the Pediatric ED. The authors considered 44 factors, of
which 22 were binary features representing the diagnosis, each indicating whether a diagnosis in a
certain category was given. The remaining ones, described in the work, included age, gender, arrival
mode, urgency level, and length of stay. The ML framework also followed the wrapper approach, and
three feature selection techniques were explored: particle swarm optimization (as in the previous work),
tabu search, and random search. The feature selection technique was combined with ML algorithms,
and the authors again considered the DAMIP approach for generating classification rules. For
comparison purposes, they also considered linear discriminant analysis, NB, SVM, LR, nearest shrunken
centroid, and neural network algorithms. During model training, the majority class (No Revisit) was
under-sampled to improve the prediction of the minority class (Revisit). This research concluded that,
among the different classification strategies, only NB and nearest shrunken centroid were capable of
classifying some patients from the minority class. The sensitivity, specificity, precision, and F1-Score
were 28.7%, 86.4%, 9.9%, and 14.8% for the NB model and 67.5%, 48.4%, 6.4%, and 11.7% for the
nearest shrunken centroid model. Given the low scores, the authors suggest that these models should
not be considered by healthcare providers as a clinical decision support system. Among the three
feature selection approaches, the one that showed the best results was the tabu search metaheuristic
coupled with the DAMIP model. From the 44 initial features, only 19 were chosen as predictors, but the
authors did not specify which ones. The DAMIP model with tabu search feature selection achieved a
sensitivity, specificity, precision, and F1-Score of 69.0%, 62.7%, 8.9%, and 15.7%, respectively.
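The F1-Score is the harmonic mean of precision and sensitivity (recall), so the figures quoted above can be cross-checked against each other. The short sketch below recomputes F1 from the reported precision and sensitivity; the dictionary simply restates the numbers from the text, and small differences are expected because the inputs are rounded to one decimal place.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)


# (precision, sensitivity, reported F1) as quoted from [20] in the text.
reported = {
    "NB": (0.099, 0.287, 0.148),
    "Nearest shrunken centroid": (0.064, 0.675, 0.117),
    "DAMIP + tabu search": (0.089, 0.690, 0.157),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{name}: recomputed F1 = {f1:.3f}, reported F1 = {f1_reported:.3f}")
```

In all three cases the low precision dominates the harmonic mean, which is why F1 stays below 16% even when sensitivity approaches 70%.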
Similarly to the previous studies, [21] also focused on developing a clinical decision support system
to identify discriminatory characteristics that can predict adult ED revisits
within 72 hours after a patient is discharged. The authors considered about 140 factors, including
patient demographics; chief complaints; International Classification of Diseases, 9th Revision, Clinical
Modification descriptions; medical history; laboratory tests; handovers; vital signs; language; arrival
mode; admission time; admission date; acuity category; and social issues. The patient demographics
included age, gender, ethnicity, nationality, and residential status. The chief complaints were classified
into 18 categories, i.e., the discriminator associated with the main chief complaint was used. The acuity
category was characterized using four levels. The number of handovers only considered doctors. Medical
history consisted of all known medical conditions of a patient. Social issues included generalized
weakness, severe social and financial issues, homelessness, and family violence, among others not
described in the study. In this study, risk factors were pre-selected using univariate analysis in
order to derive statistically significant factors. Afterwards, feature selection using filter methods was
performed on the set derived from the univariate analysis to generate sets of risk factors. The authors
explored different filter methods: joint mutual information, mutual information feature selection,
conditional mutual information maximization, max-relevance min-redundancy, interaction capping,
conditional infomax feature extraction, double input symmetrical relevance, conditional mutual
information, and conditional redundancy. For each filter method, the top 15 factors were considered,
yielding 9 sets of selected factors from which the 10 most important factors were obtained. As in the two
previous works, DAMIP was used to generate classification rules, and LR was used for comparison,
with 10-fold cross-validation during model training. This study resulted in a DAMIP
model with an overall accuracy of 72.6% and a sensitivity of about 40%. The LR
model achieved a ROC-AUC score of 66.0%, and several sensitivity and specificity results were
presented for different cut-off probabilities. With a 5% probability cut-off, the
sensitivity and specificity were 51.0% and 70.8%, respectively, resulting in an overall accuracy
of 69.9%. With a 70% probability threshold, the sensitivity and specificity were 1.0%
and 99.9%, respectively, resulting in an overall accuracy of 95.5%. Among the initial 140 factors, the 10
most important in this study, sorted from most to least important, were: social issues; chronic
obstructive pulmonary disease; substance misuse (chief complaint discriminator); neurotic disorders;
open wound of the lower limb; handover; respiratory (chief complaint discriminator); open wound of the
upper limb; fracture of the upper limb; and open wound of the head, neck, and/or trunk.
In the work of [22], the objective was the development of a clinical decision support system to identify
patients in risk groups for revisiting the ED within 72 hours after discharge. The data came from the
Veterans Healthcare Network Upstate New York and consists of: Demographics - age, gender, marital
status, race, period of military service, and disability rating; Socioeconomic - income, homelessness,
insurance status; Prior ED utilization - ED revisit within 72 hours, ED revisit within 30 days, number of ED
visits, number of primary care visits, number of tele-health encounters, total outpatient visits, number of
hospitalizations, total cost; Comorbidities - 285 clinically homogeneous groups. The adopted modelling
algorithm was the multivariate LR and the final model was developed with stepwise methods, retaining variables
with p-values less than 0.05 to avoid overfitting. For model validation the authors used the split-sample
method where 2/3 of the data was used for model development and the remainder for validation. The
authors considered three sets of predictors to demonstrate model predictive power according to each
considered hypothesis regarding independent variables. The first hypothesis considered demographic
and socioeconomic variables. The second hypothesis used the variables of the first hypothesis and the
Prior ED utilization variables. The third, and final, hypothesis considered all variables. Model predic-
tive power was measured using ROC-AUC whose values, in validation, for the first, second, and third
hypothesis are 0.54, 0.70, and 0.73, respectively.
Although few published works address ED revisits, several studies consider the ED and the data
produced in this environment. The works of [23] and [24] used triage
and patient demographics information to develop ML models to predict hospital admissions at the time
of ED triage. Both studies considered LR and Extreme gradient boosting algorithms. The former also
considered deep neural networks, the latter also took into account artificial neural networks, decision
trees, random forests, and SVM (linear kernel) algorithms. Still focused on prediction models to predict
hospital admission at the time of triage, the works of [25] and [26] differ from the two previous ones since
these works considered unstructured data and developed NLP frameworks to preprocess the textual
data in order to be used as a predictor for model development. Both works considered the LR algorithm.
The former also used multilayer neural network models, whereas the latter considered the decision trees,
random forests, extremely randomized tree, AdaBoost, MNB, SVM (linear kernel), and Nu-SVM (linear
kernel) models. The works of [27] and [28] had as objective the development of ML prediction models of
patient mortality using information available at ED triage. The former used multivariate LR and the latter
considered the random forest and classification and regression tree algorithms.
Several works are dedicated to textual data coming from the ED. Applying NLP techniques to
the main chief complaint textual data, the work of [29] aimed to classify the free text
present in the main chief complaint into seven syndromic categories: Respiratory; Botulinic;
Gastrointestinal; Neurologic; Rash; Constitutional; Hemorrhagic. The model was based on the M+ system,
a robust chart-based semantic model with a Bayesian network based semantic model for
extracting information from narrative patient records [30]. The developed system was able, for most
syndromes, to classify about half of the patients' main chief complaints into syndromic categories, with
specificities higher than 90% for all syndromic categories and sensitivities ranging from 12% to 44%. The work
of [31] focused on coding the main chief complaints according to 228 possibilities, i.e. the corresponding
discriminator related with the main chief complaint. The authors developed the Coded Chief Complaints
for Emergency Department Systems classification schema which is a text-parsing algorithm that reads
the main chief complaint and classifies it into 1 of the 228 coded possibilities. In [32] a system capable
of predicting the main chief complaint of a patient according to patient status, i.e. vital signs and state at
ED arrival, was developed. The system consists of a linear SVM trained on a BoW representation of triage
notes, where negation was taken into account (using NegEx [33], a NegEx-like system with some added
rules, or a perceptron classifier) so that the model did not consider negated complaints as
candidates for the main chief complaint. In [34] the focus was on building a concept-oriented
terminology for nursing language entries, since a standard vocabulary does not exist for the chief
main complaint. The authors used NLP techniques and the Unified Medical Language System [35]. The
NLP techniques used consisted of normalization, tokenization, stemming, word sense disambiguation,
word lookup, spelling correction, and abbreviation expansion.
1.3 Objectives and Contributions
The first goal of this study is to advance knowledge within the healthcare area by applying ML meth-
ods to healthcare data, structured (demographics, vital signs, dummies) and unstructured (chief main
complaint). These data are used to develop a predictive model that identifies the risk of adult
patients’ ED revisits within 72 hours after discharge. By identifying this type of patient, decisions
regarding their care can be taken in a timely manner to reduce the risk of a revisit, improve
patient safety, and reduce the costs of the healthcare facility. Given this, five hypotheses regarding the predictors
were made. The first consists of using Baseline triage variables. The second uses the variables of the
first hypothesis with some additional variables (named All Numeric). The third hypothesis, Textual, only
considers the chief main complaint. The fourth uses variables from the first and third hypotheses and
is named Textual and Baseline. The fifth and final hypothesis, Textual and All Numeric, considers all
the variables, i.e. variables from the second and third hypotheses. For a full description of the variables
the reader is referred to table 5.1.
The second goal is to assess whether the chief main complaint carries relevant information indicating
that a patient belongs to a risk group, since, to gain knowledge about the causes of a
patient's ED revisit, it is important that the predictive model identifies the factors that are critical for prediction.
The third goal is to find the best pairing of ML model and textual feature extraction technique
for predicting patient ED revisits.
To cover all aspects, three research questions (RQ) were formulated:
• RQ1 - Does the textual data increase the revisits prediction power?
• RQ2 - Which ML model and textual feature extraction technique are most suitable for predicting
those risk-group patients?
• RQ3 - Which features best describe the risk of ED revisits?
Regarding the contributions, to the best of the author’s knowledge, this is the first study that focuses
on developing a model to predict the risk of adult patients' ED revisits by applying ML techniques
to ED triage data and applying NLP to the chief main complaint so that it can be used as a predictor
in the development of prediction models.
1.4 Thesis Outline
This document is composed of 6 chapters including the present one. The remainder of the document is
organized as follows:
Chapter 2 addresses ML algorithms and how to evaluate such ML models. Some explanation re-
garding concerns when working with ML is also provided.
Chapter 3 describes the difficulties when dealing with (sub)language and explains, with several
examples, techniques to overcome them.
Chapter 4 presents in detail the chosen methodology in this work. It starts by describing the pre-
processing of textual data, the developed NLP framework, and at the end an example using real data
is presented. The chosen data modeling strategy follows and the chapter ends describing how models
were evaluated and selected.
Chapter 5 starts with the description of the database and how it was handled in order to be used.
Afterwards, a summary of the main results is provided.
Chapter 6 summarizes the methodologies implemented throughout the work, presents the main
conclusions drawn from it, and finally lays out recommendations for further study.
Chapter 2
Machine Learning
Arthur Lee Samuel (1901-1990), considered one of the pioneers in the field of Artificial Intelligence
(AI) [36], conceived one of the earliest definitions of Machine Learning (ML) stating, in [37], the following:
”A computer can be programmed so that it will learn to play a better game of checkers than can be
played by the person who wrote the program.”
Tom Michael Mitchell proposed a more formal and broad interpretation of learning: ”A computer
program is said to learn from experience E with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured by P, improves with experience E.” [38]
Combining these two definitions one can interpret ML as a computer science theory that provides
systems the ability to automatically learn and to progressively improve performance from experience
without being explicitly programmed.
The process of learning begins with observations, in order to look for patterns in data by means of
statistical techniques and to make better decisions in the future based on the examples provided.
The primary aim is to allow computers to learn automatically, without human intervention, and to adjust
their actions accordingly.
ML approaches are usually partitioned in three different types: Supervised, Unsupervised and Re-
inforcement Learning. Figure 2.1 illustrates those partitions and the different problems tackled by each
ML technique.
Supervised Learning methods are techniques that seek to find the relationship between input variables,
also referred to as features or independent variables, and a target attribute, typically referred to as the
outcome or dependent variable, that can be either discrete or continuous. The mapping function can be
generally described as y = f(x), where y is the dependent variable and x is the independent variable.
Depending on the function approximation task, one can be dealing with a classification or a regression
problem. The main difference between classification and regression tasks lies in the type of the
outcome, whether it is discrete or continuous. If the outcome is discrete the problem at hand is a
classification problem and we can rename the outcome as class, where the classes are pre-defined. The
mapping function f(x) predicts the class for a given set of features. Usually the prediction is given by a
continuous value indicating the probability of a given observation belonging to each class. A predicted
probability can be converted into a discrete class value by selecting a probability threshold that
separates the class labels.
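As a minimal sketch of this conversion (the function name and the example probabilities are illustrative assumptions, not taken from this work), a predicted probability for Class 1 could be mapped to a class label as follows:

```python
def to_class(prob_class1: float, threshold: float = 0.5) -> int:
    """Return 1 if the predicted probability of Class 1 reaches the threshold, else 0."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return 1 if prob_class1 >= threshold else 0


# Hypothetical predicted probabilities for four observations.
probs = [0.03, 0.08, 0.40, 0.75]

labels_default = [to_class(p) for p in probs]              # threshold 0.5
labels_low = [to_class(p, threshold=0.05) for p in probs]  # threshold 0.05
```

Lowering the threshold assigns more observations to Class 1, which is one way to trade specificity for sensitivity when the positive class is rare.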
[Figure 2.1 diagram: Machine Learning branches into Supervised Learning (Classification, Regression), Unsupervised Learning (Clustering, Density Estimation), and Reinforcement Learning (Model Based, Temporal Difference, Policy).]
Figure 2.1: Different types of machine learning techniques.
If the outcome is continuous the task being handled is of the regression type and normally the
outcome describes a quantity. The mapping function predicts a real-value continuous quantity for a
given observation.
Unsupervised Learning differs from Supervised Learning since the only known data are the ob-
servations, i.e., all data is unlabelled. The goal of Unsupervised Learning is to model the underlying
distribution in the data in order to learn more about it and extract more knowledge. According to [39]
the most commonly used strategies are: Cluster Analysis [40], Data Compression [41] and Anomaly
Detection [42].
Reinforcement Learning is concerned with developing intelligent agents capable of learning to achieve
a certain goal, in an uncertain and likely complex world, through a sequence of actions chosen to
maximize a reward [43]. The designer of such an agent sets the reward policy and settles the trade-off
between exploration and exploitation, so that the agent takes sub-optimal actions in order to correct the policy
[44].
2.1 Classification Algorithms
This section explains some of the most used Supervised Learning algorithms in healthcare for clas-
sification of dichotomous data. Firstly the probabilistic approaches, both discriminative and generative
models [45], will be described, followed by a similarity-based approach, a margin model.
2.1.1 Logistic Regression
LR is a discriminative log-linear probabilistic type of modelling where the outcome is a categorical
variable. It is assumed that the probability of each class can be constructed from a linear combination
of the features describing the observation.
$$P[\mathrm{Class} = 1 \mid X = x] = \frac{1}{1 + e^{-w^\top \phi(x)}} \tag{2.1.1}$$
This method is named after its core function, the logistic function (equation 2.1.1), which models
the probability of the binary outcome as a function of a vector of N predictor variables describing
the observation, $\phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_N(x)]^\top$, and of the regression coefficients, sometimes
referred to as weights, $w = [w_1, w_2, \ldots, w_N]^\top$, where each weight indicates how much the corresponding
feature contributes to the prediction of the class. The weight vector can be directly optimized
using any local search method, such as Coordinate Descent (CD) and Stochastic Gradient Descent
(SGD).
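The logistic function of equation 2.1.1 can be sketched in a few lines. The weights and feature values below are hypothetical and chosen only for illustration; `predict_proba` is an assumed helper name, not part of any library.

```python
import math


def predict_proba(w, phi_x):
    """P[Class = 1 | X = x] from equation 2.1.1 for weights w and features phi(x)."""
    z = sum(w_k * phi_k for w_k, phi_k in zip(w, phi_x))  # linear score w^T phi(x)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and feature vector.
w = [0.5, -1.0, 2.0]
phi_x = [1.0, 2.0, 0.75]

p1 = predict_proba(w, phi_x)  # probability of Class 1
p0 = 1.0 - p1                 # probability of the other class
```

Here the linear score is exactly zero, so both classes receive probability 0.5; a positive score pushes the probability of Class 1 above 0.5 and a negative score pushes it below.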
The loss for this method, the negative log-likelihood, is given in equation 2.1.2.
$$l(x, y, f) = -\log\left(f(y \mid x)\right) \tag{2.1.2}$$
Given equation 2.1.1 one can easily compute the logistic function for the other outcome noting the
following:
$$P[\mathrm{Class} = 0 \mid X = x] = 1 - P[\mathrm{Class} = 1 \mid X = x] = \frac{e^{-w^\top \phi(x)}}{1 + e^{-w^\top \phi(x)}} = \frac{1}{1 + e^{w^\top \phi(x)}} \tag{2.1.3}$$
With the results of equation 2.1.3 and equation 2.1.1, and encoding the two classes as $y_i \in \{-1, 1\}$,
we can rewrite equation 2.1.2 as expressed in equation 2.1.4.
$$l(x, y, f) = \log\left(e^{-y_i w^\top \phi(x_i)} + 1\right) \tag{2.1.4}$$
In an LR model, the inherent inductive bias is related to the assumed form of the dependence of the
output on the features. In particular, LR is a type of log-linear model, in which the log-odds of the output depend
linearly on the features describing the observation.
For the LR model the decision boundary, assuming a probability threshold of 0.5, is given by the
solution to the equation P [Class = 0 | X = x] = P [Class = 1 | X = x], which yields equation 2.1.5.
$$w^\top \phi(x) = 0 \tag{2.1.5}$$
If we instead consider another probability threshold, called t for generalization purposes, then the
decision boundary, given by equation 2.1.6, is the result of the equation P [Class = 1 | X = x] = t.
$$w^\top \phi(x) = -\ln\left(\frac{1 - t}{t}\right), \quad 0 < t < 1 \tag{2.1.6}$$
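The relation between a probability threshold t and the position of the decision boundary can be checked numerically. The sketch below uses assumed helper names (`boundary_offset`, `classify`, `sigmoid`) and hypothetical scores; it only illustrates equation 2.1.6.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function of equation 2.1.1 applied to a linear score z."""
    return 1.0 / (1.0 + math.exp(-z))


def boundary_offset(t: float) -> float:
    """Value of w^T phi(x) on the decision boundary for threshold t (equation 2.1.6)."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must satisfy 0 < t < 1")
    return -math.log((1.0 - t) / t)


def classify(score: float, t: float = 0.5) -> int:
    """Assign Class 1 when the linear score reaches the boundary associated with t."""
    return 1 if score >= boundary_offset(t) else 0


# Applying the logistic function to the boundary score recovers the threshold itself.
recovered = sigmoid(boundary_offset(0.05))
```

With t = 0.5 the boundary is at a score of zero, as in equation 2.1.5, whereas a smaller threshold shifts the boundary to negative scores, so that more observations are assigned to Class 1.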
Given a training set $D = \{(x_n, y_n), n = 1, 2, \ldots, N\}$, where $x$ represents a data point and $y$ the
respective Class, the Empirical Risk, which represents a proxy of the true risk for the LR model, is given
by equation 2.1.7, corresponding to the negative log-likelihood of the observed data. According to the
Empirical Risk Minimization principle [46], it is reasonable to minimize this quantity since the underlying
distribution $\mu_D$ is unknown. However, the Empirical Risk provides no information about how well a given
mapping function generalizes beyond the training data.
$$R_{LR} = \frac{1}{N} \sum_{i=1}^{N} \log\left(e^{-y_i w^\top \phi(x_i)} + 1\right) \tag{2.1.7}$$
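A minimal sketch of the Empirical Risk in equation 2.1.7 is given below. It assumes labels encoded as -1 and +1, and the toy training set and weight vectors are hypothetical.

```python
import math


def logistic_loss(w, phi_x, y):
    """log(1 + exp(-y * w^T phi(x))) for a label y in {-1, +1}."""
    z = sum(w_k * phi_k for w_k, phi_k in zip(w, phi_x))
    return math.log1p(math.exp(-y * z))


def empirical_risk(w, features, labels):
    """Average logistic loss over the training set (equation 2.1.7)."""
    losses = [logistic_loss(w, phi_x, y) for phi_x, y in zip(features, labels)]
    return sum(losses) / len(losses)


# Hypothetical toy training set; the first feature acts as a constant bias term.
features = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]]
labels = [1, -1, 1, -1]

risk_zero = empirical_risk([0.0, 0.0], features, labels)  # all-zero weights give log(2)
risk_fit = empirical_risk([0.0, 1.0], features, labels)   # weights that separate the classes
```

A local search method such as SGD would repeatedly adjust w to decrease this quantity; the two weight vectors above only show that a better fit yields a lower Empirical Risk.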
2.1.2 Naive Bayes
The Naive Bayes (NB) classifier is a simple probabilistic model based on applying Bayes' theorem
with the strong, naive assumption of conditional independence between each pair of features. Unlike
discriminative methods, this type of model implicitly estimates the underlying distribution $\mu_D$, making
it a generative model. Although the independence assumption is usually false, there are some theoretical
reasons for the apparently unreasonable efficacy of NB classifiers [47]. Another problem concerns the
quality of the probability estimates but, as pointed out in [48], despite the overestimated probabilities, the
decision only depends on which probability is larger in order to classify an observation as being
either Class 0 or 1.
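$$P(y \mid \phi_1(x), \ldots, \phi_N(x)) = \frac{P(y)\, P(\phi_1(x), \ldots, \phi_N(x) \mid y)}{P(\phi_1(x), \ldots, \phi_N(x))} \tag{2.1.8a}$$
$$P(\phi_i(x) \mid y, \phi_1(x), \ldots, \phi_{i-1}(x), \phi_{i+1}(x), \ldots, \phi_N(x)) = P(\phi_i(x) \mid y) \tag{2.1.8b}$$
$$\Downarrow$$
$$P(y \mid \phi_1(x), \ldots, \phi_N(x)) = \frac{P(y) \prod_{i=1}^{N} P(\phi_i(x) \mid y)}{P(\phi_1(x), \ldots, \phi_N(x))} \tag{2.1.8c}$$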
According to Bayes' theorem, the relationship between the class variable $y$ and the feature
vector $\phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_N(x)]^\top$ is given by equation 2.1.8a, where $P(y)$ is the prior,
$\prod_{i=1}^{N} P(\phi_i(x) \mid y)$ the likelihood (once conditional independence is assumed) and $P(\phi_1(x), \ldots, \phi_N(x))$ the evidence.
As already stated, independence is assumed between the features $\phi_i(x)$, $i \in \{1, \ldots, N\}$. Replacing the
conditional probability in the second factor of the numerator of equation 2.1.8a with the result of equation 2.1.8b,
the relationship between $y$ and $\phi(x)$ simplifies to equation 2.1.8c. Since the evidence does not depend on the
class and the values of the features are known, the evidence is constant. This term
can therefore be discarded from equation 2.1.8c, resulting in equation 2.1.9a. The decision rule is thus given by
equation 2.1.9b, using maximum a posteriori estimation, which chooses the most probable hypothesis.
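$$P(y \mid \phi_1(x), \ldots, \phi_N(x)) \propto P(y) \prod_{i=1}^{N} P(\phi_i(x) \mid y) \tag{2.1.9a}$$
$$\Downarrow \quad \text{(Maximum a posteriori estimation)}$$
$$\hat{y} = \arg\max_{y} \left( P(y) \prod_{i=1}^{N} P(\phi_i(x) \mid y) \right) \tag{2.1.9b}$$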
A class prior may be calculated by assuming equiprobable classes, or by estimating the class probability
from the training set, i.e., $P(y_i) = \frac{\text{number of samples of class } i}{\text{total number of samples}}$. To estimate the parameters
of a feature's distribution, it is necessary to assume a distribution for the likelihood of the features,
$P(\phi_i(x) \mid y)$, or to generate non-parametric models for the features from the training set [49]. This is where
the different NB classifiers differ, since they make different assumptions regarding the likelihood.
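The prior estimate and the maximum a posteriori rule of equation 2.1.9b can be sketched as follows. The labels and per-feature likelihood values are made up for illustration, and the helper names are assumptions. Logarithms are summed instead of multiplying probabilities, a common implementation choice that limits numerical underflow and does not change the arg max.

```python
import math
from collections import Counter


def class_priors(labels):
    """Estimate P(y) as the fraction of training samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {y: counts[y] / total for y in counts}


def map_class(priors, likelihoods):
    """Pick the class maximising P(y) * prod_i P(phi_i(x) | y) (equation 2.1.9b).

    likelihoods[y] holds the per-feature values P(phi_i(x) | y) for one observation.
    """
    def log_score(y):
        return math.log(priors[y]) + sum(math.log(p) for p in likelihoods[y])
    return max(priors, key=log_score)


# Hypothetical imbalanced training labels and made-up likelihoods for one observation.
train_labels = [0] * 94 + [1] * 6
priors = class_priors(train_labels)
likelihoods = {0: [0.10, 0.20], 1: [0.60, 0.70]}
predicted = map_class(priors, likelihoods)
```

In this example the small prior of class 1 is outweighed by its much larger likelihood, so class 1 is selected. Any zero likelihood would make the logarithm undefined, which is one motivation for the smoothing discussed for the multinomial variant.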
Multinomial Naive Bayes
Multinomial Naive Bayes (MNB) implements the NB algorithm for multinomially distributed data,
where the distribution is parametrized by vectors $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the
number of features and $\theta_{yi}$ is the likelihood, $P(\phi_i(x) \mid y)$, of feature $i$ appearing in a sample belonging
to class $y$.
Maximum Likelihood Estimation is commonly used to estimate the parameters $\theta_y$, as given by equation
2.1.10, where $N_{yi} = \sum_{x \in T} \phi_i(x)$ is the number of times feature $i$ appears in samples of class $y$ in
the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.
$$\theta_{yi,\text{empirical}} = \frac{N_{yi}}{N_y} \tag{2.1.10}$$
The problem with the Maximum Likelihood estimate is that it is zero for any term-class combination
that did not occur in the training data. This is due to sparseness, since
the training data is never large enough to represent the frequency of rare events adequately.
One way to solve the sparseness problem is through smoothing techniques, where a pseudo-count
$\alpha$ is added to the empirical probability estimate. It controls the smoothing strength,
accounts for features not present in the training samples, and prevents zero probabilities in
further computations. Given this, equation 2.1.11 is used instead of equation 2.1.10 to compute the
feature likelihood.
$$\theta_{yi,\alpha\text{-smoothed}} = \frac{N_{yi} + \alpha}{N_y + \alpha n}, \quad \alpha \geq 0 \tag{2.1.11}$$
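The effect of the pseudo-count can be seen in a short sketch of equations 2.1.10 and 2.1.11. The count matrix is hypothetical and `smoothed_likelihoods` is an assumed helper name.

```python
def smoothed_likelihoods(count_rows, alpha=1.0):
    """Return theta_yi = (N_yi + alpha) / (N_y + alpha * n) for one class y.

    count_rows holds the feature-count vectors of the training samples of class y.
    """
    n = len(count_rows[0])                                         # number of features
    n_yi = [sum(row[i] for row in count_rows) for i in range(n)]   # N_yi per feature
    n_y = sum(n_yi)                                                # N_y
    return [(c + alpha) / (n_y + alpha * n) for c in n_yi]


# Hypothetical term counts for three samples of one class over four terms.
counts_class = [[2, 0, 1, 0], [1, 0, 0, 0], [0, 0, 3, 0]]

theta_mle = smoothed_likelihoods(counts_class, alpha=0.0)       # equation 2.1.10
theta_smoothed = smoothed_likelihoods(counts_class, alpha=1.0)  # equation 2.1.11
```

With alpha set to zero, the two terms never seen in this class receive a likelihood of exactly zero; with alpha set to one, every term keeps a small positive likelihood and the values still sum to one.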
Recalling equation (2.1.9a) the decision rule for the MNB classifier is given by equation 2.1.12.
$$\hat{y} = \arg\max_{y} \left( P(y) \prod_{i=1}^{N} \theta_{yi,\alpha\text{-smoothed}} \right) \tag{2.1.12}$$
Complement Naive Bayes
Complement Naive Bayes (CNB) is an adaptation of the MNB algorithm that is suited to
imbalanced datasets in text classification tasks [50]. In particular, CNB uses statistics from the complement
of each class to compute the $\theta_y$ vectors.
$$\theta_{yi} = \frac{\alpha + \sum_{j: y_j \neq y} d_{ij}}{\alpha n + \sum_{j: y_j \neq y} \sum_{k} d_{kj}} \tag{2.1.13}$$
The likelihood is thus computed as shown in equation 2.1.13, where the summations run over all data points
that do not belong to class $y$ (hence the name of the algorithm) and $d_{ij}$ is either the count or the Term
Frequency-Inverse Document Frequency value of term $i$ in document $j$, explained in sections 3.6.1 and 3.6.2, respectively.
If feature $\phi_i(x)$ does not come from textual data, e.g., a physiological variable, $d_{ij}$ is the value of feature
$\phi_i(x)$ for data point $x_j$. $\alpha$ is a smoothing hyper-parameter like the one described before.
The classification rule is given by equation 2.1.14, i.e., an observation is assigned to the class with the lowest
probability, that is, the poorest complement match. The normalization $w_{yi} = \frac{\log \theta_{yi}}{\sum_j |\log \theta_{yj}|}$ addresses the
tendency for longer documents, when dealing with textual data, to dominate parameter estimates in
MNB.
$$\hat{y} = \arg\min_{y} \left( P(y) \sum_{i} w_{yi} \right) \tag{2.1.14}$$
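A minimal sketch of the complement statistics is given below. It computes only the likelihoods of equation 2.1.13 and the normalized weights, not the full classification rule; the documents, labels, and the helper name `complement_weights` are hypothetical, and alpha is assumed to be strictly positive so that the logarithm is defined.

```python
import math


def complement_weights(docs, labels, target, alpha=1.0):
    """Normalised CNB weights w_yi for class `target`.

    docs[j][i] is d_ij, the count (or TF-IDF value) of term i in document j.
    Statistics are gathered only from documents whose label differs from `target`.
    """
    n = len(docs[0])
    complement = [d for d, y in zip(docs, labels) if y != target]
    term_totals = [sum(d[i] for d in complement) for i in range(n)]
    denom = alpha * n + sum(term_totals)
    theta = [(alpha + t) / denom for t in term_totals]  # equation 2.1.13
    log_theta = [math.log(t) for t in theta]
    norm = sum(abs(v) for v in log_theta)
    return [v / norm for v in log_theta]                # normalised weights


# Hypothetical term counts: term 0 is typical of class 0, term 1 of class 1.
docs = [[3, 0, 1], [2, 1, 0], [0, 4, 1], [0, 3, 2]]
labels = [0, 0, 1, 1]

w_class0 = complement_weights(docs, labels, target=0)
w_class1 = complement_weights(docs, labels, target=1)
```

Term 0 never occurs outside class 0, so it receives the most negative weight for class 0. Under an arg min rule, this makes documents containing term 0 more likely to be assigned to class 0.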
2.1.3 Support Vector Machines
The two previous approaches assume that the training data adheres to a probabilistic structure, i.e.,
the relation between the input and the output can be described using a probabilistic framework. The
goal of such learning algorithms is thus to uncover such structure from the data by building probabilistic
models using the training data and relying on such models to make predictions about unseen data,
assuming that the model is able to generalize beyond the training data.
Support Vector Machine (SVM) is an alternative kind of approach in which the data acts as its own
model: the algorithm assumes that the geometry of the observations is
central in deciding the outcome associated with each observation. The aim of the SVM classifier is to
maximize the geometric margin of a hyper-plane, i.e., to maximize the distance between the examples in the
training set and the decision boundary. In this type of classifier, the distance
between the decision boundary and the closest observations of each class is exactly the same.
Since this method maps the observations directly to their respective classes without modeling
any probability distribution or structure of the data, SVM, just like the LR approach, is a
discriminative model. The original SVM algorithm, developed by [51], was based on the work of Vladimir
N. Vapnik and Alexey Ya. Chervonenkis on pattern recognition [52]. The main disadvantage of the
original SVM algorithm is that it only works when the data is linearly separable. Another unfavourable
condition is that, even if there exists a hyper-plane capable of linearly separating the data, there may be
errors, e.g., outliers, present in the data. This may lead to a small margin which, since in general
the larger the margin the lower the generalization error of the classifier, is not desirable.
To overcome such unfavourable conditions, the standard SVM algorithm was modified to
accommodate non-linear transformations. This is known as the kernel trick and is achieved by
replacing every dot product by a non-linear kernel function [53]. Another adjustment was to relax
the objective function by adding slack variables to the optimization problem that
penalize observations falling inside the margin [54]. This is known as soft-margin SVM.
To better understand the differences between these algorithms, figure 2.2 compares
the soft-margin and hard-margin algorithms on a toy set with two features, $\phi_1(x)$ and $\phi_2(x)$, and an outlier
in one of the classes. Subfigure 2.2a illustrates the effect of noise on the decision boundary
and on the model's margin when using the original SVM approach: the model has a very small margin
and the decision boundary generalizes poorly. On the other hand, as shown in subfigure
2.2b, the relaxation provided by the slack variables results in a bigger margin and the algorithm
generalizes better than the one in subfigure 2.2a.
Like most ML methods, the present algorithm has a set of hyper-parameters, of which the main
one is the kernel. It maps the observations into some feature space in order to make them more easily
separable after the transformation. There are many types of kernel functions, since they can be defined as
the designer sees fit, but the standard ones are the linear, radial basis, polynomial and sigmoid.
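The four standard kernels can be written compactly as below. This is a sketch under common conventions: the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions for illustration and are not taken from this work.

```python
import math


def dot(u, v):
    """Plain dot product between two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


# Kernel values for two hypothetical observations.
u, v = [1.0, 2.0], [3.0, 4.0]
values = {
    "linear": linear_kernel(u, v),
    "rbf": rbf_kernel(u, v, gamma=0.5),
    "polynomial": polynomial_kernel(u, v, degree=2),
    "sigmoid": sigmoid_kernel(u, v, gamma=0.01),
}
```

Each function returns a similarity between two observations, so any of them can take the place of the dot product in the SVM formulation.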
[Figure 2.2 panels: (a) Hard-margin SVM; (b) Soft-margin SVM. Each panel plots $\phi_1(x)$ (vertical axis) against $\phi_2(x)$ (horizontal axis) and marks the support vectors.]
Figure 2.2: Comparison between the hard-margin SVM and the soft-margin SVM for a toy set.
Given a classification problem with a training set that consists of $N$ instances $X = \{x_1, x_2, \ldots, x_N\}$,
a pre-defined set of $K$ features $\phi = [\phi_1, \phi_2, \ldots, \phi_K]$ and an outcome $y \in \{-1, 1\}$, the (soft-margin) SVM
classifier is the solution to the constrained optimization problem depicted in equation 2.1.15.
The hyper-parameter $C$, common to all kernel functions, strikes a tradeoff between the width
of the margin and the extent to which "violations" of the margin are admissible. The vector $w = [w_1, w_2, \ldots, w_K]$
is the normal vector to the hyper-plane, $b$ is a constant, $\Phi$ is the transformation function, and $\xi_i$ are the
penalty terms corresponding to the distance from the margin, which in the optimization problem serve as
slack variables that allow "violations" of the constraint $y_i(w^\top \Phi(x_i) + b) \geq 1 - \xi_i$.
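$$\min_{w, b, \xi} \left\{ \frac{1}{2} \langle w^\top, w \rangle + C \sum_{i=1}^{N} \xi_i \right\} \quad \text{subject to} \quad y_i(w^\top \Phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N \tag{2.1.15}$$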
The formulation described in equation 2.1.15 is known as the primal formulation. Applying the
Lagrange Multiplier method to this formulation, and the Karush–Kuhn–Tucker conditions to the corresponding
Lagrangian, the optimization problem becomes the one formulated in equation 2.1.16. For a full
mathematical description the reader is referred to [55]. This formulation is known as the dual problem,
where $\alpha$ are the Lagrange multipliers, $e$ is the vector of all ones, and $Q$ is an $N$ by $N$ positive semidefinite
matrix with $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \Phi(x_i)^\top \Phi(x_j)$ is the kernel.
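$$\min_{\alpha} \left\{ \frac{1}{2} \alpha^\top Q \alpha - e^\top \alpha \right\} \quad \text{subject to} \quad y^\top \alpha = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, N \tag{2.1.16}$$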
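To make the dual formulation more concrete, the sketch below builds the matrix Q and evaluates the dual objective of equation 2.1.16 for a hand-picked feasible point. The data, the multipliers, and the helper names are hypothetical; no optimization is performed.

```python
def linear_kernel(u, v):
    """Dot product, i.e. the kernel obtained when no transformation is applied."""
    return sum(a * b for a, b in zip(u, v))


def build_q(xs, ys, kernel):
    """Q_ij = y_i * y_j * K(x_i, x_j) for the dual problem."""
    n = len(xs)
    return [[ys[i] * ys[j] * kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]


def dual_objective(alpha, q):
    """0.5 * alpha^T Q alpha - e^T alpha, with e the vector of ones."""
    n = len(alpha)
    quad = sum(alpha[i] * q[i][j] * alpha[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)


# Hypothetical toy set with labels in {-1, +1}.
xs = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
ys = [1, 1, -1]
q = build_q(xs, ys, linear_kernel)

# Feasible multipliers: y^T alpha = 0 and 0 <= alpha_i <= C for any C >= 0.25.
alpha = [0.25, 0.0, 0.25]
value = dual_objective(alpha, q)
```

Observations with a non-zero multiplier are the support vectors. In this toy case the second observation lies far from the boundary and is given a multiplier of zero.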
The SVM classifier can be determined by solving the associated quadratic optimization problem,
either in its primal or dual form (equation 2.1.15 and equation 2.1.16, respectively). Early approaches
relied on specialized methods for quadratic programming, whereas recent methods rely
on local search approaches, such as CD and SGD, since their performance is
very competitive when compared to other more complex optimization algorithms and they are suitable
for large training sets [56]. To that purpose, and recalling the Empirical Risk Minimization
principle described in [46], we rewrite the problem formulated in equation 2.1.15 as the equivalent one
described in equation 2.1.17.
$$\min_{w, b} \left\{ \frac{1}{2} \langle w^\top, w \rangle + C \sum_{i=1}^{N} \max\{0, 1 - y_i(w^\top \Phi(x_i) + b)\} \right\} \tag{2.1.17}$$
The Empirical Risk for the soft-margin SVM classifier is given by equation 2.1.18 and is associated with the
loss function in equation 2.1.19, also known as the hinge loss.
$$R_{SVM} = \frac{1}{N} \sum_{i=1}^{N} \max\{0, 1 - y_i(w^\top \Phi(x_i) + b)\} \tag{2.1.18}$$
$$l(x, y, f) = \max\{0, 1 - y_i(w^\top \Phi(x_i) + b)\} \tag{2.1.19}$$
The decision function for this approach is expressed in equation 2.1.20, where $w^\top \Phi(x) + b = 0$ is the
equation of the separating hyperplane.
$$f(x) = \mathrm{sgn}(w^\top \Phi(x) + b) \tag{2.1.20}$$
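The hinge loss, the unconstrained objective of equation 2.1.17, and the decision function can be sketched together. The sketch assumes a linear kernel (the transformation is the identity), and the data, weights, and function names are hypothetical.

```python
def score(w, b, phi_x):
    """Linear score w^T Phi(x) + b, with Phi taken as the identity."""
    return sum(w_k * p_k for w_k, p_k in zip(w, phi_x)) + b


def hinge_loss(w, b, phi_x, y):
    """max(0, 1 - y * (w^T Phi(x) + b)), equation 2.1.19."""
    return max(0.0, 1.0 - y * score(w, b, phi_x))


def primal_objective(w, b, features, labels, c=1.0):
    """0.5 * <w, w> + C * sum of hinge losses, equation 2.1.17."""
    reg = 0.5 * sum(w_k * w_k for w_k in w)
    return reg + c * sum(hinge_loss(w, b, x, y) for x, y in zip(features, labels))


def predict(w, b, phi_x):
    """Sign of the score, equation 2.1.20 (a score of exactly zero is mapped to +1 here)."""
    return 1 if score(w, b, phi_x) >= 0 else -1


# Hypothetical toy set: the last observation lies on the wrong side of the boundary.
features = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [0.2, 0.2]]
labels = [1, 1, -1, -1]
w, b = [0.5, 0.5], 0.0

losses = [hinge_loss(w, b, x, y) for x, y in zip(features, labels)]
objective = primal_objective(w, b, features, labels, c=1.0)
```

Only the misclassified observation contributes a non-zero loss. Increasing the hypothetical value of `c` makes such violations more expensive relative to the margin term.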
2.2 Learnability
In subsections 2.1.1 and 2.1.3 the term risk was used without prior definition. Let us write $l(x, y, f)$ to
represent the loss incurred by a mapping function $f$ given observation $x$ when the desired outcome is $y$.
The value of $l(x, y, f)$ can be expressed in terms of a simpler function, $d$, that measures the "distance"
between two outcomes. In particular, $d(\hat{y}, y)$ denotes the observation-independent loss incurred by
choosing outcome $\hat{y}$ when the desired outcome is actually $y$.
Then, given $d$, $l$ can be expressed as equation 2.2.1, where $\hat{y}$ is a random variable denoting the
outcome prescribed by $f$ given observation $x$. In the case of a dichotomous outcome, $d(\hat{y}, y) = \mathbb{I}[\hat{y} \neq y]$,
yielding equation 2.2.2.
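$$l(x, y, f) = \mathbb{E}_f\left[d(\hat{y}, y)\right] \overset{\text{def}}{=} \sum_{\hat{y} \in \mathcal{Y}} d(\hat{y}, y)\, f(\hat{y} \mid x) \tag{2.2.1}$$
$$l(x, y, f) = 1 - f(y \mid x) \tag{2.2.2}$$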
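The step from equation 2.2.1 to equation 2.2.2 can be verified on a small example. The predicted distribution below is hypothetical and the function names are assumptions.

```python
def expected_loss(y_true, f_given_x, distance):
    """Sum over candidate outcomes of d(y_hat, y) * f(y_hat | x), equation 2.2.1."""
    return sum(distance(y_hat, y_true) * p for y_hat, p in f_given_x.items())


def zero_one(y_hat, y):
    """Indicator distance: 1 when the outcomes differ, 0 otherwise."""
    return 1.0 if y_hat != y else 0.0


# Hypothetical predicted distribution over the two outcomes for one observation.
f_given_x = {0: 0.7, 1: 0.3}

loss_if_true_is_1 = expected_loss(1, f_given_x, zero_one)
loss_if_true_is_0 = expected_loss(0, f_given_x, zero_one)
```

In both cases the expected loss equals one minus the probability that the mapping function assigns to the desired outcome, which is the statement of equation 2.2.2.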
The expected loss, already mentioned as risk, of a mapping function f is given by equation 2.2.3,
where µD is the unknown underlying distribution given by equation (2.2.4), with µ0 = P[X = x].
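$$L(f) = \mathbb{E}_{\mu_D}\left[l(x, y, f)\right] \overset{\text{def}}{=} \sum_{\text{obser.}} \sum_{y} l(x, y, f)\, \mu_D(x, y) \tag{2.2.3}$$
$$\mu_D(x, y) = P[Y = y \mid X = x]\, P[X = x] = f^*(y \mid x)\, \mu_0(x) \tag{2.2.4}$$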
In practice, prior knowledge usually takes the form of assumptions regarding the mapping function,
which introduces some biases. We refer to inductive bias as being the bias resulting from assumptions
that the learner uses to predict outputs given inputs that it has not encountered, and the set of mapping
functions considered by a learning agent as its hypothesis space, H.
The performance of a learning algorithm will depend on the hypothesis space considered and its
relation to the target mapping function f∗. For any learning algorithm A the following applies:
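$$\mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f)\right] = L^* + \left(\min_{f \in H} L(f) - L^*\right) + \left(\mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f)\right] - \min_{f \in H} L(f)\right) \tag{2.2.5}$$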
The first term of equation 2.2.5 corresponds to the unavoidable risk induced by the randomness of
the target mapping function. This term is usually referred to as the Bayes error rate and does not depend
on any particular design choice. The second term compares the risk of the best hypothesis in $H$ with
that of the Bayes optimal policy. In a sense, this term accounts for the impact of the inductive bias
on performance, where richer hypothesis spaces tend to imply a smaller inductive bias. It is, however,
independent of the learning algorithm and depends only on the choice of $H$. The third term measures
how accurately the learning algorithm $A$ is able to identify the best hypothesis in $H$. It depends both on
$A$ and on the choice of $H$.
2.3 Overfitting and Regularization
Given a training set D and an arbitrary mapping function f , the risk, L(f), is expressed in equation
2.3.1. The decomposition comprises two components contributing to the risk incurred by f . The first
term corresponds to the Empirical Risk associated with f , measuring how ”well fit” f is to the training
data in D. The second term measures how well the empirical risk estimates the true risk of f .
$$L(f) = L_N(f) + \left(L(f) - L_N(f)\right) \tag{2.3.1}$$
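More generally, given a learning algorithm $A$, the expected risk is written as in equation 2.3.2.
$$\mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f)\right] = \mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L_N(f)\right] + \mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f) - L_N(f)\right] \tag{2.3.2}$$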
A toy example reporting the average loss incurred as a parameter, named $C$, varies
from 0 to 7 is presented in figure 2.3. The dashed line corresponds to the empirical risk, obtained on
the training set, while the solid line corresponds to the true risk, estimated from data never seen by
the learning algorithm, usually referred to as the test set. The plot showcases the tradeoff
between fitting and stability. For small values of $C$ the system incurs a large loss both on the training and
on the test sets. This occurs when the loss incurred by the algorithm is dominated by the term
$\mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L_N(f)\right]$, a situation known as underfitting.
Conversely, for large values of C, the term ED∼µD,f∼A(D)
[L(f)− LN (f))
]dominates the loss in-
curred by the algorithm. In fact, as C increases, the empirical risk goes to zero, but the difference
between the empirical and true risks keeps increasing, indicating that the former becomes a bad esti-
mate for the latter. As a consequence, the learning algorithm becomes too specialized in the training
data and its ability to generalize increasingly deteriorates, as illustrated in figure 2.3. Such phenomenon
is known as overfitting.
Overfitting also implies that small changes to the training set may translate into significant changes
to the mapping function resulting from the learning algorithm. Therefore, in selecting the value for C,
there must exist a balance between the ability of the learning algorithm to fit the training data and its
stability to changes in such data.
The overfitting phenomenon results from the fact that µD is unknown and, therefore, one must rely
on the empirical risk as an estimate of L(f). Unfortunately, the empirical risk LN (f) underestimates the
true risk L(f), and the difference between the two increases as the complexity of the hypothesis space
increases. Hence, the selection of the hypothesis space involves a tradeoff between its ability to fit the
data and the stability of the learning algorithm. To some extent, the fitting-stability tradeoff in equation
2.3.2 is similar to the bias-complexity tradeoff in equation 2.2.5, and can be addressed by controlling the
hypothesis space.
[Figure 2.3: plot of Loss (y-axis, 0.0 to 0.9) against C (x-axis, 0 to 7), showing the Test Error and Training Error curves, with the two components $L_N(f)$ and $L(f) - L_N(f)$ marked.]
Figure 2.3: Empirical illustration of the fitting-stability tradeoff. The solid line corresponds to the
estimated true risk, while the dashed line corresponds to the empirical risk. The plot highlights the two
terms in equation 2.3.1.
An alternative approach to controlling the complexity of the hypothesis space is to use regularization.
As already stated in section 2.1.1, the Empirical Risk Minimization principle suggests that the empirical
risk can be used as a surrogate for the true risk associated with a mapping function. Therefore, in terms
of Empirical Risk Minimization, learning consists in solving the optimization problem expressed in
equation 2.3.3.
\[
\min_{f\in H} L_N(f) \tag{2.3.3}
\]
The use of regularization offers an alternative to the Empirical Risk Minimization principle, known as
the Regularized Loss Minimization (RLM) principle [57]. In terms of RLM, learning consists in solving
the optimization problem in equation 2.3.4, where R(f) is a regularization term associated with the
mapping function f. Typically, R(f) should be higher for more complex hypotheses and lower for simpler
ones, leading to a more parsimonious model. The regularization term thus acts as a "stabilizer",
indicating that it is admissible to perform a little worse on the training set if that means using a simpler
(and potentially more stable) hypothesis.
\[
\min_{f\in H} \bigl(L_N(f) + R(f)\bigr) \tag{2.3.4}
\]
The L1-regularization method, also called Least Absolute Shrinkage and Selection Operator (LASSO)
regression, adds the sum of the absolute values of the coefficients, represented by $\|W\|_1$ in equation
2.3.5, as a penalty term to the loss function. LASSO can shrink the coefficients of the less important
features to zero, i.e., some of the features are completely neglected in the evaluation of the outcome.
LASSO therefore helps to reduce the model complexity.
Equation 2.3.5 expresses the LASSO regularization function, where λ is the tuning parameter that
decides how heavily the flexibility of the model is penalized. An increase in the flexibility of a model is
reflected in an increase in its coefficients, so minimizing equation 2.3.4 requires these coefficients to be
small.
\[
R_{\mathrm{LASSO}}(f) = \lambda \|W\|_1 \tag{2.3.5}
\]
In the L2-regularization method, also called Ridge regression, the cost function is altered by adding
a penalty equivalent to the square of the magnitude of the coefficients, represented by $\|W\|_2^2$ in
equation 2.3.6. The tuning parameter λ regularizes the coefficients such that the objective function is
penalized if the coefficients take large values. Ridge regression therefore shrinks the coefficients.
\[
R_{\mathrm{Ridge}}(f) = \lambda \|W\|_2^2 \tag{2.3.6}
\]
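As a minimal sketch of equations 2.3.4 to 2.3.6, the two penalties and the regularized objective can be written in a few lines of Python. The coefficient vector, the value of λ, and the empirical risk used below are hypothetical values chosen only for illustration, and the Ridge penalty is taken to be the squared L2 norm, as described in the text.

```python
def lasso_penalty(weights, lam):
    """L1 penalty (equation 2.3.5): lam times the sum of absolute coefficients."""
    return lam * sum(abs(w) for w in weights)


def ridge_penalty(weights, lam):
    """L2 penalty (equation 2.3.6): lam times the sum of squared coefficients."""
    return lam * sum(w * w for w in weights)


def regularized_loss(empirical_risk, weights, lam, penalty):
    """Objective of equation 2.3.4: empirical risk plus a regularization term."""
    return empirical_risk + penalty(weights, lam)


# Hypothetical coefficients, tuning parameter and empirical risk.
weights = [0.5, -2.0, 0.0, 1.5]
lam = 0.1
empirical_risk = 0.3

l1 = lasso_penalty(weights, lam)   # 0.1 * (0.5 + 2.0 + 0.0 + 1.5)
l2 = ridge_penalty(weights, lam)   # 0.1 * (0.25 + 4.0 + 0.0 + 2.25)
objective_l1 = regularized_loss(empirical_risk, weights, lam, lasso_penalty)
objective_l2 = regularized_loss(empirical_risk, weights, lam, ridge_penalty)
```

Because the Ridge penalty squares each coefficient, the large coefficient (-2.0) dominates it far more than it dominates the LASSO penalty.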
2.4 Model Assessment
In order to measure the risk of a model under a given hypothesis, and the ability of that model to
generalize beyond the training set, it is necessary to choose a metric. According to [58, 59], the most
common metric used in healthcare is the ROC-AUC. It should be taken into account that statistical
models always carry some variability and uncertainty, since they depend on the population under study.
In order to quantify this uncertainty one resorts to confidence intervals, applied to the performance
metrics, and to McNemar's hypothesis test to compare ML algorithms.
2.4.1 Performance Metrics
The ROC-AUC, sometimes referred to as the c-statistic, is a performance measure for classification
problems evaluated across threshold settings. It indicates how capable the model is of distinguishing
between classes. The higher the ROC-AUC score, the better the model is at predicting the outcome.
An excellent model has a ROC-AUC score close to 1, which means it has a good measure of
separability. When the ROC-AUC score is 0.5, the model has no class separation capacity whatsoever,
i.e., it is no better than a random guess.
ROC plots the False Positive Rate (FPR), in the x-axis, versus the True Positive Rate (TPR), y-axis,
for a number of different candidate probability threshold values between 0.0 and 1.0. Put another way, it
plots the false alarm rate versus the hit rate.
[Figure 2.4: plot of TPR (y-axis, 0.0 to 1.0) against FPR (x-axis, 0.0 to 1.0), showing the ROC Curve and the "Model with no skill" reference line.]
Figure 2.4: AUC-ROC Curve for a dummy example using a LR model.
Figure 2.4 illustrates a ROC curve for a dummy dataset using a LR model. As one can note, each
probability threshold value corresponds to a (FPR, TPR) pair of values. The ROC curve can be used
not only as a performance measure but also as a tool to choose the threshold that best discriminates
between classes.
The TPR, also called sensitivity, describes how good the model is at predicting the positive class
when the actual outcome is positive. This quantity is calculated as expressed in 2.4.1, where True
Positive (TP) is an outcome where the model correctly predicts the positive class and False Negative
(FN) is an outcome in which the model predicts the negative class, when in reality it is positive.
\[
\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2.4.1}
\]
The FPR is the proportion of all negatives that still yield positive test outcomes, i.e., the conditional
probability of a positive test result given that the event was not present. Equation 2.4.2 shows how to
compute the FPR, where False Positive (FP) is an outcome in which the model predicts the positive
class when in reality it is negative.
\[
\text{FPR} = \frac{FP}{FP + TN} \tag{2.4.2}
\]
The FPR is also referred to as the inverted specificity where specificity measures the proportion
of actual negatives that are correctly identified as such. The relation between specificity and FPR is
given by equation 2.4.3, where True Negative (TN) is an outcome where the model correctly predicts the
negative class.
\[
\text{Specificity} = 1 - \text{FPR} = \frac{TN}{TN + FP} \tag{2.4.3}
\]
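The construction of the ROC curve described above can be sketched as follows: each candidate threshold is applied to the predicted probabilities, the resulting (FPR, TPR) pair is computed from equations 2.4.1 and 2.4.2, and the area under the curve is obtained with the trapezoidal rule. The labels and scores are hypothetical, and the function names `roc_points` and `auc` are introduced here only for illustration.

```python
def roc_points(y_true, scores):
    """Return one (FPR, TPR) pair per candidate threshold.

    A sample is predicted positive when its score is >= the threshold.
    Assumes y_true contains at least one positive and one negative label.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score (nothing predicted positive), then use each
    # distinct score as a threshold, from the highest to the lowest.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = positive class) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
area = auc(curve)  # 0.75 for this toy example
```

The curve always starts at (0, 0) and ends at (1, 1); a model with no skill would follow the diagonal between these two points and obtain an area of 0.5.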
Moving beyond the ROC curve and the performance measures associated with it, precision is another
way to assess model performance. Precision, presented in equation 2.4.4, measures the proportion of
real positives among all the instances that were identified as positive.
\[
\text{Precision} = \frac{TP}{TP + FP} \tag{2.4.4}
\]
Another way to present the values of precision and sensitivity is by using the F1-Score. The F1-Score
is the harmonic mean of precision and sensitivity, given by equation 2.4.5. If either precision or
sensitivity is very small, the F1-Score is closer to the smaller number than to the larger one, so a model
cannot obtain a high score through only one of the two.
\[
\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{2.4.5}
\]
In order to measure the interrater reliability one resorts to Cohen's Kappa statistic, κ. This
coefficient is statistically robust when dealing with unbalanced classes, given that it favors correct
classifications of the minority class over the majority one. The statistic is defined in equation 2.4.6,
where po is the empirical probability of agreement on the label assigned to any sample, and pe is the
expected agreement when both annotators assign labels randomly.
\[
\kappa = \frac{p_o - p_e}{1 - p_e} \tag{2.4.6}
\]
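The metrics of equations 2.4.1 to 2.4.6 can all be derived from the four cells of a binary confusion matrix, as in the sketch below. The counts are hypothetical, and for Cohen's Kappa the two "annotators" are assumed to be the model predictions and the true labels, with the chance agreement pe computed from their marginal frequencies.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of section 2.4.1 from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes occur and the model
    predicts the positive class at least once).
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)                                   # eq. 2.4.1
    fpr = fp / (fp + tn)                                           # eq. 2.4.2
    specificity = tn / (tn + fp)                                   # eq. 2.4.3
    precision = tp / (tp + fp)                                     # eq. 2.4.4
    f1 = 2 * precision * sensitivity / (precision + sensitivity)   # eq. 2.4.5
    # Observed agreement between predictions and true labels.
    p_o = (tp + tn) / n
    # Agreement expected by chance, from the marginal frequencies.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)                                # eq. 2.4.6
    return {
        "sensitivity": sensitivity,
        "fpr": fpr,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for an unbalanced problem with 60 positives in 1000.
metrics = classification_metrics(tp=40, fp=10, tn=930, fn=20)
```

In this toy case the accuracy would be 0.97, while the F1-Score and Kappa are noticeably lower, which illustrates why accuracy alone can be misleading for unbalanced classes.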
2.4.2 Bootstrap Confidence Interval
A confidence interval is a pair of bounds on the estimate of a population variable, e.g. a performance
metric. It is an interval statistic used to quantify the uncertainty of an estimate [60]. In binary
classification problems it is common practice to assume that the predicted variable can be described as a
succession of independent events that either succeed or fail, i.e. Bernoulli trials, and that, when the
sample is big enough, the predicted variable is normally distributed, so that a parametric confidence
interval can be computed. Despite their common use, the assumptions that underlie parametric
confidence intervals are often violated: the predicted variable is not always normally distributed and,
even if it is, either the variance of the normal distribution is not equal at all levels of the predictor
variable or the assumption that each sample is independent does not hold [61]. Since the underlying
distribution is unknown, one resorts to non-parametric estimation of confidence intervals. In these cases,
the bootstrap resampling method can be used as a non-parametric method for estimating confidence
intervals of a parameter for a given population. The bootstrap is a Monte Carlo simulation method in
which samples are drawn with replacement from a fixed and finite dataset and a parameter, e.g. a
performance metric, is estimated on each sample. This procedure leads to a robust estimate of the true
population parameter via sampling.
When bootstrapping in the field of ML it is common for the resampled population to have the same
size as the original one, but other proportions might be considered, e.g. 80%, depending on the time
complexity of the models, the computational power, and the available time. This resampled population
is used for training a ML model. When resampling is finished there will be some samples that were not
drawn; these are called out-of-bag samples and are used to test the model performance. In order to get
reliable estimates it is recommended to draw 50 to 200 bootstrap samples [62]. For each bootstrap
round the population variable is computed and stored, until the pre-defined number of bootstrap rounds
is reached. Given the stored values one can compute confidence intervals for a given degree of
confidence α. Confidence intervals are commonly computed at the 95% confidence level [63]. A 95%
confidence interval means that, when the sampling is repeated, one should expect about one in twenty
intervals not to include the true value of the population variable.
One way to compute the bounds of the confidence interval is the percentile method [64], where the
lower bound and the upper bound are the α1 and α2 percentiles of the bootstrapped distribution,
respectively. Given a degree of confidence α, one sets α1 = α and α2 = 1 − α, which yields the
100 × (1 − 2 × α)% confidence interval. For example, α = 0.025 gives the 2.5th and 97.5th percentiles
as the bounds of a 95% confidence interval.
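The procedure described in this section (resampling with replacement, testing on the out-of-bag samples, and taking percentiles of the stored metric) can be sketched as below. This is a minimal illustration rather than the implementation used in this work: `fit_and_score` is a hypothetical user-supplied function, the quantile is computed by simple linear interpolation, and the toy "model" is a majority-class predictor evaluated by accuracy.

```python
import random


def percentile(sorted_values, q):
    """Quantile q in [0, 1], by linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def bootstrap_ci(n_samples, fit_and_score, n_rounds=200, alpha=0.025, seed=0):
    """Percentile bootstrap interval for an out-of-bag performance metric.

    fit_and_score(train_idx, oob_idx) is a hypothetical callable that trains a
    model on train_idx and returns the metric measured on oob_idx.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        # Bootstrap sample: n_samples indices drawn with replacement.
        train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        drawn = set(train_idx)
        # Samples that were never drawn form the out-of-bag test set.
        oob_idx = [i for i in range(n_samples) if i not in drawn]
        if not oob_idx:
            continue  # skip the (very unlikely) round with no test data
        scores.append(fit_and_score(train_idx, oob_idx))
    scores.sort()
    return percentile(scores, alpha), percentile(scores, 1 - alpha)


# Toy data: 30 positive and 70 negative labels (hypothetical).
labels = [1] * 30 + [0] * 70


def majority_accuracy(train_idx, oob_idx):
    """Predict the majority class of the training sample; return OOB accuracy."""
    ones = sum(labels[i] for i in train_idx)
    prediction = 1 if 2 * ones > len(train_idx) else 0
    return sum(labels[i] == prediction for i in oob_idx) / len(oob_idx)


low, high = bootstrap_ci(len(labels), majority_accuracy)
```

With α = 0.025 the returned pair bounds a 95% interval, in line with the 100 × (1 − 2 × α) rule stated above.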
2.4.3 McNemar’s Hypothesis test
Hypothesis testing is a statistical method that is used in making statistical decisions using experi-
mental data. Hypothesis testing is an essential procedure in statistics since it evaluates two mutually
exclusive statements about a population to determine which statement is best supported by the sample
data. The parameters associated with hypothesis testing are:
• Null Hypothesis (H0) - General statement that there is no relationship between two measured
phenomena, or no association among groups;
• Alternative Hypothesis (H1) - Hypothesis used in hypothesis testing that is contrary to the null
hypothesis;
• Level of significance (α) - Threshold on the P-value at which one rejects, or fails to reject, the
null hypothesis;
• P-value (p) - Probability of obtaining results at least as extreme as the observed ones when the
null hypothesis of a study question is true.
As with confidence intervals, there are some complications when choosing the right approach to
hypothesis testing. There are also parametric and non-parametric strategies, where the parametric ones
assume normally distributed data. Since this assumption is commonly violated, one can resort to
non-parametric, i.e., distribution-free, hypothesis tests. Another advantage of this approach is its
computational efficiency: it is suggested when it is expensive to train multiple copies of classifier models
on big datasets [65]. McNemar's hypothesis test assesses whether a statistically significant change in
proportions has occurred on a dichotomous trait. This test operates upon a 2 × 2 contingency table,
which counts the joint outcomes of two binary variables, here whether each of two classifiers is correct
on each instance of the test set, as illustrated in tables 2.1 and 2.2.
Table 2.1: Dummy example for the development of a Contingency Table.
Instance   Classifier1 Correct   Classifier2 Correct
1          Yes                   No
2          No                    No
3          No                    Yes
4          No                    No
5          Yes                   Yes
6          Yes                   Yes
7          Yes                   Yes
8          No                    No
9          Yes                   No
10         Yes                   Yes

Table 2.2: Contingency table derived from the dummy example.
                        Classifier1 Correct   Classifier1 Incorrect
Classifier2 Correct     4 (Yes/Yes)           1 (No/Yes)
Classifier2 Incorrect   2 (Yes/No)            3 (No/No)
McNemar's test statistic is calculated as in equation 2.4.7, where Yes/No is the count of test
instances that Classifier1 classified correctly and Classifier2 classified incorrectly, and No/Yes is the
count of test instances that Classifier1 classified incorrectly and Classifier2 classified correctly. For the
dummy example of table 2.2, Yes/No = 2 and No/Yes = 1, which gives a statistic of
(2 − 1)²/(2 + 1) ≈ 0.33.
Note that the statistic reports on the different correct or incorrect predictions between the two models,
not on their accuracy or error rates. In terms of comparing two binary classification algorithms, the test
indicates whether or not the two models disagree in the same way; it does not indicate whether one
model is more or less accurate than the other.
The two terms used in the calculation of McNemar's test capture the errors made by the two models.
The test checks whether there is a significant difference between the counts in these two cells. If the
cells have similar counts, both models make errors in approximately the same proportion, only on
different instances of the test set. In this case, the result of the test would not be significant and the null
hypothesis would not be rejected, given a significance level, as illustrated in equation 2.4.8a, where p is
the p-value. If the Yes/No and No/Yes counts are not similar, both models not only make different
errors but in fact have a different relative proportion of errors on the test set. In this case, the result of
the test would be significant and one would reject the null hypothesis, given a significance level, as
summarized in equation 2.4.8b.
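\[
\text{statistic} = \frac{(\text{Yes/No} - \text{No/Yes})^2}{\text{Yes/No} + \text{No/Yes}} \tag{2.4.7}
\]
\[
p > \alpha \implies \text{fail to reject } H_0 \tag{2.4.8a}
\]
\[
p \le \alpha \implies \text{reject } H_0 \tag{2.4.8b}
\]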
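The dummy example of tables 2.1 and 2.2 can be worked through in code as a check of equation 2.4.7 and of the decision rule in equations 2.4.8a and 2.4.8b. The p-value below assumes that, under the null hypothesis, the statistic approximately follows a chi-squared distribution with one degree of freedom; this is the usual approximation for McNemar's test but is not stated in the text, and with so few discordant instances an exact binomial test would normally be preferred.

```python
import math

# Table 2.1: whether each classifier was correct on each of the 10 instances.
clf1_correct = [True, False, False, False, True, True, True, False, True, True]
clf2_correct = [False, False, True, False, True, True, True, False, False, True]


def mcnemar(a_correct, b_correct):
    """Return (Yes/No count, No/Yes count, statistic, p-value).

    Assumes at least one discordant instance, so the denominator is non-zero.
    """
    yes_no = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    no_yes = sum(1 for a, b in zip(a_correct, b_correct) if not a and b)
    statistic = (yes_no - no_yes) ** 2 / (yes_no + no_yes)   # eq. 2.4.7
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    # (assumed approximation): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return yes_no, no_yes, statistic, p_value


yes_no, no_yes, statistic, p_value = mcnemar(clf1_correct, clf2_correct)

alpha = 0.05  # hypothetical significance level
reject_h0 = p_value <= alpha  # eq. 2.4.8a / 2.4.8b
```

For this example the discordant counts are 2 and 1, the statistic is 1/3, and the p-value is well above 0.05, so the null hypothesis is not rejected.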
Chapter 3
Natural Language Processing
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that deals with the
interaction between computers and humans using natural language. The objective of NLP is to enable
computers to read, decipher, understand, and process human languages, bringing computers closer to
a human-level understanding of language. Computers do not have the same intuitive understanding of
natural language that humans do; they cannot truly understand what a piece of language is trying to
say.
Language is a form of communication between humans. One person emits a message, expressed
as a particular combination of acoustic or graphic signs, to another person who shares some
common-sense knowledge with the sender, which should enable the receiver to comprehend the message.
It was the philosopher Charles Morris who introduced the triplet "syntax-semantics-pragmatics" [66]
and stated that the study of pragmatics encompasses the whole situation of the individual who speaks
or hears. It includes the study of semantics, which concerns the relationship of expressions to their
meaning. The study of syntax looks at the properties and structure of a language. The lexicon and
morphology are sublevels of the syntactic dimension and concern the analysis of words and word
formation (inflection, derivation and compounding).
Each dimension can create ambiguities, some of which must be resolved through knowledge of the
actual situation described by the sentence. A complete NLP framework ought to be able to deal with all
of these dimensions (to a limited degree) [67]. Some propose a cascaded architecture in which the
ambiguities of the lower levels are resolved successively by the subsequent levels [68].
At present, in the field of linguistics, it is considered valuable to limit research to well-characterized
sublanguages [69]. A sublanguage is a specialized language used by the different actors in a
specialized field to convey specific messages. A specialized language exhibits a few attributes that
distinguish it from the general language [70]. An obvious point of contrast is the vocabulary. In each
sublanguage, common words can be found with their standard meaning. However, many general
language words can take on a more restricted and specific meaning in the context of a sublanguage [69].
Other words have a general meaning alongside a more specialized one, of which only the latter is used
in the context of the sublanguage. Finally, for every specialized domain there exists a set of very specific
vocabulary that is used almost exclusively in that domain [71]. An example of a general language
partitioned into sublanguages, with their respective vocabulary, is depicted in figure 3.1.
[Figure 3.1: diagram of a General Language partitioned into an Engineering Sublanguage (Thermodynamics, Mechanics, Control), a Healthcare Sublanguage (Emergency Medicine, Intensive Care, Triage), a Philosophy Sublanguage (Ethics, Metaphysics, Existentialism), and a Social media Sublanguage (Slang, Abbreviation).]
Figure 3.1: Partition of a general language into specialized languages.
Computers are great at working with standardized and structured data, since they are able to process
such data much faster than humans can. Humans, however, do not communicate in structured data; as
already stated, they resort to language, a form of unstructured data. When programming a computer,
the developer is essentially giving it a set of rules by which it should operate. With unstructured data,
these rules are quite abstract and challenging to define concretely, due to ambiguities arising from the
"syntax-semantics-pragmatics" triplet.
Humans have been recording text data over the years. Over that time, the human mind has
accumulated a vast amount of experience in understanding natural language. When we read something,
we understand what it really means because we know its context.
That being said, recent advances in Machine Learning have enabled computers to perform a
considerable number of valuable tasks with natural language [72], such as machine translation, speech
recognition, speech synthesis, semantic understanding, and text summarization [73,74].
Given this overview of NLP, one may ask: what makes NLP difficult? The following list summarizes
some answers to this question:
• Number of natural languages - each one has distinctive linguistic rules;
• Ambiguity - the meaning of a sentence or word depends on its context;
• Idiomatic expressions - in each natural language there are expressions that gain new
connotative meanings that go beyond their literal meanings;
• Grammar - homonyms, homophones, homographs, paronyms and synonyms;
• Negative structure - a sentence can be in the negative form.
The following subchapters provide a theoretical explanation of the techniques used to process the
text in order to deal with such complex data.
3.1 Regular Expressions
Formally, a Regular Expression (RE) is an algebraic notation for characterizing a set of strings, i.e.,
it can be seen as an arrangement of one or more character literals, operators, or constructs that define
a search pattern. This strategy is typically used by string search algorithms in tasks such as 'find' or
'find and replace' on a string. Given a search pattern, the RE will scan through the corpus of texts,
returning all textual instances that match the pattern. Taking this into account, one can say that this
technique is very useful for extracting information from any text.
Like any language, RE has its own vocabulary and syntax that one must know. One of the operations
within the domain of RE is concatenation. Given two RE A and B, they can be concatenated to form a
new RE AB. In general, if a string s1 matches A and another string s2 matches B, the string s1s2 will
match AB. Suppose that the document being analyzed is 'The production management exam was very
difficult!'. The RE A = exam will match the string s1 = 'exam' and the RE B = difficult will match the
string s2 = 'difficult'. If A and B are concatenated into AB, the resulting RE matches only
s1s2 = 'examdifficult', which does not occur in this document; by contrast, concatenating A with the
RE C = ' was' (including the leading space) gives AC, which matches 'exam was'. The previous example
shows one of the most common uses of RE, matching characters, where most letters and characters
simply match themselves. There are some exceptions to this rule: some characters, called
metacharacters, do not match themselves. Rather, they signal that something out of the ordinary should
be matched, or they affect other parts of the RE by repeating them or changing their meaning. Table 3.1
summarizes these characters and their respective roles.
In order to best understand how these metacharacters work, a set of examples within the healthcare
scope is presented, based on table 3.2. Suppose that the managers of the healthcare facility want to
obtain the information of patients less than 10 years old. One simple solution uses the . character:
applying the search pattern ^.$ to the Age field matches any age written with a single character,
yielding the last row of table 3.2. Another solution would be to use a set of characters, by means of the
[ ] characters, as in ^[0-9]$. This search pattern returns a match for any age between 0 and 9. If the
objective now is to extract all rows where the patient temperature is 40 °C or above, one can resort to
the metacharacter ^, with the search pattern ^4 applied to the Temperature field. This returns a match
for every temperature whose value starts with a 4. If the healthcare managers want to know how the
number of visits has been evolving over the years, it is necessary to extract the year from the Date field.
One way to solve this task is by using a concatenation of special sequences, introduced by the
metacharacter \, such as \d\d\d\d, which matches any four consecutive digits, from 0000 to 9999. One
can rewrite the last RE, using the { } characters, as \d{4}, which searches for a digit, in the range 0-9,
repeated four times. Suppose that now the goal is to create a new table with all patients that have
Non-urgent and Standard priorities, coded as 5 and 4 respectively. A RE that can extract this
information is 4|5, where a match exists if the digit 4 or the digit 5 appears. If the healthcare managers
need to extract the patients whose discriminators are either Head Injury or Chest Pain, one can use
the * character and build the RE he*, which (ignoring case) matches any string that contains an h,
whether it is followed by no e, one e, or several. Other solutions consist of using + or ? instead of *,
which impose that the character e appears at least once or at most once, respectively.
In the previous set of examples the metacharacter \ was used in order to search for the year of a
date. This was achieved using the \d special sequence. Table 3.3 summarizes the special sequences in
the RE language with a short description. Despite the simplicity of the dummy example, it is possible to
see the versatility and potential of RE.
Table 3.1: Regular Expression (RE) metacharacters description.
Character   Description
.           Matches any character except a newline
^           Matches the start of the string
$           Matches the end of the string
*           Matches 0 or more repetitions of the preceding RE
+           Matches 1 or more repetitions of the preceding RE
?           Matches 0 or 1 repetitions of the preceding RE
{ }         Matches exactly the specified number of repetitions of the preceding RE
[ ]         Used to indicate a set of characters
\           Either escapes special characters, or signals a special sequence
|           Either-or operator
( )         Captures and groups whatever RE is inside the parentheses
Table 3.2: Dummy Example for Regular Expressions.
ID   Temperature [°C]   Gender   Age   Date               Priority   Discriminator
1    37.5               M        37    21-03-2017 09:42   5          Dyspnea
2    42                 F        28    01-11-2018 15:22   3          Fever
3    37                 F        42    09-01-2019 08:09   5          Head Injury
4    38                 M        58    14-10-2018 04:17   4          Chest Pain
5    36.8               F        6     15-04-2019 04:17   5          Head Injury
Table 3.3: Special Sequences description.
Sequence   Description
\A         Matches only at the start of the string
\B         Matches the empty string, but only when it is not at the beginning or end of a word
\b         Matches the empty string, but only at the beginning or end of a word
\D         Matches any non-digit character
\d         Matches any decimal digit
\S         Matches any non-whitespace character
\s         Matches any whitespace character
\W         Matches any non-alphanumeric character
\w         Matches any alphanumeric character and the underscore
\Z         Matches only at the end of the string
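The healthcare examples of this section can be reproduced with Python's standard `re` module, as sketched below. The rows are transcribed from table 3.2 with every field kept as text; the helper `select` is introduced here only for illustration, and the patterns are written with explicit anchors (^ and $) where a whole field must match. Case is ignored for the discriminator search because "Head" starts with a capital letter.

```python
import re

# Rows of table 3.2, with every field stored as a string.
rows = [
    {"ID": "1", "Temperature": "37.5", "Gender": "M", "Age": "37",
     "Date": "21-03-2017 09:42", "Priority": "5", "Discriminator": "Dyspnea"},
    {"ID": "2", "Temperature": "42", "Gender": "F", "Age": "28",
     "Date": "01-11-2018 15:22", "Priority": "3", "Discriminator": "Fever"},
    {"ID": "3", "Temperature": "37", "Gender": "F", "Age": "42",
     "Date": "09-01-2019 08:09", "Priority": "5", "Discriminator": "Head Injury"},
    {"ID": "4", "Temperature": "38", "Gender": "M", "Age": "58",
     "Date": "14-10-2018 04:17", "Priority": "4", "Discriminator": "Chest Pain"},
    {"ID": "5", "Temperature": "36.8", "Gender": "F", "Age": "6",
     "Date": "15-04-2019 04:17", "Priority": "5", "Discriminator": "Head Injury"},
]


def select(field, pattern, flags=0):
    """IDs of the rows whose field contains a match for the pattern."""
    return [row["ID"] for row in rows if re.search(pattern, row[field], flags)]


under_ten = select("Age", r"^[0-9]$")          # ages written with one digit
high_temperature = select("Temperature", r"^4")  # values starting with a 4
low_priority = select("Priority", r"4|5")      # priority coded as 4 or 5
head_or_chest = select("Discriminator", r"he+", re.IGNORECASE)

# First run of four consecutive digits in each date, i.e. the year.
years = [re.search(r"\d{4}", row["Date"]).group() for row in rows]
```

Here `under_ten` contains only patient 5 and `high_temperature` only patient 2, matching the rows discussed in the text.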
3.2 Tokenization
Tokenization can be described as the task of segmenting texts into tokens, the smallest units of
information, for further processing. It can be seen as one of the first steps when pre-processing textual
data. Tokenization strategies are not trivial to implement, since they depend on the language (or
sublanguage), on the goal, and on the quality of the data [75,76].
First it is necessary to specify what the smallest unit of information is. Depending on the goal, this
can be anything from a character to a compound. In this work the smallest unit of information, the token
from now on, is the word. Secondly, the way the tokenizer is built depends on the domain where it is
going to be applied.
In the Portuguese language one simple strategy to build a tokenizer is to consider all spaces as
delimiters between tokens. Applying this tokenizer to the sentence 'Hoje o dia esta soalheiro' ('Today it
is sunny'), the tokenizer will output ['Hoje','o','dia','esta','soalheiro']. Adding a full stop to the same
example yields the sentence 'Hoje o dia esta soalheiro.', which for us humans is the same. Applying the
same tokenizer will result in ['Hoje','o','dia','esta','soalheiro.']. Here a problem arises: the full stop is
attached to the word 'soalheiro' ('sunny'). Humans can interpret 'soalheiro.' as being 'soalheiro', but for
a computer these are two different tokens, resulting in a problem very common in NLP [77], known as
data sparseness. Another approach is to assume that both spaces and punctuation are delimiters
between tokens; the resulting tokenizer will output ['Hoje','o','dia','esta','soalheiro','.'], which is the desired
result. The drawback of this tokenizer appears when dealing with the sentence 'Esqueci-me do
guarda-chuva' ('I forgot the umbrella'), since it will output ['Esqueci','-','me','do','guarda','-','chuva']
instead of ['Esqueci-me','do','guarda-chuva'], and thus the original semantic meaning is lost.
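The two tokenizers discussed above can be sketched with a string split and a regular expression, as below. The third function, which keeps hyphenated words together, is an additional illustrative variant that is not described in the text; whether a clitic form such as 'Esqueci-me' should stay as one token depends on the goal of the application.

```python
import re


def whitespace_tokenize(text):
    """Treat spaces as the only delimiters between tokens."""
    return text.split()


def punctuation_tokenize(text):
    """Treat spaces and punctuation as delimiters.

    A token is either a run of word characters or a single punctuation mark.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def hyphen_aware_tokenize(text):
    """Illustrative variant: keep words joined by internal hyphens together."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


sentence = "Hoje o dia esta soalheiro."
hyphenated = "Esqueci-me do guarda-chuva"

by_space = whitespace_tokenize(sentence)
# ['Hoje', 'o', 'dia', 'esta', 'soalheiro.']
by_punctuation = punctuation_tokenize(sentence)
# ['Hoje', 'o', 'dia', 'esta', 'soalheiro', '.']
split_hyphens = punctuation_tokenize(hyphenated)
# ['Esqueci', '-', 'me', 'do', 'guarda', '-', 'chuva']
kept_hyphens = hyphen_aware_tokenize(hyphenated)
# ['Esqueci-me', 'do', 'guarda-chuva']
```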
There are other types of methodologies for tokenization that deal with the complications described
above, resorting to approaches such as the Viterbi algorithm, Hidden Markov Models, and
Part-of-Speech tagging [78–81]. Another sort of difficulty in the tokenization process, when using these
types of statistical methods, is dealing with misspellings. When doing part-of-speech tagging, the
process of categorizing words based on their definition and their relationship with adjacent words, it is
necessary to have words spelled correctly in order to obtain a word's part-of-speech [82]. The Viterbi
algorithm is a dynamic programming algorithm whose goal is to find the most likely sequence of hidden
states that explains a sequence of observations for a given stochastic model. One way to use it for
tokenization is to compute the probabilities of the words in a sentence, based on their characters, and
choose the most likely sequence of characters [83]. Computing each probability requires a word
frequency list that best describes the (sub)language. This is not trivial, since it is extremely complicated,
not to say impossible, to obtain a corpus that covers all situations of the (sub)language. Even assuming
that such a fully developed corpus exists, the problem associated with misspellings persists, since a
misspelling breaks the most likely sequence of characters, e.g. 'amnorreia' (the word 'amenorreia'
('amenorrhea') with a misspelling) may yield ['amnor','reia'] depending on the corpus used. One way to
counter this is by smoothing strategies, already described in subsection 2.1.2.
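To make the idea of choosing the most likely sequence more concrete, the sketch below segments an unspaced string with a Viterbi-style dynamic programme over a word frequency list. It is a simplified illustration under stated assumptions, not the method of [83]: it uses a word-level unigram model, the frequency list `WORD_COUNTS` is invented for the example, and unseen strings receive an arbitrary length-based penalty in place of a proper smoothing strategy.

```python
import math

# Hypothetical word frequency list; a real one would come from a domain corpus.
WORD_COUNTS = {"dor": 50, "de": 200, "cabeca": 20, "do": 80, "r": 1,
               "cabe": 5, "ca": 2}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(word) for word in WORD_COUNTS)


def log_prob(word):
    """Unigram log-probability; unseen strings get a heavy length-based penalty."""
    if word in WORD_COUNTS:
        return math.log(WORD_COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Most likely split of text into words under the unigram model.

    best[i] is the best log-probability of any segmentation of text[:i];
    back[i] is the start index of the last word in that segmentation.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + log_prob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


tokens = segment("dordecabeca")  # 'dor de cabeca' ('headache') without spaces
```

With this toy frequency list the string is split into 'dor', 'de' and 'cabeca'. A misspelled input would instead be forced through the penalty for unseen strings, which is the failure mode described above.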
In order to achieve good results with these techniques it is necessary to have a rich annotated corpus
that can best describe the (sub)language. However, to the best of the author's knowledge, there are no
free and open-source resources for the Portuguese language focused on the healthcare domain.
3.3 Word Similarity
One of the problems that concerns a NLP system, as stated before, is the detection of misspellings.
Most typographical errors, typos from now on, can be described by delete, transpose, replace, or insert
operations on a small number of characters. A special form of this kind of typo is known as an atomic
typo. An atomic typo is a misspelling where the misspelled word is itself an existing word, e.g., "I did not
understand the massage you sent me", where "massage" is a typo of "message". In order to deal with
such typos it is necessary to resort to advanced techniques that take into account the context of the
word. Another kind of typo involves phonetically similar words, e.g. "If the resultant force on an object is
zero, a stationery object stays stationery", where "stationery" is a typo for "stationary".
A word can be compared to another by means of its orthographic form or its phonetic value; in this
work only the orthographic form was considered. String similarity functions can be partitioned into two
groups. One group compares strings by measuring their distance, where a smaller distance indicates
more similar strings; the other group calculates a ratio, usually ranging from 0 to 1, where 0 indicates no
similarity between the strings. This last group of similarity functions is known as approximate string
matching. Immense work has been devoted to spellchecking and a panoply of methods exist, such as
dictionary-based spelling correction, N-gram methods, word frequencies, edit distance, and phonetic
algorithms, among others [84–86].
In the present work approximate string matching was used. An efficient and conceptually simple
method for measuring string similarity is the Jaro–Winkler distance, given by equation 3.3.1c. The
Jaro–Winkler distance is an extension of the Jaro similarity, given by equation 3.3.1a, where m
is the number of matching characters between strings s1 and s2, t is half the number of transposition
operations, and |s1| and |s2| are the lengths of strings s1 and s2, respectively. The Jaro similarity
scores the number of common characters that appear in the correct order. The Jaro–Winkler extension
assumes that differences near the start of the string are more significant than differences near the end
of the string [87,88], as expressed in equation 3.3.1b, where l is the length of the common prefix, up to
a maximum of four characters, and p is a constant scaling factor that controls how much the score is
adjusted upwards for common prefixes. Empirical work recommends a value of p = 0.1 [89]. The
distance ranges from 0 to 1, and a value of 0 means that the strings are identical. This string similarity
function was chosen due to its computational speed and scalability, because it usually performs well in
short string comparisons, and because it is extensively used for duplicate detection in record linkage
[90,91].
As an example, assume s1 = jon and s2 = john. The matching characters between these strings are
’jon’, yielding m = 3. The matching characters are already in the same order, so the number of
transpositions is 0, giving t = 0. The lengths of the strings are |s1| = 3 and |s2| = 4. Substituting these
values into equation 3.3.1a yields sim_j(s1, s2) = (1/3)(3/3 + 3/4 + 3/3) = 0.917. The longest common
prefix is ’jo’, resulting in l = 2. Substituting the values of sim_j and l into equation 3.3.1b, with
p = 0.1, gives a similarity of sim_jw(s1, s2) = 0.933 and therefore a Jaro–Winkler distance of
d_jw(s1, s2) = 0.067, meaning that the strings are very similar.
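$$\mathrm{sim}_{j}(s_1, s_2) = \begin{cases} 0, & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right), & \text{otherwise} \end{cases} \tag{3.3.1a}$$
$$\mathrm{sim}_{jw}(s_1, s_2) = \mathrm{sim}_{j}(s_1, s_2) + l\,p\,\left(1 - \mathrm{sim}_{j}(s_1, s_2)\right) \tag{3.3.1b}$$
$$d_{jw}(s_1, s_2) = 1 - \mathrm{sim}_{jw}(s_1, s_2) \tag{3.3.1c}$$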
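Equations 3.3.1a to 3.3.1c can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work. The matching window (characters only count as matching when they are at most floor(max(|s1|, |s2|)/2) − 1 positions apart) comes from the standard definition of the Jaro similarity and is not stated in the text above; the function names are hypothetical.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity, equation 3.3.1a."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 1.0 if s1 == s2 else 0.0
    # Standard Jaro matching window (assumption: not spelled out in the text).
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that are out of order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler_distance(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler distance, equations 3.3.1b and 3.3.1c."""
    sim_j = jaro_similarity(s1, s2)
    # l is the length of the common prefix, capped at four characters.
    l = 0
    for a, b in zip(s1, s2):
        if a != b or l == 4:
            break
        l += 1
    sim_jw = sim_j + l * p * (1 - sim_j)
    return 1 - sim_jw
```

For the pair ’jon’ and ’john’ the sketch follows the same steps as the worked example: m = 3, t = 0 and l = 2.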
3.4 Text Normalization
Imagine that you are presented with these strings: ’12-2-2019’, ’12/2/2019’, ’12 February 2019’,
’12fev19’. A human reader easily recognizes that they all represent the same date. But how does
a computer interpret these strings? For a computer each string is different from the others, despite the
meaning being the same. Text normalization can be seen as the task of reducing all textual data to a
canonical form in order to reduce data sparsity.
Usually, one of the first steps in text normalization is to lowercase, or uppercase, the textual data.
Depending on the application, this might not be a good idea, since ’US’ (United States) becomes ’us’
after lowercasing and some form of disambiguation needs to exist in the NLP system in order to know
whether ’us’ represents the noun or the pronoun. Another aspect to take into consideration is
abbreviations, since many abbreviations may represent the same thing, e.g., the abbreviations ’msd’,
’msupd’, ’membsupdir’, and ’msdireito’ all stand for ’membro superior direito’ (’right upper limb’).
Other textual characteristics that need to be handled are those related to temporal representation.
Extracting temporal information is a challenging task and considerable work has been dedicated to it
[92]. Suppose that the sentence ’I had paracetamol one hour ago’ is given as chief main complaint to the
healthcare provider during triage. From this information alone, the healthcare provider can infer
that the patient is complaining about a fever or some sort of pain, and knows that the patient cannot take
more substances that contain paracetamol. Temporal expressions are important since they enrich the
data with information about events and the time or day they happened, with a certain degree
of uncertainty.
Suppose you are provided with a text document that contains the following sentences: ’Ana studies a
lot. She has been studying for seven days.’ The reader directly understands that Ana dedicates her time
to study, given the tokens ’studies’ and ’studying’ and their context. It is normal to encounter different
forms of a word in a document, since grammatical rules exist and some words are derivationally
related to other words with similar meanings, e.g., democracy, democratic, and democratization. One
way to reduce the level of sparsity is to reduce inflectional forms, and at times derivationally related
forms, of a word to a common base form, which can be accomplished by using a stemmer. Stemming
techniques work by removing the end or the start of the word by means of a crude heuristic process
that considers a list of common prefixes and suffixes found in inflected words. This crude cutting is
successful on some occasions, but not always. To the best of the author’s knowledge, only a few
stemmers exist for the Portuguese language, namely the Porter and the RLSP stemmers. In this work
the RLSP stemmer was used, since it makes fewer understemming and overstemming errors than the
Portuguese version of Porter’s algorithm [93].
The RLSP algorithm, illustrated in figure 3.2, is composed of eight steps, each with its own set of
rules. The first step, Plural Reduction, has eleven stemming rules, since not all plural forms
in Portuguese end in ’-s’ and not all words ending in ’-s’ are plurals, e.g., ’oculos’ (’glasses’).
The Feminine Reduction step is composed of fifteen stemming rules and consists of transforming words
in the feminine form into their corresponding masculine form. This step only considers words ending
in ’-a’. The third step, Adverb Reduction, is the simplest, since it only deals with adverbs that
have the suffix ’-mente’. However, not every word with this suffix is an adverb, and when two or more
adverbs with this suffix are used in a sentence only the last one carries it, e.g., ’O
Joao apresentou-se pobre, triste e humildemente, por causa do que tinha acontecido.’ (’John showed
himself poor, sad and humble because of what had happened.’), where ’pobre’ and ’triste’ correspond
to adverbs. The Augmentative Reduction step comprises 23 rules and is responsible for removing the
suffixes of nouns and adjectives in the augmentative, diminutive, or superlative form. The fifth step
tests words against 84 noun endings; if a suffix is removed, the stemming process is finished and
the remaining steps are not executed. The Verb Reduction step analyzes the form of the
verb: a regular verb has over fifty forms, each with its own suffix. Given the number of
possibilities, the verbal forms are reduced to their root form using 101 rules. The Vowel Removal step
consists of removing the last vowel of the word. Lastly, the Accents Removal step, as the name states,
consists of removing accents, for which there are eleven stemming rules. As an example, consider the
following words: ’medicado’, ’medicada’ (both stand for ’medicated’), ’medicamento’ (’medicine’),
’medicamentos’ (’medicines’), ’medicacao’ (’medication’), and ’medicacoes’ (’medications’). These are
six different words, but after applying the RLSP Stemmer they all become ’medic’, showing the
importance of this task and its ability to reduce sparsity.
3.5 N-grams
An N-gram can be briefly defined as a contiguous sequence of N items from textual data. The items
can be characters, words, phonemes, syllables, etc. Since the smallest unit of information in this work
is the word, N-grams here always refer to sequences of N contiguous words. Therefore, a unigram
is a sequence of one word, a bigram is a sequence of two words, a trigram is a sequence of three words,
and so on. N-grams are widely applied in language identification, spelling error detection and correction,
query expansion, information retrieval with serial, inverted and signature files, dictionary look-up, missing
phoneme guessing, machine translation, spam filtering, topic spotting, and text compression [94–96].
As an example, consider the sentence ’Once upon a time’. With a unigram model, the result is
[’Once’, ’upon’, ’a’, ’time’]. With bigrams, the result is [’Once upon’, ’upon a’, ’a time’], and with
trigrams, [’Once upon a’, ’upon a time’]. Different N-gram models can also be combined, for example
by considering all N-grams with N ranging from 1 to 3, which outputs [’Once’, ’upon’, ’a’, ’time’,
’Once upon’, ’upon a’, ’a time’, ’Once upon a’, ’upon a time’].
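The word N-gram example can be reproduced with a few lines of Python. This is only an illustrative sketch: it assumes whitespace tokenization, which is simpler than the tokenizer used in this work, and the function names are hypothetical.

```python
def ngrams(sentence: str, n: int) -> list:
    """Return the contiguous word N-grams of a sentence (whitespace tokens)."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_range(sentence: str, n_min: int, n_max: int) -> list:
    """Combine all N-grams with N ranging from n_min to n_max (inclusive)."""
    combined = []
    for n in range(n_min, n_max + 1):
        combined.extend(ngrams(sentence, n))
    return combined
```

Calling `ngram_range('Once upon a time', 1, 3)` lists the unigrams first, then the bigrams, then the trigrams, in the same order as the combined example.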
3.6 Feature Extraction
In order to use textual data in an ML algorithm, it is necessary to have some sort of numerical
representation of it. The objective of this section is to provide the reader with a theoretical overview
of two techniques used to extract knowledge from textual data.
3.6.1 Bag of Words
Bag of Words (BoW) is an approach in NLP that represents a document as the multi-set of N-grams
that appear in it. This creates a simplified vector representation of the text, where the (frequency of)
occurrence of each N-gram is later used as a feature for an ML model. In this simple model, the syntax
and even the order of words are discarded, recording only whether, or how often, a word is present in
a document.
Figure 3.2: Flowchart for the RLSP Stemmer. Adapted from [93].
Let’s assume that the set of text documents is given by [s1, s2], where s1 = ’John likes to program
Vanessa also likes to program’ and s2 = ’John also likes to go to concerts’. Using the unigram model,
the list of unique words, also described as the vocabulary, is given by [’John’, ’likes’, ’to’, ’program’,
’Vanessa’, ’also’, ’go’, ’concerts’]. Assuming the same ordering, the vectorized documents are given in
table 3.4.
Table 3.4: Bag of Words representation for the dummy example.
      ’John’  ’likes’  ’to’  ’program’  ’Vanessa’  ’also’  ’go’  ’concerts’
s1    1       2        2     2          1          1       0     0
s2    1       1        2     0          0          1       1     1
As can be noted, the vectorized representation of each document did not preserve the original word
order, losing its syntactic and semantic meaning, and even in this simple example the zero entries are
noticeable. As the size of the vocabulary increases, the sparseness of the feature vectors also increases.
Similarly, the BoW model can consist of count frequencies of bigrams, trigrams, combinations of
unigrams and bigrams, etc.
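The unigram Bag of Words vectors of the dummy example can be computed with the short sketch below. It assumes whitespace tokenization and builds the vocabulary in order of first appearance, which happens to give the same ordering as the example; the function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents: list) -> list:
    """Unique words across all documents, in order of first appearance."""
    vocabulary = []
    seen = set()
    for document in documents:
        for token in document.split():
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


def bag_of_words(document: str, vocabulary: list) -> list:
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


s1 = "John likes to program Vanessa also likes to program"
s2 = "John also likes to go to concerts"
vocabulary = build_vocabulary([s1, s2])
vectors = [bag_of_words(s, vocabulary) for s in (s1, s2)]
```

Each entry of `vectors` is one row of the Bag of Words table, and the zero entries show how sparsity arises as soon as documents do not share all words.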
3.6.2 Term Frequency-Inverse Document Frequency
Term Frequency-Inverse Document Frequency (tf-idf) is a class of techniques to represent a document
in a vectorized form, and it is useful for identifying the signature N-grams of a document. The Term
Frequency, expressed in equation 3.6.1, is related to the output of the BoW model, since it is defined
as how frequently the N-gram appears in the document, measuring its local importance.
$$\mathrm{tf}(N\text{-gram}) = \frac{\text{Number of times the } N\text{-gram appeared in the document}}{\text{Number of } N\text{-grams in the document}} \tag{3.6.1}$$
The inverse document frequency, expressed in equation 3.6.2, is the key factor in identifying the
signature N-gram. It is based on the fact that less frequent N-grams are more informative and important.
For an N-gram to be considered a signature N-gram of a document, it should not appear that often in
the other documents. Thus, a signature N-gram’s document frequency must be low, meaning its inverse
document frequency must be high.
$$\mathrm{idf}(N\text{-gram}) = \log_{10}\left(\frac{\text{Number of documents}}{\text{Number of documents containing the } N\text{-gram}}\right) \tag{3.6.2}$$
The tf-idf is the product of these two frequencies. For an N-gram to have a high tf-idf in a document, it
must appear several times in that document and be rare or absent in the other documents, which makes
it a signature N-gram of the document.
Returning to the previous example, the term frequency for each document is given in table 3.5, as
expected. Computing the idf terms yields the results shown in table 3.6. With both elements available,
the tf-idf weights are given in table 3.7. From these we can conclude that the unigrams that best
describe the first sentence are ’program’ and ’Vanessa’. For the second sentence, there is a tie for the
signature unigram between ’go’ and ’concerts’. Another consideration is the sparsity of the feature
vectors: the vectors generated by either the BoW or the tf-idf technique are sparse.
Table 3.5: Term Frequency of each unigram to its corresponding sentence for the dummy example.
      ’John’  ’likes’  ’to’   ’program’  ’Vanessa’  ’also’  ’go’   ’concerts’
s1    0.11    0.22     0.22   0.22       0.11       0.11    0      0
s2    0.14    0.14     0.29   0          0          0.14    0.14   0.14
Table 3.6: Inverse Document Frequency (idf) of each unigram for the dummy example.
      ’John’  ’likes’  ’to’   ’program’  ’Vanessa’  ’also’  ’go’   ’concerts’
idf   0       0        0      0.3        0.3        0       0.3    0.3
Table 3.7: Term Frequency - Inverse Document Frequency of each unigram to its corresponding sentence for the dummy example.
      ’John’  ’likes’  ’to’   ’program’  ’Vanessa’  ’also’  ’go’    ’concerts’
s1    0       0        0      0.066      0.033      0       0       0
s2    0       0        0      0          0          0       0.042   0.042
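Equations 3.6.1 and 3.6.2 can be applied to the dummy example with the sketch below. It assumes unigrams and whitespace tokenization, and the function names are hypothetical. The code works with unrounded values, so its results can differ in the last digit from tables built from rounded intermediate values.

```python
import math
from collections import Counter


def term_frequency(document: str) -> dict:
    """Equation 3.6.1: occurrences of each unigram over the document length."""
    tokens = document.split()
    counts = Counter(tokens)
    return {token: count / len(tokens) for token, count in counts.items()}


def inverse_document_frequency(documents: list) -> dict:
    """Equation 3.6.2: log10 of (number of documents / documents containing the unigram)."""
    document_frequency = Counter()
    for document in documents:
        document_frequency.update(set(document.split()))
    n_documents = len(documents)
    return {token: math.log10(n_documents / df) for token, df in document_frequency.items()}


def tf_idf(documents: list) -> list:
    """Product of tf and idf for every unigram of every document."""
    idf = inverse_document_frequency(documents)
    return [
        {token: tf * idf[token] for token, tf in term_frequency(document).items()}
        for document in documents
    ]


documents = [
    "John likes to program Vanessa also likes to program",
    "John also likes to go to concerts",
]
weights = tf_idf(documents)
```

Unigrams that occur in both documents, such as ’John’ or ’to’, get an idf of zero and therefore a tf-idf of zero, which is why only the document-specific unigrams stand out.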
Chapter 4
Methodology
The purpose of chapters 2 and 3 was to give some insight into the methods, from a theoretical point of
view, and their importance. The goal of the present chapter is to explain how these methods were used
and how they are connected. An overview of the methodology is presented in figure 4.1.
Figure 4.1: Overview of the Methodology steps.
4.1 Data Preprocessing
In this work, focus was given to textual data preprocessing. However, a preprocessed dataset with
numerical data was also used. The numerical data consists of vital signs and demographic information
collected during triage. In this dataset the outliers were treated based on the vital signs ranges
[97–99], and missing data were imputed with the mean value of the triaged population for each
variable.
The textual data consists of the main chief complaint, which notes the reason the patient was seen.
This clinical note is written by health care professionals to communicate the status of a single patient to
other health care professionals or to themselves. Because of the particularities of clinical notes, a few
issues were detected that must be tackled in order to improve data quality:
• The clinical notes do not have the structure of typical written text (lack of syntactic and semantic
structure);
• Each clinical note is given by short sentences;
• The notes have several technical and non-technical abbreviations. E.g., ’msd’ meaning ’membro
superior direito’ (’right upper limb’) and ’2af’ meaning ’segunda-feira’ (’Monday’);
• The notes are filled with specialized technical terms;
• Context is very important for the meaning of a word;
• There are many numerical values with respect to vital signs;
• There are several ways of representing the same information;
• There are many misspelled words;
• There is a huge number of joined words. E.g., ’textualdata’.
Given the number of noise sources, it is extremely important to develop a Natural Language
Processing (NLP) system capable of filtering the noise, so that data quality improves, leading to a better
data mining process. The flowchart of the chosen sequence of preprocessing steps is depicted in figure
4.2, where purple represents the initial and final data, the decision blocks are in grey, red
corresponds to the tokenization process, yellow depicts the word correction step, and the blocks in
blue illustrate the different text normalization steps.
Figure 4.2: Overview of the Natural Language Processing framework used to improve the quality of the main chief complaint data.
4.1.1 Text Normalization
As explained in section 3.4, this step is of great importance for reducing sparseness in the data,
since reducing data sparsity results in less complex models.
The normalization process starts with lowercasing all main chief complaints, since lowercasing does
not affect the healthcare sublanguage. The lowercasing step is followed by the normalization of temporal
references. This step consists of detecting dates and references to months, weeks, days,
hours, minutes and seconds. Starting with the normalization of dates, there are some aspects to be
considered:
• A date can be given by the day, month and year;
• A date can be given by the day and month;
• A date can be given by the month and year;
• A date can have spaces between its components and the respective delimiter. E.g., ’02 -mar-17’
instead of ’02-mar-17’;
• The day can have a ’0’ before the day number (only when the day number is given by one digit);
• The month can be represented either by its month number (month numbers with only one digit
occasionally have a preceding ’0’), by its month name, or by an abbreviation of the month name.
E.g., ’8’, ’08’, ’august’, ’aug’;
• The year can be represented either by a four-digit number or by a two-digit number. E.g., ’2019’,
’19’.
Given the existing variability in the representation of a simple date some assumptions were made:
• A healthcare provider always uses the same delimiter within the same date. E.g., ’14-5-2017’,
’14/5/2017’, ’14.5.2017’;
• A healthcare provider may change the type of delimiter between dates during the typing process.
E.g., ’patient operated on clavicle on 12-3-2015 and refers removing tonsils on 4.7.17’;
• The date follows the format dd/mm/yyyy (the ’/’ delimiter is used just
as an example);
• Only years from ’2000’ to ’2016’ are considered;
• Only the ’-’, ’/’, and ’.’ characters are considered as delimiters.
In order to find date matches, regular expressions (RE) were used. For simplicity, only the RE
developed for the ’-’ delimiter is illustrated, since the RE for the other delimiters are obtained by
simply replacing it. An overview of the different steps when normalizing dates is presented in figure 4.3.
Figure 4.3: Normalization sequence steps when dealing with dates.
Given that spaces may be present in a date, it is necessary to handle them, since the number of
possible combinations of spaces between date components is high, e.g., ’02- 02-2015’, ’02- 02 -2015’,
’02- 02 - 2015’, etc. In order to deal with this characteristic when evaluating the existence
of a date in a string, the RE REspacing = (?:(\s)*) was developed. The framework starts with the
normalization of complete dates, where the RE used is given by:
REDay REspacing \- REspacing REMonth REspacing \- REspacing REYear
The expressions for REDay, REMonth and REYear are provided in table 4.1. The RE developed
for the days is a non-capturing group, characterized by ’?:’, with three alternatives. The first
one, ’0?[1-9]’, matches a single digit between 1 and 9, ’[1-9]’, whether or not a zero
precedes it, ’0?’. The second alternative, ’[12][0-9]’, matches days that start with a one or a
two followed by any digit in the range 0 to 9, i.e., every day between 10 and 29. The third alternative,
’[3][01]’, matches days that start with a three followed by a zero or a one, i.e., days 30 and
31. The same strategy was used for the months and years. The RE for the months
is also a non-capturing group, with 27 alternatives: the first two, ’0?[1-9]’ and ’1[0-2]’, capture
months given by their numerical value, and the remaining 25 alternatives deal with months
expressed by either their name or an abbreviation of it. The RE for the years is
also a non-capturing group, with four alternatives, where the first two, ’200[0-9]’ and ’201[0-6]’,
deal with years represented as a four-digit number and the remaining two, ’0?[1-9]’ and
’1[0-6]’, deal with years given by two digits.
Table 4.1: Regular Expressions (RE) used to capture different date components when date is given by a day, a month and a year.
           Regular Expression (RE)
REDay      (?:0?[1-9]|[12][0-9]|[3][01])
REMonth    (?:0?[1-9]|1[0-2]|janeiro|jan|fevereiro|fev|marco|mar|abril|maio|junho|julho|jun|jul|julh|agosto|ago|setembro|set|setem|outubro|out|novembro|nov|dezembro|dez)
REYear     (?:200[0-9]|201[0-6]|1[0-6]|0?[1-9])
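As an illustration, the complete-date RE can be assembled in Python from the three components of table 4.1 and the spacing RE. This is a sketch rather than the code used in this work: the constant and function names are hypothetical, the patterns are copied as they appear in the table (including ’marco’ without cedilla), and the text is assumed to be already lowercased.

```python
import re

# Components copied from table 4.1 and from the spacing RE.
RE_SPACING = r"(?:(\s)*)"
RE_DAY = r"(?:0?[1-9]|[12][0-9]|[3][01])"
RE_MONTH = (
    r"(?:0?[1-9]|1[0-2]|janeiro|jan|fevereiro|fev|marco|mar|abril|maio|junho|julho|"
    r"jun|jul|julh|agosto|ago|setembro|set|setem|outubro|out|novembro|nov|dezembro|dez)"
)
RE_YEAR = r"(?:200[0-9]|201[0-6]|1[0-6]|0?[1-9])"


def complete_date_pattern(delimiter: str = "-"):
    """Build the complete-date RE for one delimiter ('-', '/' or '.')."""
    d = re.escape(delimiter)
    return re.compile(
        RE_DAY + RE_SPACING + d + RE_SPACING + RE_MONTH + RE_SPACING + d + RE_SPACING + RE_YEAR
    )


def find_complete_dates(text: str, delimiter: str = "-") -> list:
    """Return every substring of text that matches a complete date."""
    return [match.group(0) for match in complete_date_pattern(delimiter).finditer(text)]
```

Because the components carry no word boundaries, a date is found even when it is joined to neighbouring words, as in ’today25- 4-2016is’. Passing another delimiter reuses the same components, which mirrors the remark that only the delimiter needs to be replaced.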
In the second step, incomplete dates given only by month and year are analyzed. In order to
avoid wrong normalizations due to the overlap between the two types of incomplete dates, i.e., strings
given by any number from 1 to 12, a delimiter, and any number between 10 and 12, the normalization
only occurs if at least one of the following scenarios holds:
• Scenario 1 - The month is given by its month name, or a corresponding abbreviation, and is
followed by a delimiter and a number with either two digits, from 01 to 16, or four digits, from 2000
to 2016;
• Scenario 2 - The year is given either by a four-digit number, from 2000 to 2017, or by a two-digit
number between 13 and 16, and is preceded either by the month number, between 1 and 12, or by
the month name or a corresponding abbreviation.
When dealing with this type of incomplete date, the RE is given by:
REMonth REspacing \- REspacing REYear
The expressions for REMonth and REYear are given in table 4.2 for each scenario.
Table 4.2: Regular Expressions (RE) used to capture different incomplete date components when date is given by a month and a year.
                      Regular Expression (RE)
Scenario 1  REMonth   (?:janeiro|jan|fevereiro|fev|marco|mar|abril|maio|junho|julho|jun|julh|agosto|ago|setembro|set|setem|outubro|out|novembro|nov|dezembro|dez)
            REYear    (?:200[0-9]|201[0-6]|1[0-6]|0?[1-9]|)
Scenario 2  REMonth   (?:0?[1-9]|1[12]|janeiro|jan|fevereiro|fev|marco|mar|abril|maio|junho|julho|jun|julh|agosto|ago|setembro|set|setem|outubro|out|novembro|nov|dezembro|dez)
            REYear    (?:200[0-9]|201[0-6]|1[3-6])
The final step is the normalization of incomplete dates given by the day and the month. As in the
previous step, the normalization of this type of date only occurs if one of the following scenarios
holds:
• Scenario 1 - The month is given by its month name, or a corresponding abbreviation, and is
preceded by a delimiter and a number with one or two digits, from 1 to 31;
• Scenario 2 - The day is given by a number within the [13−31] range and is followed either by a
month number, ranging from 1 to 12, or by the month name or a corresponding abbreviation.
For this type of format the RE is given by REDay REspacing \- REspacing REMonth, where REDay and
REMonth are given in table 4.3.
Table 4.3: Regular Expressions (RE) used to capture different incomplete date components when date is given by a day and a month.
                      Regular Expression (RE)
Scenario 1  REMonth   (?:janeiro|jan|fevereiro|fev|marco|mar|abril|maio|junho|julho|jun|julh|agosto|ago|setembro|set|setem|outubro|out|novembro|nov|dezembro|dez)
            REDay     (?:3[01]|[12][0-9]|0?[1-9]|)
Scenario 2  REMonth   (?:0?[1-9]|1[0-2]|janeiro|jan|fevereiro|fev|marco|mar|abril|maio|junho|julho|jun|julh|agosto|ago|setembro|set|setem|outubro|out|novembro|nov|dezembro|dez)
            REDay     (?:3[01]|2[0-9]|1[3-9])
The normalized date structure at the end of each step is presented in table 4.4. The rules of the
Gregorian calendar were also taken into consideration during the normalization steps, in order
to make the normalization process more robust. Matching strings that did not satisfy the rules of this
calendar were removed from the chief main complaint, since there is no easy way to know what the
intended date was. E.g., ’31-4’ is an impossible date, but was it supposed to be ’31-3’, ’31-5’,
or ’30-4’?
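One simple way to test a matched date against the Gregorian calendar is to let Python's datetime module attempt to build it, as in the sketch below. The function name is hypothetical. The handling of dates without a year is an assumption made here for illustration: a leap year is used so that ’29-2’ is not rejected.

```python
from datetime import date
from typing import Optional


def is_valid_gregorian(day: int, month: int, year: Optional[int] = None) -> bool:
    """Return True when day/month(/year) is a possible Gregorian date."""
    # Assumption: without a year, check against leap year 2016 so that 29-2 passes.
    reference_year = year if year is not None else 2016
    try:
        date(reference_year, month, day)
    except ValueError:
        return False
    return True
```

With this check, ’31-4’ is flagged as impossible and can be removed from the text, while ’30-4’ is accepted.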
Let’s assume that the following sentence is being processed: ’today25- 4-2016is the day of Ana’s
birthday. 25-4 is also the liberty day of Portugal’. The first date in this sentence is complete, has
spaces between its elements, and some of those elements are joined with other substrings. The benefit
of the developed RE is that the date can still be matched, due to the non-capturing groups strategy.
Firstly, after the spaces around the date delimiters are removed, the string becomes ’today25-4-2016is
the day of Ana’s birthday. 25-4 is also the liberty day of Portugal’. Secondly, since there is a match
for a complete date, it is normalized and the outputted string is ’today 25abril2016 is the day of Ana’s
birthday. 25-4 is also the liberty day of Portugal’. Thirdly, the string fails to comply with both
scenarios for incomplete dates given by a month and a year, so it is considered not to contain any.
Finally, the string contains an incomplete date given by a day and a month, since scenario 2
is verified. The output of this step is ’today 25abril2016 is the day of Ana’s birthday. 25abril is also
the liberty day of Portugal’.
Table 4.4: Outputted normalized date format for each normalization step.
                                 Normalized Date Structure
Complete Date                    dd + month name + yyyy
Incomplete Date - Month/Year     month name + yyyy
Incomplete Date - Day/Month      dd + month name
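The complete-date row of table 4.4 can be illustrated with the simplified sketch below. It is not the implementation used in this work and makes several labelled assumptions: only the ’-’ delimiter, numeric months and four-digit years are handled, the day is zero-padded to two digits, the normalized date is surrounded by spaces so that it is separated from joined words, and month names are written without accents as in the tables above. All names are hypothetical.

```python
import re

# Month names without accents, as they appear in the RE tables.
MONTH_NAMES = [
    "janeiro", "fevereiro", "marco", "abril", "maio", "junho",
    "julho", "agosto", "setembro", "outubro", "novembro", "dezembro",
]

# Simplified complete-date RE: numeric day, numeric month, four-digit year.
COMPLETE_DATE = re.compile(
    r"(0?[1-9]|[12][0-9]|3[01])\s*-\s*(0?[1-9]|1[0-2])\s*-\s*(200[0-9]|201[0-6])"
)


def normalize_complete_dates(text: str) -> str:
    """Rewrite complete dates as dd + month name + yyyy."""
    def replace(match):
        day, month, year = int(match.group(1)), int(match.group(2)), match.group(3)
        # Assumption: pad with spaces so the date is detached from joined words.
        return f" {day:02d}{MONTH_NAMES[month - 1]}{year} "

    normalized = COMPLETE_DATE.sub(replace, text)
    # Collapse the whitespace introduced by the padding.
    return re.sub(r"\s+", " ", normalized).strip()
```

Applied to the example sentence, the sketch rewrites only the complete date and leaves ’25-4’ untouched, since incomplete dates belong to the later steps.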
After dealing with dates, the second submodule of the temporal references normalization step is
executed. The goal at this stage is to normalize all references to months, weeks, days, hours, minutes
and seconds. It is important to detect these components in a string because of what is associated
with them, namely an event. In the healthcare sublanguage, the event can be anything, such as an
operation, the duration of a convulsion, the time when medication was taken, etc. There are some aspects
that need to be taken into account:
• References to hours may be given only by its hour or by its hour and minute;
• Hours and minutes have different types of delimiters;
• Hours and minutes may have spaces between components;
• These temporal references may be expressed by abbreviations or words with typographical errors;
• These references may be depicting time intervals or the exact moment of the event. E.g., ’took the
medicine at 11:00 a.m.’ and ’took the medicine 1 hour ago’ have different meanings.
Since the vast majority of these references deal with hours and minutes, only these are described
here; the remaining types of references follow the same approach,
which consists of several RE listed in table 4.5. The first step of the framework is the detection
of references depicting time intervals. In order to disambiguate between this type of reference and
the ones that represent the moment of the event, non-capturing groups were used containing some of
the most commonly used words within this scenario, along with some abbreviations and misspellings of
those words. Starting with references containing both hour and minute components, the RE can be
summarized as the concatenation of the following RE:
REDisambiguation REspacing REHours REspacing REHours Delimiter
REspacing REMinutes REspacing REMinutes Delimiter
REDisambiguation is responsible for detecting the most common words used in the studied sublanguage
when the temporal reference describes a time interval, where + imposes that it must appear at least
once. REspacing deals with whitespace between the components, REHours detects hours in the range
[0−24], REMinutes detects the minutes component from 0 to 59, REHours Delimiter detects the most
frequent delimiters when time is given by hours and minutes, and must match
at least once, and REMinutes Delimiter detects the references to minutes, although it does not need
to appear in the string since it is sometimes omitted. E.g., ’08h20’.
After the references given by hours and minutes are normalized, the remaining
temporal references are normalized. The general RE is given by:
REDisambiguation REspacing REi REspacing REi Delimiter
i ∈ {Months, Weeks, Days, Hours, Minutes, Seconds}
Table 4.5: Regular Expressions (RE) used to match temporal references given by months, weeks, days, hours, minutes and seconds.
                       Regular Expression (RE)
REHours                (?:2[0-4]|1[0-9]|0?[0-9])
REMinutes              (?:[1-5][0-9]|0?[0-9])
RESeconds              (?:[1-5][0-9]|0?[0-9])
REDays                 (?:3[01]|[1-2][0-9]|0?[0-9])
REWeeks                (?:2[0-4]|1[0-9]|0?[0-9])
REMonths               (?:1[0-2]|0?[0-9])
REHours Delimiter      (?:(:|h|h:|hs|hh|hora|horas|he|horae|horase)+)
REMinutes Delimiter    (?:(m|min|minuto|minutos)?)
RESeconds Delimiter    (?:(segundo|segundos)?)
REDays Delimiter       (?:(d|dia|dias)?)
REWeeks Delimiter      (?:(semanas|semana|sema|sem)?)
REMonths Delimiter     (?:(meses|mes|mes)?)
REDisambiguation       (?:(ha|ha|ha|a|a|a)+)
REspacing              (?:(\s)*)
Afterwards, the goal is to normalize temporal references that describe the moment when the event
occurred. The only difference between the RE developed for this scenario and the previous one is that
the term REDisambiguation does not exist in this one.
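The two kinds of temporal reference can be told apart with the sketch below, which assembles the RE for hours and minutes from the components of table 4.5. It is an illustration under stated assumptions, not the code used in this work: named groups are added for convenience, the disambiguation alternatives are reduced to ’ha’ and ’a’ because the variants listed in the table are indistinguishable in this transcript, and all Python names are hypothetical.

```python
import re

RE_SPACING = r"(?:(?:\s)*)"
RE_DISAMBIGUATION = r"(?:(?:ha|a)+)"
# Named groups are an addition for convenience; the alternatives follow table 4.5.
RE_HOURS = r"(?P<hours>2[0-4]|1[0-9]|0?[0-9])"
RE_MINUTES = r"(?P<minutes>[1-5][0-9]|0?[0-9])"
RE_HOURS_DELIMITER = r"(?:(?::|h|h:|hs|hh|hora|horas|he|horae|horase)+)"
RE_MINUTES_DELIMITER = r"(?:(?:m|min|minuto|minutos)?)"

# Time interval given in hours, e.g. 'ha 1h' (general RE with i = Hours).
INTERVAL_HOURS = re.compile(
    RE_DISAMBIGUATION + RE_SPACING + RE_HOURS + RE_SPACING + RE_HOURS_DELIMITER
)
# Moment of the event given by hours and minutes, e.g. '11:00' or '08h20'.
MOMENT_HOURS_MINUTES = re.compile(
    RE_HOURS + RE_SPACING + RE_HOURS_DELIMITER + RE_SPACING
    + RE_MINUTES + RE_SPACING + RE_MINUTES_DELIMITER
)


def extract_interval_hours(text: str):
    """Return the number of hours of a time-interval reference, or None."""
    match = INTERVAL_HOURS.search(text)
    return int(match.group("hours")) if match else None


def extract_moment(text: str):
    """Return (hour, minute) of a moment-of-event reference, or None."""
    match = MOMENT_HOURS_MINUTES.search(text)
    if match is None:
        return None
    return int(match.group("hours")), int(match.group("minutes"))
```

A sentence such as ’tomou o medicamento ha 1h atras’ is caught by the interval pattern, while ’tomou o medicamento as 11:00’ is only caught by the moment pattern.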
The previous paragraphs explained how to detect whether a string contains a substring that matches a
temporal reference. The normalization process consists of obtaining the time of the event, named
t_Event, and it differs depending on whether the information is given as a time interval or as the
moment of the event. If a time interval, ∆t, is present in the extracted information, it is subtracted
from the moment when the main chief complaint was created, t_Creation, as shown in equation 4.1.1.
The hour of the event is extracted from the resulting time and mapped to the day period it belongs to.
If the moment of the event is present in the extracted information, the extracted hour is used directly
to obtain the day period in which the event occurred. The different
partitions of day periods are given in table 4.6.
Suppose that the sentence being processed is ’tomou o medicamento as 11:00’ (’took the medicine at
11:00’). This sentence fails all matches
of temporal references depicting time intervals. During the next step it matches the first RE, since
this reference has both hour and minute as elements. The outputted normalized sentence is thus ’tomou o
medicamento as manha’ (’took the medicine at morning’). Suppose now that the sentence is modified
to ’tomou o medicamento ha 1h atras’ (’took the medicine 1h ago’) and that this main chief complaint
was created on ’25-4-2017 12:00’. This sentence matches during the detection of temporal references
describing time intervals, and the time at which the event happened is calculated as 12 − 1 = 11,
an hour within the morning range. This sentence does not match the second set of RE, since it
was already treated. The outputted sentence is thus ’tomou o medicamento manha’ (’took the medicine
morning’).
Table 4.6: Partition of the day into six-hour day periods.
Day Period    Range of Hours
Dawn          00:00 - 05:59
Morning       06:00 - 11:59
Evening       12:00 - 17:59
Night         18:00 - 23:59
tEvent = tCreation −∆t (4.1.1)
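The mapping from a temporal reference to a day period can be sketched as follows. This is a minimal illustration and not the code developed in this work: the function names `day_period` and `event_period` are hypothetical, the extraction of ∆t or of the hour by the REs is assumed to have happened already, and English labels are used instead of the Portuguese terms.

```python
from datetime import datetime, timedelta

# Day periods of table 4.6 as (label, first hour, last hour). English labels are an
# assumption made for readability; the thesis output uses Portuguese terms.
DAY_PERIODS = [("dawn", 0, 5), ("morning", 6, 11), ("evening", 12, 17), ("night", 18, 23)]


def day_period(hour):
    """Map an hour of the day (0-23) to its day period."""
    for label, first, last in DAY_PERIODS:
        if first <= hour <= last:
            return label
    raise ValueError(f"hour out of range: {hour}")


def event_period(t_creation, delta_t=None, event_hour=None):
    """Return the day period of the event.

    delta_t: time interval extracted from the text (e.g. '1h ago'), if any.
    event_hour: hour extracted from the text (e.g. '11:00'), if any.
    """
    if delta_t is not None:
        t_event = t_creation - delta_t  # equation 4.1.1
        return day_period(t_event.hour)
    if event_hour is not None:
        return day_period(event_hour)
    raise ValueError("no temporal information was extracted")


created = datetime(2017, 4, 25, 12, 0)
print(event_period(created, delta_t=timedelta(hours=1)))  # 12:00 minus 1h is 11:00
print(event_period(created, event_hour=11))
```

Both calls reproduce the two examples given above, where the event falls in the morning range.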
The next step of the normalization process deals with the abbreviation expansion technique. When
this step happens one needs to take into account that tokenization has already been done. In order to
know which abbreviations exist in the data, all of the vocabulary was analyzed manually since:
• The majority of the abbreviations are very technical. E.g., ’tce’ stands for ’traumatismo
cranioencefalico’ (’cranioencephalic traumatism’);
• Different abbreviations have the same meaning. E.g., ’dro’ and ’dro’ stand for ’doutor ’ (’doctor ’);
• The same abbreviation may have different meanings depending on its context. E.g., ’dm2’ can
stand for ’diabetes mellitus tipo 2’ (’type 2 diabetes mellitus’) or ’decímetro quadrado’ (’square
decimeter ’).
In order to solve these problems, extensive research was done to understand the meaning of those
abbreviations, and the context was studied to detect patterns in the different main chief complaints that
allow the generation of rules to overcome the ambiguity. These rules work by analyzing the neighbouring
tokens. E.g., regarding the ’dm2’ abbreviation: if the previous token is of type integer or float, then
’dm2’ stands for ’decímetro quadrado’. Otherwise it stands for ’diabetes mellitus tipo 2’.
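The neighbour-based rule for ’dm2’ can be sketched as below. This is only an illustration under stated assumptions: the function name `expand_dm2` and the `NUMBER` pattern (digits with an optional decimal part separated by a point or a comma) are hypothetical, and the expansions are written without diacritics.

```python
import re

# Integer or float token; the accepted number format is an assumption.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def expand_dm2(tokens):
    """Expand the ambiguous abbreviation 'dm2' by looking at the previous token."""
    expanded = []
    for i, token in enumerate(tokens):
        if token != "dm2":
            expanded.append(token)
        elif i > 0 and NUMBER.fullmatch(tokens[i - 1]):
            # A preceding number suggests a unit of area.
            expanded.extend(["decimetro", "quadrado"])
        else:
            expanded.extend(["diabetes", "mellitus", "tipo", "2"])
    return expanded


print(expand_dm2(["queimadura", "3", "dm2"]))
print(expand_dm2(["doente", "dm2"]))
```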
The final technique that was used to normalize the textual data consists of stemming each word in
the vocabulary using the RSLP Stemmer.
4.1.2 Tokenization
In this work sentences were tokenized using a single RE composed of several alternatives. Since
there are misspelled and joined words, some considerations were made taking into account the four
operations described at the beginning of section 3.3:
• A word has a maximum of one typo, for computational purposes;
• The insert operation is not considered since there are joined words;
• The QWERTY keyboard configuration was taken into account to reduce the number of possible
typos.
Given this, each word in the developed corpus for this work was subject to deletion, transposition,
and replacing operations with the QWERTY keyboard constraints in order to obtain all of the unique
words correctly spelled and the ones with only one typo. E.g., some of the results after these operations
on the word ’sol ’ (’sun’) are:
• Deletion - ’ol ’, ’sl ’;
• Replacing - ’dol ’, ’sll ’;
• Transposition - ’slo’, ’osl ’.
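The generation of single-typo variants can be sketched as follows. The keyboard model is an assumption made for illustration: two keys are treated as neighbours when they lie on the same or on adjacent rows of a QWERTY layout and are at most one key width apart, with each row shifted by half a key. The names `NEIGHBOURS`, `deletions`, `transpositions`, `replacements`, and `one_typo_variants` are hypothetical.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSETS = [0.0, 0.5, 1.0]  # assumed horizontal stagger of each row, in key widths


def build_neighbours():
    """Map each letter to the set of keys assumed to be adjacent to it."""
    position = {
        ch: (row, col + ROW_OFFSETS[row])
        for row, keys in enumerate(QWERTY_ROWS)
        for col, ch in enumerate(keys)
    }
    return {
        a: {
            b
            for b, (rb, xb) in position.items()
            if b != a and abs(ra - rb) <= 1 and abs(xa - xb) <= 1.0
        }
        for a, (ra, xa) in position.items()
    }


NEIGHBOURS = build_neighbours()


def deletions(word):
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def transpositions(word):
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(len(word) - 1)}


def replacements(word):
    """Replace one letter by a neighbouring key only (QWERTY constraint)."""
    return {
        word[:i] + key + word[i + 1:]
        for i, ch in enumerate(word)
        for key in NEIGHBOURS.get(ch, ())
    }


def one_typo_variants(word):
    return (deletions(word) | transpositions(word) | replacements(word)) - {word}


print(sorted(deletions("sol")))
print(sorted(transpositions("sol")))
print("dol" in replacements("sol"), "sll" in replacements("sol"))
```

Restricting replacements to neighbouring keys keeps the number of variants per word small, which matters because every variant becomes an alternative of the tokenization RE.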
After obtaining all possible combinations of words with only one typographical error, the words from
the corpus and the abbreviations were added, since some of the sentences have substrings given by
joined words. These words may be any combination of abbreviations, correctly spelled words, or words
with typos. All of the normalized temporal references were also added in order to consider them as they
are. The alternatives of the RE consist of all those words, which were sorted alphabetically and by
length so that the sentence is broken where it greedily matches the longest word.
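A minimal sketch of such a tokenizer is given below, assuming a small hypothetical vocabulary. Python's `re` module tries the alternatives of an alternation from left to right, so sorting the words by decreasing length makes the longest word win at each position. The names `build_tokenizer` and `tokenize` are illustrative only.

```python
import re


def build_tokenizer(vocabulary):
    """Compile one RE whose alternatives are the vocabulary words, longest first."""
    alternatives = sorted(set(vocabulary), key=lambda w: (-len(w), w))
    return re.compile("|".join(re.escape(w) for w in alternatives))


def tokenize(sentence, pattern):
    # Characters not covered by any alternative (e.g. whitespace) are skipped.
    return pattern.findall(sentence)


# Hypothetical vocabulary: corpus words, abbreviations, and very short words.
vocabulary = {"doente", "queixosa", "dif", "resp", "hospital", "d", "a", "luz"}
pattern = build_tokenizer(vocabulary)
print(tokenize("doente queixosa difresp hospitald aluz", pattern))
```

The joined substrings ’difresp’, ’hospitald’, and ’aluz’ are split into known words, which also shows how short tokens such as ’d’ and ’a’ can appear as noise.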
4.1.3 Word Correction
The Portuguese language has some peculiarities that need attention. It is very common
to have words with diacritical marks, e.g. acute accent, circumflex, cedilla, among other diacritics.
Some of the typos are due to not using this type of marks, e.g., ’pesames’ instead of ’pêsames’
(’condolences’), which yields a Jaro-Winkler distance of djw(s1, s2) = 0.186. Given this, and in order to enhance
this technique in situations where such typos happen, the word comparison function, given by equation 4.1.2, also
takes into account words in their ASCII encoding representation. By doing this, the Jaro-Winkler distance
when the ASCII string is used is djw(s1 = ’pesames’, s2 = ASCII(’pêsames’)) = 0.0, yielding a total
Jaro-Winkler distance of 0.5 · 0.186 + 0.5 · 0 = 0.093. The weights were chosen in order to maximize the
minimum similarity value between strings for the threshold, where s1 describes the string being evaluated
and s2 corresponds to the string from the corpus being used. The maximum Jaro-Winkler distance
that is considered between two strings is 0.15. The pseudocode 1 illustrates the developed spellchecker
where the Compare function is given by equation 4.1.2.
$$d_{jw}(s_1, s_2) = 0.5\,\bigl(1 - sim_{jw}(s_1, s_2)\bigr) + 0.5\,\bigl(1 - sim_{jw}(s_1, \mathrm{ASCII}(s_2))\bigr) \quad (4.1.2)$$
Algorithm 1: SpellChecker to correct misspelled words
Input: s : String being evaluated
Input: C : List of Words
Output: s : Corrected String
Initialize Dictionary of Corrected Words, CW = dict(); previous similarity = 0
for each word ∈ C do
    if s ∈ C then
        s ← s
        Output s
    else if s ∈ CW then
        s ← CW[ s ]
        Output s
    else
        similarity = Compare(s, word)
        if (similarity ≤ 0.15) ∧ (similarity > previous similarity) then
            s ← word
            previous similarity ← similarity
            auxiliary ← True
        end
    end
end
if auxiliary then
    CW[ s ] ← s
    Output s
else
    s ← s
    Output s
end
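A runnable sketch of a spellchecker in the spirit of Algorithm 1 is given below. It is not the implementation developed in this work and it makes several assumptions that are labelled in the code: the Jaro-Winkler similarity is the common textbook variant (prefix scale 0.1, prefix length up to 4), so the numeric values may differ from the ones reported above; the candidate with the smallest distance within the threshold is selected; and corrections are cached under the original misspelled string. All function names are hypothetical.

```python
import unicodedata


def jaro(s1, s2):
    """Jaro similarity in [0, 1] (textbook variant, an assumption)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters appearing in a different order count as half transpositions.
    k = out_of_order = 0
    for i, ch in enumerate(s1):
        if used1[i]:
            while not used2[k]:
                k += 1
            out_of_order += ch != s2[k]
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (up to 4)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def to_ascii(s):
    """Strip diacritics, e.g. 'pêsames' becomes 'pesames'."""
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")


def compare(s1, s2):
    """Distance of equation 4.1.2: average over the raw and the ASCII form of s2."""
    return 0.5 * (1 - jaro_winkler(s1, s2)) + 0.5 * (1 - jaro_winkler(s1, to_ascii(s2)))


def spell_check(s, corpus, corrected, max_distance=0.15):
    """Return s, a cached correction, or the closest corpus word within the threshold."""
    if s in corpus:
        return s
    if s in corrected:
        return corrected[s]
    best_word, best_distance = None, max_distance
    for word in corpus:
        distance = compare(s, word)
        if distance <= best_distance:  # assumption: keep the closest candidate
            best_word, best_distance = word, distance
    if best_word is None:
        return s
    corrected[s] = best_word  # assumption: cache under the misspelled string
    return best_word


corpus = ["pêsames", "amenorreia", "doente"]
cache = {}
print(spell_check("pesames", corpus, cache))
print(spell_check("amnorreia", corpus, cache))
print(spell_check("xyz", corpus, cache))
```

Averaging the distance over the raw and the ASCII form of the corpus word halves the penalty of a missing diacritic, which is the effect motivated in the text.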
4.1.4 Word Removal
Despite the lack of good grammar in this type of data, the healthcare provider sometimes uses
linkers and connectors in a sentence, which in the field of NLP are called stopwords. Since these
stopwords do not provide any information regarding the patient state, they were filtered from the chief
main complaint. It was also necessary to remove words with fewer than three characters due to some
noise coming from the tokenization process.
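The two filters can be sketched in a few lines. The stopword set below is a small hypothetical subset chosen for illustration, not the list used in this work, and the function name `remove_words` is likewise an assumption.

```python
# Hypothetical subset of Portuguese stopwords, for illustration only.
STOPWORDS = {"de", "do", "da", "em", "para", "com", "que"}


def remove_words(tokens, stopwords=STOPWORDS, min_length=3):
    """Drop stopwords and tokens with fewer than min_length characters."""
    return [t for t in tokens if len(t) >= min_length and t not in stopwords]


tokens = ["amenorreia", "hospital", "d", "a", "luz"]
print(remove_words(tokens))
```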
Example using the developed Natural Language Processing framework
In order to better understand all the steps an example is provided. The chief main complaint that is
going to be pre-processed is given by: ’DOENTE queixosa difresp dum17.04 abrilamnorreia Hospitald
aluz ’. The following listing summarizes the output of each step and comments are provided when
necessary:
• Lowercasing - ’doente queixosa difresp dum17.04 abrilamnorreia hospitald aluz ’;
• Temporal References Normalization - ’doente queixosa difresp dum 17abril abrilamnorreia
hospitald aluz ’. It verifies condition 2 for incomplete dates expressed by a day and a month and
since whitespacing was not considered as a delimiter the substring ’04 abril ’ would never match;
• Tokenization - [’doente’, ’queixosa’, ’dif ’, ’resp’, ’dum’, ’17abril ’, ’abril ’, ’amnorreia’, ’hospital ’,
’d ’, ’a’, ’luz ’]→ ’doente queixosa dif resp dum 17abril abril amnorreia hospital d a luz ’;
• Abbreviation Expansion - ’doente queixosa dificuldade respiratoria data ultima menstruacao
17abril abril amnorreia hospital d a luz ’;
• Word Correction - ’doente queixosa dificuldade respiratoria data ultima menstruacao 17abril
abril amenorreia hospital d a luz ’;
• Word Removal - ’doente queixosa dificuldade respiratoria data ultima menstruacao 17abril
abril amenorreia hospital luz ’ (’patient complaining respiratory difficulty date last menstrual period
17april amenorrhea hospital luz ’);
• Stemming - ’doent queix dificuldad respirator dat ult menstru 17abril abril amenorre hospit luz ’.
The developed NLP framework performs well in transforming the raw main
chief complaint into a textual document with meaning. The main disadvantage of this framework is that
during Tokenization some noise may be introduced in the data, as one can see in the previous example
with the tokens ’d’ and ’a’. This was the main reason for the creation of the Word Removal step. After
pre-processing all of the main chief complaints the data quality increased and the data is ready to be
used to develop machine learning models.
4.2 Data Modeling
The dataset was first shuffled and then sampled in a stratified fashion into a training set and a
testing set. Subsequently a set of pipelines, illustrated in figure 4.4, was developed in order to deal
with the set of predictors used for training and to help during the hyperparameter optimization step.
When the chief main complaint is one of the predictors it is necessary to extract features from the
text data using either BoW or tf-idf. The associated hyperparameters are the range of N-grams, i.e. whether
the vectorization takes into account only unigrams, bigrams, a combination of unigrams and bigrams, etc.,
and the size of the vocabulary to be considered, which filters out some of the infrequent n-grams and
results in less complex models. When vital signs were used as predictors the associated features were scaled
using the min-max normalization method, ranging from 0 to 1, so that all features share a common scale.
Feature scaling is suggested for the majority of the considered learning approaches;
moreover, the MNB and CNB algorithms do not accept negative values for the features. An input ’Grid
of parameters’ is illustrated in the different pipelines of figure 4.4, followed by a Uniform Distribution.
This is represented because a Random Search was used with 10-fold cross-validation to choose the
hyperparameters of the considered hypothesis space for the vectorization and learning strategies, with
the goal of maximizing the AUC-ROC score.
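The two preprocessing steps of the pipelines can be sketched with the standard library only. This is an illustration under assumptions, not the code used in this work: the names `ngrams`, `build_vocabulary`, `bag_of_words`, and `min_max_scale` are hypothetical, the vocabulary keeps the most frequent n-grams of the training documents, and the scaler is fitted on training values only.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """All n-grams of the token list for n in [n_min, n_max]."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(documents, ngram_range=(1, 2), max_features=None):
    """Keep the max_features most frequent n-grams (ties broken alphabetically)."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc.split(), *ngram_range))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_features]]


def bag_of_words(document, vocabulary, ngram_range=(1, 2)):
    """Count vector of the document over the vocabulary (BoW)."""
    counts = Counter(ngrams(document.split(), *ngram_range))
    return [counts[term] for term in vocabulary]


def min_max_scale(train_values, values):
    """Scale values to [0, 1] using the minimum and maximum of the training values."""
    low, high = min(train_values), max(train_values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


docs = ["dor abdominal", "dor lombar"]
vocab = build_vocabulary(docs, ngram_range=(1, 2), max_features=3)
print(vocab)
print(bag_of_words("dor lombar", vocab))
print(min_max_scale([36.0, 38.0, 40.0], [37.0]))
```

The n-gram range and the vocabulary size are exactly the two vectorization hyperparameters mentioned above; values outside the training range would fall outside [0, 1] with this scaler.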
[Figure 4.4 diagrams omitted. In every pipeline a Grid of Parameters is sampled through a Uniform
Distribution to give a Set of Parameters.]
(a) Pipeline using the Chief Main Complaint as Predictor. Blocks: Main Chief Complaint; Feature
Extraction (Vectorization); Machine Learning Algorithm.
(b) Pipeline using numerical and categorical variables as Predictors. Blocks: Continuous Numerical
Data; Min-Max Scaler; Categorical Data; Machine Learning Algorithm.
(c) Pipeline using the Chief Main Complaint, numerical, and categorical data as predictors. Blocks:
Main Chief Complaint; Feature Extraction (Vectorization); Continuous Numerical Data; Min-Max Scaler;
Categorical Data; Machine Learning Algorithm.
Figure 4.4: Representation of the used pipelines during the hyper-parameter optimization step with
respect to the used predictors.
There is a difference between a model’s parameters and its hyperparameters. Model parameters are
learned during training, while model hyperparameters are set by the designer before training and control
implementation aspects of the model. E.g., the weights learned during training of a LR model are
parameters, while the type of kernel of a SVM is a model hyperparameter. Hyperparameters can be thought
of as model settings that need to be tuned for each problem, since the best set of hyperparameters for
one particular dataset will not necessarily be the best across all datasets. The process of hyperparameter
optimization means finding the combination of hyperparameter values for a machine learning model that
enhances its performance. Random search was used as the strategy to tune the hyperparameters of the
algorithms since the chosen hypothesis space is considerably large and this approach usually finds an
optimal or nearly optimal set of hyperparameters in few iterations [100]. Random Search chooses the set
of hyperparameters by uniformly selecting the values of each hyperparameter from the specified grid.
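The sampling performed by Random Search can be sketched as follows. The grid, the scoring function `toy_score`, and the function names are hypothetical and serve only to make the example runnable; in this work the score would be the cross-validated AUC-ROC of the pipeline.

```python
import random


def sample_configurations(grid, n_iter, seed=0):
    """Draw n_iter configurations, choosing each value uniformly from its grid entry."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


def random_search(grid, score_fn, n_iter, seed=0):
    """Return the sampled configuration with the highest score."""
    best_params, best_score = None, float("-inf")
    for params in sample_configurations(grid, n_iter, seed):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and scoring function, for illustration only.
grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"], "ngram_max": [1, 2, 3]}


def toy_score(params):
    return -abs(params["C"] - 1.0) + 0.1 * params["ngram_max"]


print(random_search(grid, toy_score, n_iter=10))
```

Only `n_iter` configurations are evaluated, regardless of the size of the grid, which is what makes the strategy attractive for a large hypothesis space.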
In order to evaluate the performance of a learning model given a set of hyperparameters, k-fold cross
validation was used, since with this strategy it is not necessary to separate the training set into two
fixed sets, one for training and the other for validation, and the data is better used. Another advantage
of this technique is that it should result in models capable of generalizing beyond the training data,
since one avoids overfitting to a single and constant validation set. This approach, presented in figure 4.5,
consists of splitting the training set into k equally sized folds, where k−1 of those are used for training
and the remaining one for validation. The fold used for validation rotates until every fold has been used
once. When developing these folds one applied shuffling and stratified sampling.
[Figure 4.5 diagram omitted. Diagram labels: Dataset; Training Set; Testing Set; k-Fold; Train on k-1
stratified splits; Validation.]
Figure 4.5: Representation of the k-fold cross validation approach.
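The construction of shuffled, stratified folds can be sketched as below. This is a simplified illustration: the function name `stratified_k_fold` is hypothetical and the indices of each class are dealt to the folds in turn, so the folds are only approximately equal in size when the class counts are not multiples of k.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (training indices, validation indices) for each of the k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffling within each class
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # preserves class proportions
    for i in range(k):
        validation = sorted(folds[i])
        training = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield training, validation


labels = [1] * 10 + [0] * 90  # hypothetical imbalanced outcome
for training, validation in stratified_k_fold(labels, k=10):
    positives = sum(labels[i] for i in validation)
    print(len(training), len(validation), positives)
```

Every validation fold keeps the 10% share of the minority class of the hypothetical sample, which is the purpose of stratification under class imbalance.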
When sampling the dataset, stratification with respect to the dependent variable was considered. The
reason for this is the presence of class imbalance in the dataset, as explained in section
5.1. Since the performance of some of the considered algorithms, LR and SVM, can deteriorate with
class imbalance, a cost-sensitive learning strategy was applied. This approach works by making the
learning model aware of the imbalanced data by incorporating the weights of each class, computed as
shown in equation 4.2.1, into the objective function. The weight associated with the less frequent class is
larger than the weight of the other class since the weights are inversely proportional to the frequency
of classes in the training data. Given that the weights are incorporated into the objective function and
that the minority class has a larger weight than the most
frequent class, the learning model is highly penalized when it misclassifies observations corresponding
to the minority class.
$$w_{y_i} = \frac{\text{Number of Samples}}{\text{Number of Classes} \times \sum_{n=1}^{N} (y_i)} \quad (4.2.1)$$
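Equation 4.2.1 can be illustrated with a short sketch, reading the sum in the denominator as the number of training samples that belong to the class. The function name `class_weights` and the label vector are hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Weight of each class: n_samples / (n_classes * samples in that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [1] * 6 + [0] * 94  # hypothetical training labels with 6% of revisits
weights = class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the objective function, so errors on the minority class are no longer dominated by the majority class.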
When using the LR and SVM, the parameters of these learning models, i.e. their weights, are
computed during training. One resorted to local-based methods using CD and SGD in order to see the
influence of the optimization algorithm on the performance of the model [101]. It is important to note
that the developed SVM models used the linear kernel due to the size of the dataset and the number
of features, since the time complexity of a non-linear SVM ranges between O(n_features × n_samples^2) and
O(n_features × n_samples^3). The hyperparameters that were tuned for each of the learning algorithms are
listed in table 4.7.
Table 4.7: Hyperparameters for each learning strategy: Logistic Regression (LR) using Coordinate
Descent (CD) or Stochastic Gradient Descent (SGD), Multinomial Naive Bayes (MNB), Complement
Naive Bayes (CNB), and Support Vector Machine (SVM) with either CD or SGD.
Algorithms    Hyperparameters
LR CD         C: Inverse of Regularization Strength
              Regularization function: L1 or L2
LR SGD        Regularization function: L1 or L2
              λ: Regularization term
              Learning Rate schedule
              Number of iterations
MNB           α: Additive smoothing
CNB           α: Additive smoothing
SVM CD        C: Inverse of Regularization Strength
              Regularization function: L1 or L2
SVM SGD       Regularization function: L1 or L2
              λ: Regularization term
              Learning Rate schedule
              Number of iterations
4.3 Model Evaluation
After a model is trained it is necessary to know if it can describe the data and, most importantly, if
the model is capable of generalizing beyond the seen data. During both training and testing bootstrapping
was applied for each performance metric, and during testing McNemar’s Hypothesis test was added in
order to compare the error rates of the machine learning models. The cut-off probability which presented the
best separation between classes was also computed.
Given that during the hyperparameter optimization step the objective was to maximize the AUC-ROC
score, one resorted to Youden’s J index to compute the optimal probability threshold during
the validation step, since it guarantees a compromise between sensitivity and specificity [102]. For
each probability threshold value the Youden index is given by equation 4.3.1, and the optimal probability
threshold, considered as the cut-off probability, is the one that maximizes J, as stated in equation 4.3.2.
J = sensitivity + specificity − 1 (4.3.1)
max(J) (4.3.2)
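The search for the cut-off probability can be sketched as follows. The function name `youden_threshold`, the convention that a patient is predicted positive when the probability is greater than or equal to the threshold, and the toy data are assumptions; both classes must be present in `y_true`.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(y_prob)):
        true_pos = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
        true_neg = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
        j = true_pos / positives + true_neg / negatives - 1  # equation 4.3.1
        if j > best_j:  # equation 4.3.2
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical outcomes and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.35, 0.6, 0.65, 0.8, 0.9]
print(youden_threshold(y_true, y_prob))
```

In this toy sample the threshold 0.35 gives a sensitivity of 1 and a specificity of 0.75, the largest value of J among the candidate thresholds.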
In order to know the range of values for each performance metric, given a learning model,
bootstrapped 95% confidence intervals were computed during validation with 200 bootstrap samples, as
illustrated in figure 4.6a. Given the chosen percentage for the confidence interval, the percentiles
considered for the calculation of the lower and upper bounds are the α1 = 0.025 and α2 = 0.975 percentiles,
respectively. During testing, bootstrapping was also used to estimate 95% confidence intervals. The
model was subject to a resampled population in order to obtain its estimates, as shown in figure 4.6b.
For the computation of these confidence intervals, 200 bootstrap samples were also used.
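The percentile bootstrap used during testing can be sketched as below: the predictions of an already trained model are resampled with replacement and the metric is recomputed on each resample. The function names, the linear interpolation of percentiles, and the use of accuracy as the metric are assumptions made for illustration.

```python
import random


def percentile(sorted_values, q):
    """Percentile with linear interpolation, q in [0, 1]."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def bootstrap_ci(y_true, y_pred, metric, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval of a metric."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(metric([y_true[i] for i in sample], [y_pred[i] for i in sample]))
    scores.sort()
    return percentile(scores, alpha / 2), percentile(scores, 1 - alpha / 2)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test labels and predictions.
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0] * 20
y_pred = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0] * 20
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(lower, upper, upper - lower)
```

The difference between the upper and the lower bound is the uncertainty that is later used to compare candidate models. Metrics such as sensitivity require at least one positive in each resample, which holds in practice for large samples but is not guarded against here.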
When one has models that both represent the data and are capable of generalizing well, the following
question is raised: from the set of models, which one is the best? In this work one resorted to
McNemar’s Hypothesis test to help answer that question, by first concluding whether the models present
statistically significantly different error proportions. This test was employed only
during the testing step, and the null hypothesis and the alternative hypothesis are given by:
• H0 - There is no significant difference between the models’ error rates on the test set;
• H1 - The models have different error rates on the test set.
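A sketch of the test is given below. The variant is an assumption, since the text does not state which one was used: the chi-square statistic with continuity correction and one degree of freedom, whose p-value can be computed with the complementary error function. The function name `mcnemar` and the predictions are hypothetical.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """McNemar's test (chi-square with continuity correction). Returns (statistic, p-value)."""
    # b: model A is right and model B is wrong; c: model A is wrong and model B is right.
    b = sum(1 for y, a, bb in zip(y_true, pred_a, pred_b) if a == y and bb != y)
    c = sum(1 for y, a, bb in zip(y_true, pred_a, pred_b) if a != y and bb == y)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square survival, 1 dof
    return statistic, p_value


# Hypothetical predictions: A is right on 25 cases where B fails, B on 5 where A fails.
y_true = [1] * 30
pred_a = [1] * 25 + [0] * 5
pred_b = [0] * 25 + [1] * 5
statistic, p_value = mcnemar(y_true, pred_a, pred_b)
print(statistic, p_value, p_value < 0.05)
```

A p-value below the significance level of 5% leads to the rejection of H0, i.e. the two models have different error rates on the test set.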
There may be cases where the uncertainty associated with the performance metrics of a learning model,
computed by subtracting the lower bound from the upper bound of the bootstrapped 95% confidence
interval, and the result of McNemar’s Hypothesis test are insufficient to discriminate between
models. Given this, additional criteria were taken into account to accommodate the
limitations already described. In this work a model is selected by analyzing:
• Its performance;
• Its uncertainty, where lower is better;
• Generalization capability;
• Parsimony, in order to have less complex models.
[Figure 4.6 flowcharts omitted.]
(a) Bootstrapping strategy employed during the validation step. Flowchart blocks: Start; Population;
Resample; Training Sample; Testing Out of Bag Sample; Machine Learning Algorithm; Model; Save
Performance; Reached number of bootstrap iterations? (Yes: End; No: repeat).
(b) Bootstrapping strategy employed during the testing step. Flowchart blocks: Start; Population;
Resample; Machine Learning Model; Save Performance; Reached number of bootstrap iterations? (Yes:
End; No: repeat).
Figure 4.6: Representation of the used bootstrapping strategies during validation and testing steps.
Chapter 5
Results
5.1 Database Description
The original dataset comprises 850 189 entries of data from patients who went to the ED of Hospital
Beatriz Angelo, collected from 2012 to 2016. In order to have a clean dataset for further analysis, the
data was subject to a filtering process, as illustrated in figure 5.1, since:
• Only adults with age of at least 16 years old were considered for the study;
• Some inconsistencies were detected in the chief main complaint field, where the discriminator
was used instead;
• In some cases the chief main complaint field was not filled;
• Some adult patients had pediatric discriminators;
• Some of the patients in the sample were retriaged.
As already stated at the beginning of section 4.1, a dataset, already preprocessed, with numerical
and categorical variables was provided. Some of the provided features were engineered, e.g. the time
between a patient entering the ED and going to the triage stage, the number of exams that were made, the
number of outliers present in the data, etc., while others are standard features collected during triage,
e.g. temperature, main chief complaint, age, among others. Given this, a total of 40 variables were considered for
this study and are presented in table 5.1. This table is divided into three types of variables: Baseline
variables, which comprise vital signs, demographics, and dummy variables developed from the vital
signs; Text, which has the main chief complaint; and Additional variables, consisting of the priority according to
the Manchester Triage System, the arrival mode of the patient, triage discriminators, the Glasgow coma scale,
and engineered variables like the number of exams that were made, the number of outliers,
the hour, weekday, and month of the visit to the ED, the mean arterial blood pressure, the time that the
patient waited until being subject to triage, and the glycemia.
[Figure 5.1 flow diagram, summarized as text:]
Patients that went to the Emergency Department from 2012 to 2016 (n = 850189)
    Excluded (n = 238096): Patient Age < 16 years old
Adult cohort (n = 612093)
    Excluded (n = 100792):
        No chief main complaint (n = 2484)
        Chief main complaint field the same as the discriminator (n = 67156)
        Hospital Admission (n = 5328)
        ReTriageID specified (n = 8512)
        Patients with pediatric discriminators (n = 9)
        Intersect with the previously preprocessed numerical dataset (n = 17303)
Considered in the present study (n = 511301)
Figure 5.1: Flow diagram outlining the inclusion and exclusion criteria.
Table 5.1: Predictor variables and outcome used for modelling Hospital Beatriz Angelo (HBA) ED data.
Variables                                              Type          Range / Levels
Baseline
  Age (years)                                          Continuous    18 - 107
  Vital signs
    Respiratory Rate (RR) (breaths/min)                Continuous
    Heart Rate (HR) (beats/min)                        Continuous
    Temperature (Temp) (°C)                            Continuous
    Pulse Oximetry (SpO2) (%)                          Continuous
    Systolic Blood Pressure (SBP) (mmHg)               Continuous
    Diastolic Blood Pressure (DBP) (mmHg)              Continuous
  Pain Scale (PS)                                      Categorical   0 - 10
  Gender                                               Categorical   1 (Female), 0 (Male)
Engineered Variables
  Missing Value in the RR variable (RR in)             Categorical
  Abnormal Values in the RR variable (RR out)          Categorical
  Missing Value in the HR variable (HR in)             Categorical
  Abnormal Values in the HR variable (HR out)          Categorical
  Missing Value in the DBP variable (DBP in)           Categorical
  Abnormal Values in the DBP variable (DBP out)        Categorical
  Missing Value in the SBP variable (SBP in)           Categorical
  Abnormal Values in the SBP variable (SBP out)        Categorical
  Missing Value in the Temp variable (Temp in)         Categorical
  Abnormal Values in the Temp variable (Temp out)      Categorical
  Missing Value in the Gly variable (Gly in)           Categorical
  Abnormal Values in the Gly variable (Gly out)        Categorical
  Missing Value in the Oxi variable (Oxi in)           Categorical
  Abnormal Values in the Oxi variable (Oxi out)        Categorical
  Missing Value in the PS variable (PS in)             Categorical
  Missing Value in the GCS variable (GCS in)           Categorical
Text
  Chief Main Complaint                                 Textual
Additional
  Manchester Triage System (MTS)                       Categorical   1 (Emergent), 2 (Very urgent), 3 (Urgent), 4 (Standard), 5 (Non Urgent)
  Arrival mode                                         Categorical   1 (Walk-in), 2 (Ambulance), 3 (Not registered in system)
  Triage discriminators                                Categorical   1 - 118
  Glasgow Coma Scale (GCS)                             Categorical   3 - 15
  Number of exams                                      Categorical   0 - 3 or more
  Number of missing vitals+pain level                  Categorical   0 - 7
  Number of abnormal vital signs                       Categorical   0 - 4 or more
  Triage hour                                          Categorical   0 - 23
  Triage weekday                                       Categorical   1 - 7
  Triage month                                         Categorical   1 - 12
  Mean Arterial Blood Pressure (MAP) (mmHg)            Continuous
  Waiting time for triage (Admn2Tr) (min)              Continuous
  Glycemia (Gly) (mg/dL)                               Continuous
Outcome
  ED Revisits                                          Categorical   1 (Revisit), 0 (No Revisit)
With respect to the outcome, the original revisits dataset includes patients who returned to the
ED within 72 hours; in some cases, patients return as little as 6 seconds after being discharged. Only patients
who returned to the ED more than 1 hour after being discharged were considered as return visits. Patients
who returned in an interval shorter than an hour were patients who: were hospitalized; had a modification
in some field, e.g. chief main complaint, discriminator, etc.; went into labor; among other reasons. A
total of 28 973 patients satisfied the criteria for being considered as a patient who returned to the ED,
which corresponds to 5.7% of all patients considered for this study.
After filtering the dataset some exploratory data analysis was made by means of data visualization.
Firstly, univariate analysis was performed for a limited set of numerical features, illustrated in figure 5.2,
due to the number of features. The proportion of patients that revisit the ED is very similar between
age groups, with the exception of age group 40-64, where the percentage of revisits is approximately 2%
lower than in the remaining groups, as depicted in subfigure 5.2a. The cohort is mainly
composed of female patients but the percentage of revisits is basically the same between genders,
as illustrated in subfigure 5.2b. Secondly, one resorted to multivariate analysis by means of the Pearson
Correlation between each feature and the outcome, represented by the heatmap in figure 5.3, where
light blue means that the correlation is negative (if one value increases the other decreases), dark blue corresponds
to highly correlated features, and cyan blue corresponds to uncorrelated variables. As one can see,
there is no high correlation between the variables and the outcome, but some features are highly
correlated with each other. E.g., the Mean Arterial Blood Pressure (MAP) is highly correlated with the Systolic Blood Pressure
(SBP) and Diastolic Blood Pressure (DBP) since MAP can be calculated using both values of SBP
and DBP as MAP = (SBP + 2 × DBP)/3. As an example of highly negatively correlated variables one
has the engineered features abnormal values and number of missing vitals (n missing vitals), which
seems reasonable since with the increase of missing data there are fewer abnormal values present in
the data. Focusing now on the textual data, a comparison of the top 20 most frequent N-Grams for
each class was made, as illustrated in figure 5.4. In order to better understand the meaning of those N-Grams, the
chief main complaint was used before the stemming preprocessing step. There is a total of 26 different
words among the unigrams presented, and it is noticeable that the top 20 unigrams for patients that did
not return have more medical terms associated with them, e.g. ’cefaleia’ (’cephalgia’). Turning
now to the top 20 bigrams, one notices that some bigrams of patients that have returned to the ED
represent pregnant female patients. For the other class there are several temporal references and pain
complaints. Finally, the top 20 trigrams of patients revisiting the ED agree with the result for bigrams since
they are associated with pregnant patients. Since the female gender has greater representation within
the data, a higher number of terms associated with this gender is expected. The terms associated
with females are mostly related to pregnancy. There are approximately 20 555 pregnant patients in
the dataset, corresponding to 4.02% of the cohort. Among these patients, 2 482 were considered as
readmitted to the ED, corresponding to 8.57% of all patients belonging to class 1, i.e. readmitted
patients.
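The relation between MAP, SBP, and DBP, and the resulting correlation, can be illustrated with a short sketch. The blood pressure values are hypothetical and the function names `mean_arterial_pressure` and `pearson` are chosen for this example only.

```python
import math


def mean_arterial_pressure(sbp, dbp):
    """MAP = (SBP + 2 * DBP) / 3, in mmHg."""
    return (sbp + 2 * dbp) / 3


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical blood pressure measurements (mmHg).
sbp = [110, 120, 135, 150, 160]
dbp = [70, 80, 85, 95, 100]
map_values = [mean_arterial_pressure(s, d) for s, d in zip(sbp, dbp)]
print(pearson(map_values, sbp), pearson(map_values, dbp))
```

Because MAP is a linear combination of SBP and DBP, a high positive correlation with both is expected by construction.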
The dataset was partitioned into two sets using shuffling and stratified sampling with respect to the
outcome due to data imbalance. Those sets are the training and testing sets, where the training set
comprises 70% of all the available data, i.e. 357 910 patients, and the testing set has the remaining
30% of the data, i.e. 153 391 patients.
[Figure 5.2 charts omitted. The proportions below are paired with the groups in the order in which they
appear in the charts.]
(a) Distribution of patients by age groups with the respective proportions of revisits (1) and no revisit
(0) for each group. 18 - 39: 0 - 93.8 %, 1 - 6.2 %; 40 - 64: 0 - 95.4 %, 1 - 4.6 %; 65 - 84: 0 - 94.1 %,
1 - 5.9 %; 85+: 0 - 93.3 %, 1 - 6.7 %.
(b) Distribution of patients according to gender with the respective proportions of revisits (1) and no
revisit (0) for each group. Male: 0 - 94.6 %, 1 - 5.4 %; Female: 0 - 94.2 %, 1 - 5.8 %.
Figure 5.2: Univariate Analysis of age groups and gender with respect to the outcome. Age Groups
according to [103–105].
[Figure 5.3 heatmap omitted. Both axes list: Arrival Mode, MTS, Gender, Adm2Tr_min, Triage Hour,
Triage Month, Triage Weekday, Triage Discriminators, PS, HR, DBP, SBP, Temp, Gly, GCS, Oxi, RR,
MAP, PS_in, HR_in, DBP_in, SBP_in, T_in, Gly_in, Glas_in, Oxi_in, RR_in, HR_out, DBP_out, SBP_out,
Temp_out, Gly_out, Oxi_out, RR_out, Number of missing vitals, Number of abnormal vitals, Number of
Exams, Age, Outcome.]
Figure 5.3: Pearson Correlation between features, described in table 5.1, and the outcome.
[Figure 5.4 bar charts omitted. Each panel ranks the 20 most frequent N-Grams by frequency.]
(a) Unigram Frequency for patients who returned to the ED.
(b) Unigram Frequency for patients who did not return to the ED.
(c) Bigram Frequency for patients who returned to the ED.
(d) Bigram Frequency for patients who did not return to the ED.
(e) Trigram Frequency for patients who returned to the ED.
(f) Trigram Frequency for patients who did not return to the ED.
Figure 5.4: Comparison between the top 20 most frequent N-Grams (N = 1, 2, 3), before stemming, for
each label.
5.2 Model Comparison
The different hypotheses of the considered hypothesis space are the following:
• Baseline - Baseline variables described in table 5.1 were considered;
• All Numeric - The Baseline and Additional variables were used;
• Textual - Only the main chief complaint was used;
• Textual and Baseline - The Baseline variables were used along with the main chief complaint;
• Textual and All Numeric - The Textual variable was used alongside the All Numeric variables.
The developed models using the Baseline variables showed low predictive power, as can be seen
in table A.1, since they present low values regarding the AUC-ROC metric. This suggests that using
only the baseline features generates models that cannot discriminate the true state of each patient in
the cohort and that one should consider more features to improve model performance. From the results of
McNemar’s Hypothesis Test, given a significance level of 5%, it is concluded that the MNB and CNB
models failed to reject the null hypothesis, H0, as illustrated in table 5.7. This means that the error
rate is similar between these models, which was expected since their results are the same. Among
these models, the one that shows the best performance is the LR model using the SGD optimization
technique to compute the weights, with a value of AUC-ROC = 0.609 and values of 0.567 and 0.630 for
the sensitivity and specificity measures, respectively.
When one considers the hypothesis of using all of the available continuous and categorical numerical
variables, i.e. using the All Numeric predictor set, one sees an improvement in the performance
of the models when compared with the ones that only considered the Baseline variables, as illustrated
in table A.2. Despite the performance improvement there still exists a gap between
sensitivity and specificity, since these models show a better sensitivity than specificity. As in
the conclusion regarding McNemar’s Hypothesis Test for the Baseline hypothesis, the only models that
showed similar error rates were the MNB and CNB models, as shown in table 5.7. This conclusion was
expected given that these models presented the same predictive skill. Resorting only to the AUC-ROC
metric there is a tie for the best model between the SVM and, once again, the LR models, with a
value of AUC-ROC = 0.782. The respective sensitivity and specificity values are 0.731 and 0.674 for
the LR model and 0.734 and 0.672 for the SVM model. In order to choose a model one resorted to the
uncertainty of each performance metric using the computed 95% bootstrapped confidence intervals, as
illustrated in table 5.2.
Given this, the best model is the LR model, since it has a lower uncertainty regarding its predictive
skill than the SVM model. Comparing the LR model under this hypothesis with the one using only the
Baseline variables, one notes improvements of 17.3, 18.7, and 5.9 percentage points for the AUC-ROC,
Sensitivity, and Specificity, respectively.
Table 5.2: The uncertainty associated with the candidate models under the All Numeric Hypothesis.
All Numeric Hypothesis
Candidate Models   AUC-ROC                 Sensitivity             Specificity
LR                 78.5% − 77.8% = 0.7%    74.2% − 72.0% = 2.2%    67.8% − 66.3% = 1.5%
SVM                78.5% − 77.8% = 0.7%    76.8% − 72.3% = 4.5%    67.8% − 64.4% = 3.4%
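The uncertainty in the table above is the width of a 95% bootstrapped confidence interval (upper bound
minus lower bound). A minimal sketch of how such a width could be obtained for one metric is given below.
It assumes the percentile bootstrap with 1000 resamples of the test set; neither detail is stated in this
section, and the function names and toy data are hypothetical.

```python
import random


def sensitivity(y_true, y_pred):
    """True positive rate; NaN if the sample contains no positive case."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test cases with replacement, keeping labels and predictions paired.
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # drop NaN (resample without positive cases)
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical toy test set: 50 revisits, 450 non-revisits.
y_true = [1] * 50 + [0] * 450
y_pred = [1] * 35 + [0] * 15 + [1] * 150 + [0] * 300
lo, hi = bootstrap_ci(y_true, y_pred, sensitivity)
print(f"95% CI: [{lo:.3f}, {hi:.3f}], width = {hi - lo:.3f}")
```

A narrower interval indicates a more stable estimate, which is the criterion used above to break the tie
between LR and SVM.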
Focusing now on the Textual hypothesis, one notes a slight improvement in the AUC-ROC metric, but
the gap between sensitivity and specificity mentioned before remains. Unlike the models developed
using the All Numeric variables, these models are better at discriminating patients that did not
return to the ED than patients who did, since specificity is higher than sensitivity, as can be seen
in table A.3. Looking only at the AUC-ROC performance metric, there are five candidates for the best
model: the LR and SVM with both the BoW and tf-idf feature extraction techniques, and the LR with the
SGD optimization technique and tf-idf as the feature extraction approach. Taking the sensitivity and
specificity values into account as well, one adds the MNB and CNB with tf-idf as the feature extraction
strategy. The results of McNemar’s Hypothesis Test, illustrated in table 5.7, show that one pair of
candidate models failed to reject H0: the LR with BoW features and the SVM with tf-idf features, with a
p-value of 0.459. This means that these two models misclassify with similar error proportions, although
not necessarily on the same cases. Due to this result, of these two models only the LR with BoW
features will be considered in the uncertainty analysis, illustrated in table 5.3.
Table 5.3: The uncertainty associated with the candidate models under the Textual Hypothesis.
Textual Hypothesis
Candidate Models   AUC-ROC                 Sensitivity             Specificity
LR BoW             79.1% − 78.0% = 1.1%    71.5% − 67.5% = 4.0%    74.9% − 70.8% = 4.1%
SVM BoW            79.1% − 78.0% = 1.1%    70.9% − 67.8% = 3.1%    74.4% − 71.7% = 2.7%
LR tf-idf          79.4% − 78.3% = 1.1%    70.3% − 67.7% = 2.6%    75.0% − 72.9% = 2.1%
LR SGD tf-idf      79.4% − 78.3% = 1.1%    71.2% − 65.9% = 5.3%    77.1% − 71.8% = 5.3%
MNB tf-idf         77.9% − 76.8% = 1.1%    70.4% − 67.0% = 3.4%    74.5% − 71.1% = 3.4%
CNB tf-idf         77.9% − 76.8% = 1.1%    70.2% − 66.8% = 3.4%    76.2% − 73.0% = 3.2%
Given the analysis of the uncertainty of the candidate models, the one with the lowest level of
uncertainty is the LR model with tf-idf features, whose values for AUC-ROC, Sensitivity, and
Specificity are 0.789, 0.688, and 0.741, respectively. Comparing this LR model with the best one that
uses all numerical variables shows improvements of 0.7 and 6.7 percentage points in AUC-ROC and
Specificity, respectively, and a decline of 4.3 percentage points in Sensitivity.
Under the Textual and Baseline hypothesis there are no major improvements in model performance
compared with the models that only used the chief main complaint, as illustrated in table A.4. As
before, analyzing only the values of AUC-ROC, Sensitivity, and Specificity for the different models,
the ones considered as candidates are the LR, LR with SGD, CNB, and SVM models using the BoW
approach to extract features from the chief main complaint, and the LR, LR with SGD, MNB, CNB,
and SVM models with tf-idf features. As illustrated in table 5.7, some of the candidate models present
similar error rates, as they were not able to reject H0. As before, given the result of McNemar’s
Hypothesis Test, only some models will be considered in the uncertainty analysis, illustrated in
table 5.4.
Comparing the uncertainty in Sensitivity, no conclusions can be drawn, since the results are very
similar between models. Regarding the uncertainty in Specificity, the SVM model with tf-idf features
is the one with the least uncertainty. Since this model and the LR with SGD training and tf-idf
features have similar error proportions, as depicted in table 5.7, one needs to compare the uncertainty
of these two models, illustrated in table 5.5, to conclude which one is the best under this hypothesis.
Table 5.4: The uncertainty associated with the candidate models under the Textual and Baseline Hypothesis.
Textual and Baseline Hypothesis
Candidate Models   AUC-ROC                 Sensitivity             Specificity
LR SGD BoW         78.8% − 77.7% = 1.1%    70.5% − 67.1% = 3.4%    75.1% − 71.1% = 4.0%
SVM BoW            79.4% − 78.3% = 1.1%    70.9% − 67.5% = 3.4%    75.4% − 71.8% = 3.6%
LR tf-idf          79.7% − 78.5% = 1.2%    70.0% − 66.8% = 3.2%    76.1% − 72.7% = 3.4%
SVM tf-idf         79.6% − 78.5% = 1.1%    70.2% − 66.7% = 3.5%    76.3% − 73.6% = 2.7%

Table 5.5: Comparison of the uncertainty between candidate models with similar error proportions under the Textual and Baseline Hypothesis.
Textual and Baseline Hypothesis
Candidate Models   AUC-ROC                 Sensitivity             Specificity
LR SGD tf-idf      79.5% − 78.4% = 1.1%    71.7% − 66.3% = 5.4%    77.5% − 71.3% = 6.2%
SVM tf-idf         79.6% − 78.5% = 1.1%    70.2% − 66.7% = 3.5%    76.3% − 73.6% = 2.7%
Since the uncertainty of the LR model with SGD training and tf-idf features is higher than that of the
SVM with tf-idf features, the latter is the best model under the Textual and Baseline hypothesis, with
values of AUC-ROC, Sensitivity, and Specificity of 0.791, 0.692, and 0.739, respectively. Compared with
the LR model with tf-idf features under the Textual hypothesis, this is an improvement of 0.2 and 0.4
percentage points in AUC-ROC and Sensitivity, respectively, and a decline of 0.4 percentage points in
Specificity. As already stated, no significant changes occurred in the performance of the
developed models under the Textual and Baseline hypothesis.
Analyzing the results of the last considered hypothesis, Textual and All Numeric, as illustrated in
table 5.8, one notes that the gap between Sensitivity and Specificity is much smaller and that
performance under this hypothesis has improved substantially. Following the same approach as
before, the candidate models are the LR and SVM with BoW features, and the LR, MNB, and SVM with
tf-idf features. Among the candidates, the ones that showed no statistical difference in performance
were the LR and MNB when the tf-idf method was used to extract features from the chief main
complaint, as can be seen in the last row of table 5.7. Given this, of these two models only the LR
model will be considered when analyzing the uncertainty of each model.
The uncertainty of each candidate model, illustrated in table 5.6, indicates a tie between the LR and
SVM when the textual features were extracted with the tf-idf technique. The performance values
of AUC-ROC, Sensitivity, and Specificity are, respectively, 0.842, 0.768, and 0.731 for the LR model,
and 0.855, 0.754, and 0.766 for the SVM. The major difference between these models is in Specificity,
i.e. in correctly identifying patients that did not return to the ED. Looking at the
hyper-parameters of each model, and remembering that each N-Gram is a feature, the most complex
model is the SVM since, in order to achieve these results, it required unigrams, bigrams, and
trigrams, with the training vocabulary reduced to 9 500, while the LR model achieved similar results
using only 1 000 of the unigrams in the training set. Given that parsimonious
models are preferable and that the LR required fewer features, this model is considered the best
one. Since there is no statistical difference between the performance of the LR and the MNB models,
they also need to be compared. Regarding the hyper-parameters, the MNB model using tf-idf features is
more complex than the LR model, since it uses unigrams and bigrams and the training vocabulary was only
reduced to 29 000. Given this, the best model under the Textual and All Numeric hypothesis is the LR
model, showing improvements of 5.1 and 7.6 percentage points in AUC-ROC and Sensitivity, and a decline
of 0.8 percentage points in Specificity, when compared to the SVM model with tf-idf features under the
Textual and Baseline hypothesis.
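The complexity argument above rests on how the N-Gram range and the vocabulary cap determine the number of
textual features. The sketch below illustrates this with a plain-Python Bag of N-Grams. The function
names, the `max_features` rule (keep the most frequent terms, ties broken alphabetically), and the toy
complaints are assumptions made for illustration, not the thesis’s implementation.

```python
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """All word N-Grams of a token list for n_min <= N <= n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, ngram_range=(1, 1), max_features=None):
    """Map each kept N-Gram to a feature index; keep the most frequent terms."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.split(), *ngram_range))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = sorted(term for term, _ in ranked[:max_features])
    return {term: j for j, term in enumerate(kept)}


def bag_of_ngrams(doc, vocab, ngram_range=(1, 1)):
    """Count vector of a document over a fixed vocabulary."""
    vec = [0] * len(vocab)
    for term in ngrams(doc.split(), *ngram_range):
        if term in vocab:
            vec[vocab[term]] += 1
    return vec


# Hypothetical preprocessed chief complaints.
docs = ["dor membro superior direito", "dor membro inferior esquerdo", "febre tosse"]
print(len(build_vocabulary(docs, (1, 1))))  # 8 unigram features
print(len(build_vocabulary(docs, (1, 3))))  # 18 features with uni-, bi- and trigrams
small = build_vocabulary(docs, (1, 1), max_features=3)
print(small, bag_of_ngrams(docs[0], small))
```

Widening the N-Gram range or raising the cap grows the feature space, which is why the unigram-only LR
model with 1 000 features is the more parsimonious choice.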
Table 5.6: The uncertainty associated with the candidate models under the Textual and All Numeric Hypothesis.
Textual and All Numeric Hypothesis
Candidate Models   AUC-ROC                 Sensitivity             Specificity
LR BoW             86.0% − 85.1% = 0.9%    78.9% − 75.8% = 3.1%    76.6% − 74.4% = 2.2%
SVM BoW            86.4% − 85.6% = 0.8%    79.9% − 74.7% = 5.2%    78.7% − 73.7% = 5.0%
LR tf-idf          86.4% − 85.6% = 0.8%    77.9% − 75.3% = 2.6%    74.6% − 72.3% = 2.3%
SVM tf-idf         85.9% − 85.1% = 0.8%    77.4% − 74.9% = 2.5%    76.9% − 74.6% = 2.3%
To better understand how the performance of each model varies across all of the considered
hypotheses, figure 5.5 shows six radar plots, where each plot represents a learning algorithm and its
performance under each hypothesis. It was necessary to split each hypothesis that considered textual
data into two, due to the different feature extraction approaches that were used.
Table 5.7: McNemar’s Hypothesis Test results with respect to pairs of machine learning models that failed to reject the null hypothesis given a significance level of 5% for the 5 predictor sets.
Model 1              Model 2           P-Value   Hypothesis
MNB                  CNB               1.000     Baseline
MNB                  CNB               1.000     All Numeric
LR BoN-Grams         SVM TF-IDF        0.459     Textual
LR SGD BoN-Grams     SVM TF-IDF        0.120     Textual
MNB BoN-Grams        LR SGD TF-IDF     0.281     Textual
LR BoN-Grams         LR TF-IDF         0.210     Textual and Baseline
LR SGD BoN-Grams     MNB TF-IDF        0.715     Textual and Baseline
LR SGD BoN-Grams     CNB TF-IDF        0.727     Textual and Baseline
LR SGD TF-IDF        SVM TF-IDF        0.149     Textual and Baseline
MNB TF-IDF           CNB TF-IDF        0.994     Textual and Baseline
MNB BoN-Grams        CNB BoN-Grams     1.000     Textual and All Numeric
SVM SGD BoN-Grams    SVM SGD TF-IDF    0.893     Textual and All Numeric
SVM SGD BoN-Grams    CNB TF-IDF        0.227     Textual and All Numeric
LR TF-IDF            MNB TF-IDF        0.486     Textual and All Numeric
LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; BoN-Grams - Bag of N-Grams; TF-IDF - Term Frequency-Inverse Document Frequency
[Figure 5.5: six radar plots, one per learning algorithm. Axes: AUC, Sensitivity, Specificity, F1-Score,
Precision, Kappa. Series: Baseline; All Numeric; Textual BoNGram; Textual tf-idf; Textual Baseline BoNGram;
Textual Baseline tf-idf; Textual All Numeric BoNGram; Textual All Numeric tf-idf.]
(a) Logistic Regression models for each hypothesis.
(b) Logistic Regression models, using Stochastic Gradient Descent, for each hypothesis.
(c) Multinomial Naive Bayes models for each hypothesis.
(d) Complement Naive Bayes models for each hypothesis.
(e) Support Vector Machine models for each hypothesis.
(f) Support Vector Machine models, using Stochastic Gradient Descent, for each hypothesis.
Figure 5.5: Comparison between the performance of the Machine Learning models according to each of the hypotheses.
Table 5.8: Results for the machine learning models in test using the main chief complaint and all of the numerical variables, with their respective hyper-parameters. Results for the performance metrics are given as Performance [95% Bootstrap Confidence Interval].

Bag of N-Grams
Model     AUC-ROC               Sensitivity           Specificity           F1-Score              Precision             Cohen’s Kappa
LR        0.856 [0.851, 0.860]  0.778 [0.758, 0.789]  0.748 [0.744, 0.766]  0.261 [0.256, 0.270]  0.157 [0.153, 0.164]  0.184 [0.179, 0.194]
LR SGD    0.842 [0.838, 0.846]  0.762 [0.740, 0.769]  0.734 [0.730, 0.751]  0.246 [0.242, 0.255]  0.147 [0.144, 0.153]  0.167 [0.163, 0.177]
MNB       0.826 [0.822, 0.830]  0.740 [0.731, 0.754]  0.740 [0.726, 0.746]  0.244 [0.237, 0.248]  0.146 [0.141, 0.149]  0.165 [0.157, 0.169]
CNB       0.826 [0.822, 0.830]  0.740 [0.731, 0.754]  0.740 [0.726, 0.746]  0.244 [0.237, 0.248]  0.146 [0.141, 0.149]  0.165 [0.157, 0.169]
SVM       0.860 [0.856, 0.864]  0.792 [0.747, 0.799]  0.740 [0.737, 0.787]  0.259 [0.253, 0.283]  0.155 [0.151, 0.175]  0.181 [0.176, 0.211]
SVM SGD   0.840 [0.836, 0.844]  0.763 [0.730, 0.778]  0.724 [0.712, 0.757]  0.240 [0.238, 0.254]  0.143 [0.123, 0.167]  0.160 [0.125, 0.170]

Hyper-Parameters (Bag of N-Grams)
LR        C = 0.01; Regularization: L2; N-Gram Range: (1, 3); max features: 25 000
LR SGD    λ = 10^-5; Adaptive Learning Rate; η0 = 0.0001; Regularization: L1; Number of Iterations: 107; N-Gram Range: (1, 3); max features: All Vocabulary
MNB       α = 0.1; N-Gram Range: (2, 2); max features: All Vocabulary
CNB       α = 0.1; N-Gram Range: (2, 2); max features: All Vocabulary
SVM       C = 0.001; Regularization: L2; N-Gram Range: (1, 2); max features: 29 000
SVM SGD   λ = 10^-5; Adaptive Learning Rate; η0 = 0.01; Regularization: L2; Number of Iterations: 107; Number iterations no change: 10; N-Gram Range: (1, 3); max features: All Vocabulary

TF-IDF
Model     AUC-ROC               Sensitivity           Specificity           F1-Score              Precision             Cohen’s Kappa
LR        0.842 [0.838, 0.846]  0.768 [0.753, 0.779]  0.731 [0.723, 0.746]  0.246 [0.241, 0.254]  0.147 [0.143, 0.152]  0.167 [0.163, 0.176]
LR SGD    0.829 [0.826, 0.833]  0.772 [0.745, 0.782]  0.698 [0.692, 0.723]  0.227 [0.223, 0.236]  0.133 [0.130, 0.140]  0.144 [0.140, 0.154]
MNB       0.832 [0.828, 0.836]  0.757 [0.745, 0.771]  0.731 [0.722, 0.740]  0.243 [0.237, 0.249]  0.145 [0.140, 0.149]  0.163 [0.158, 0.170]
CNB       0.831 [0.828, 0.835]  0.762 [0.739, 0.777]  0.723 [0.710, 0.740]  0.239 [0.233, 0.246]  0.142 [0.137, 0.148]  0.159 [0.152, 0.167]
SVM       0.855 [0.851, 0.859]  0.754 [0.749, 0.774]  0.766 [0.746, 0.769]  0.267 [0.255, 0.272]  0.162 [0.153, 0.165]  0.191 [0.178, 0.196]
SVM SGD   0.824 [0.820, 0.827]  0.738 [0.703, 0.766]  0.716 [0.690, 0.746]  0.229 [0.219, 0.241]  0.136 [0.128, 0.145]  0.148 [0.136, 0.162]

Hyper-Parameters (TF-IDF)
LR        C = 0.1; Regularization: L2; N-Gram Range: (1, 1); max features: 1 000
LR SGD    λ = 10^-5; Adaptive Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 107; N-Gram Range: (1, 1); max features: 1 000
MNB       α = 0.1; N-Gram Range: (1, 2); max features: 29 000
CNB       α = 0.1; N-Gram Range: (1, 3); max features: 29 000
SVM       C = 0.01; Regularization: L2; N-Gram Range: (1, 3); max features: 9 500
SVM SGD   λ = 10^-5; Adaptive Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 107; Number iterations no change: 10; N-Gram Range: (1, 1); max features: 1 000

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC
- Area Under Curve-Receiver Operating Characteristic; TF-IDF - Term Frequency-Inverse Document Frequency; η0 - Initial Learning Rate; α - Additive Smoothing; C - Inverse
of Regularization Strength; λ - Regularization term
Chapter 6
Conclusion
6.1 Summary
In this work, ML techniques were applied to develop a clinical decision support system with the
purpose of predicting adult patients’ ED revisits within 72 hours after discharge. Patients’ physiological
variables, demographic information, and chief main complaint, available at the time of triage, were
used for modeling. In order to extract knowledge from the chief main complaint, an NLP framework was
developed to preprocess the textual data.
During the data modeling step, several ML models were developed for the purpose of identifying
the most suitable one. To understand the influence of the predictors on the performance of the
predictive models, five hypotheses regarding predictors were made. Two of those hypotheses only
contemplated numerical data (Baseline and All Numeric). One hypothesis only used the chief main
complaint textual information (Textual). The remaining two hypotheses considered both numerical and
textual data (Textual and Baseline and Textual and All Numeric). When the chief main complaint was
used as a predictor, two feature extraction techniques (BoW and tf-idf) were considered with the goal
of comparing them. Due to the imbalance between classes, the models were developed under a
cost-sensitive strategy in order to make the algorithms aware of the imbalance.
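One common way to realize a cost-sensitive strategy is to weight each class inversely to its frequency, so
that errors on the minority class cost more during training. The sketch below assumes this inverse-frequency
("balanced") heuristic, which this summary does not name explicitly; the function names and the toy label
vector are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Class-weighted binary cross-entropy; probs are P(label == 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, 1e-12), 1 - 1e-12)  # avoid log(0)
        total += -weights[y] * (math.log(p) if y == 1 else math.log(1 - p))
    return total / sum(weights[y] for y in labels)


# Hypothetical sample with a 5.7% revisit rate (57 revisits in 1000 visits).
labels = [1] * 57 + [0] * 943
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

With these weights both classes contribute the same total weight to the loss, so a model can no longer
minimize it by always predicting the majority class.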
In order to avoid model overfitting, the dataset was partitioned into two sets in a randomized and
stratified fashion. One set was used for model development and the other for model testing against
unseen data. Regularization techniques were taken into account and the models were developed using
10-fold cross-validation. Randomized Search was used during the hyper-parameter tuning step to find
the best set of hyper-parameters that maximized the ROC-AUC score. Given the existence of several
probability thresholds that can discriminate between classes, Youden’s Index was used to determine
the cut-off probability that guarantees a compromise between Sensitivity and Specificity.
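Youden’s Index is J = Sensitivity + Specificity − 1, and the cut-off is the threshold that maximizes J.
A minimal sketch of this selection is shown below, assuming predicted probabilities for the positive
class and a "score >= threshold means Revisit" rule; the function name and the toy data are hypothetical.

```python
def youden_threshold(y_true, scores):
    """Return (J, threshold, sensitivity, specificity) maximizing Youden's J.

    Requires at least one positive and one negative case in y_true.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    best = (float("-inf"), None, None, None)
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < thr)
        sens, spec = tp / pos, tn / neg
        j = sens + spec - 1
        if j > best[0]:
            best = (j, thr, sens, spec)
    return best


# Hypothetical predicted probabilities for eight patients.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.7, 0.9]
j, thr, sens, spec = youden_threshold(y_true, scores)
print(f"threshold={thr}, J={j:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On the toy data the selected threshold is 0.45, where all three revisits are detected and four of the
five non-revisits are correctly rejected.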
During the model assessment phase, several performance metrics were considered to best describe
the quality of the model. Due to the stochastic nature of these statistical techniques, 95% confidence
intervals were computed, using bootstrapping, to take into account the uncertainty regarding these
models. As a means to compare ML models, one resorted to McNemar’s Hypothesis Test in order to verify
whether the difference in error rates between two models is statistically significant, given a 5%
significance level.
6.2 Final Remarks
As already stated, several ML algorithms and sets of predictors were considered in this study and
were compared.
Regarding the Baseline hypothesis, the models developed under this hypothesis were not capable
of discriminating the true state of each patient in the cohort. This is noticeable since the best model, LR
with SGD, achieves a ROC-AUC, Sensitivity, Specificity, F1-Score, Precision and Cohen’s Kappa (κ) of
0.609, 0.544, 0.615, 0.146, 0.0837, and 0.0517, respectively, as shown in table 6.1.
Under the All Numeric hypothesis, the performance of the developed models improves and the
results regarding the best model, LR, are illustrated in table 6.1. The performance measures whose
impact was more significant were the ROC-AUC and Sensitivity, with improvements of 0.173 and 0.187,
respectively. Despite the improvements, there exists a gap between Sensitivity and Specificity since
Sensitivity > Specificity.
Considering the Textual hypothesis, there was a slight improvement regarding ROC-AUC but there
still exists a gap between Sensitivity and Specificity. Unlike the previous hypothesis, under the present
one Specificity > Sensitivity for the best model: LR with tf-idf features.
The relation between Sensitivity and Specificity under the two previous hypotheses suggests that using
only numerical data generates models better able to correctly identify patients that will revisit the ED,
whereas when only the chief main complaint is used, the models better predict patients that will not
revisit the ED. This implies that using both numerical and textual data generates models that are
capable of predicting both classes.
Given the previous statement, the results of the best model under the Textual and Baseline
hypothesis show little to no improvement in the model’s prediction capability when compared
with the best model under the Textual hypothesis. As with the previous hypotheses, there is a gap
between Sensitivity and Specificity under the present hypothesis.
Assessing the last hypothesis, Textual and All Numeric, there is a significant increase in ROC-AUC
to 0.842, corresponding to an increase of 5.1 percentage points when compared with the previous
hypothesis, along with a compromise between Sensitivity and Specificity. Despite the high
values of Sensitivity and Specificity, the model’s Precision is low. This means that the model
is capable of predicting the minority class (Revisit) with few False Negatives, but it assigns
several patients to the Revisit class when they did not revisit the ED, i.e. the model
produces several False Positives.
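The low Precision follows directly from the class imbalance. As a worked check, the sketch below derives
Precision from Sensitivity, Specificity, and prevalence using Bayes’ rule. It assumes that the test set has
the 5.7% average revisit rate reported in the abstract (plausible under the stratified split, but not
stated for the test set itself); the function name is hypothetical.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision (positive predictive value) implied by the three rates."""
    true_pos = sensitivity * prevalence                # share of all cases that are TP
    false_pos = (1 - specificity) * (1 - prevalence)   # share of all cases that are FP
    return true_pos / (true_pos + false_pos)


# Best model under Textual and All Numeric (LR tf-idf): Sens. 0.768, Spec. 0.731.
ppv = precision_from_rates(0.768, 0.731, 0.057)
print(round(ppv, 3))  # 0.147
```

Under that assumption the result agrees with the reported Precision of 0.147: about 27% of the large
non-revisit class is flagged, which outnumbers the true revisits roughly six to one.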
Table 6.1: Best results for each hypothesis.
Hypothesis   Model        ROC-AUC   Sensitivity   Specificity   F1-Score   Precision   κ
HB           LR SGD       0.609     0.544         0.615         0.146      0.0837      0.0517
HAN          LR           0.782     0.731         0.674         0.204      0.118       0.118
HT           LR tf-idf    0.789     0.688         0.741         0.230      0.138       0.149
HTB          SVM tf-idf   0.791     0.692         0.739         0.230      0.138       0.149
HTAN         LR tf-idf    0.842     0.768         0.731         0.246      0.147       0.167
HB - Baseline; HAN - All Numeric; HT - Textual; HTB - Textual and Baseline; HTAN - Textual and All Numeric;
κ - Cohen’s Kappa
Based on all the developed work and on what has been stated above, one can now answer the three
research questions:
• RQ1 - Does the textual data increase the revisits prediction power?
Answer: Given the obtained results, textual data enhances the prediction of adult patient ED revis-
its within 72 hours after discharge.
• RQ2 - Which ML model and textual feature extraction technique are most suitable for predicting
those group-risk patients?
Answer: From the results, the most suited ML algorithm is the LR when using tf-idf features.
• RQ3 - Which features best describe the risk of ED revisits?
Answer: The hypothesis that had the combination of features that resulted in best results is the
Textual and All Numeric hypothesis.
In accordance with the results and conclusions, the main chief complaint contained relevant
information indicating whether a patient belonged to a risk group. Using this data source
alongside patient information gathered at the moment of triage resulted in models with good predictive
performance. This work contributed to advancing knowledge within the area and indicated a promising
way to develop clinical decision support systems to predict adult patient ED revisits within 72 hours after
discharge.
6.3 Future Work
Since the topic of predicting ED revisits is not vastly explored in the literature, there are still oppor-
tunities for further enhancements. The next paragraphs include some of the author’s suggestions for
future work.
Regarding textual feature extraction techniques, it would be interesting to develop a Word Embedding
representation to capture semantic similarity beyond the trivial level of considering language models
based on N-Grams. One could resort to the Word2Vec [106] or Doc2Vec [107] approaches. These word
representations can be trained using several biomedical texts like published works and books.
Considering the low performance regarding Precision (due to class imbalance), it would be worthwhile
to consider ensemble methods like bagging or boosting. Ensemble methods improve the prediction
capability of ML methods by combining multiple weak learners, which allows them to produce
better predictions than a single model. Another suggestion is to use Fuzzy Fingerprinting
[108], due to its promising results in authorship identification.
Given the recent advances in the fields of ML and AI, it is appealing to resort to Deep Learning
methods like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), e.g. Long
Short-Term Memory Networks (LSTMNs). CNNs are good with hierarchical or spatial data and extracting
unlabeled features like written characters. LSTMNs are good at temporal or sequential data, like words
in a body of text. LSTMNs are a variant of RNNs that allow for controlling how much of prior training
data should be remembered. One could also combine both networks and develop a CNNs+LSTMNs
architecture which involves using the CNNs layers for feature extraction on input data combined with
LSTMNs to support sequence prediction.
Bibliography
[1] S. Ram, W. Zhang, M. Williams, and Y. Pengetnze, “Predicting asthma-related emergency depart-
ment visits using big data,” IEEE Journal of Biomedical and Health Informatics, vol. 19, pp. 1216–
1223, July 2015.
[2] M. Adibuzzaman, P. DeLaurentis, J. Hill, and B. Benneyworth, “Big data in healthcare - the
promises, challenges and opportunities from a research perspective: A case study with a model
database,” AMIA Annual Symposium proceedings. AMIA Symposium, vol. 2017, pp. 384–392,
April 2018.
[3] V. Yadav, M. Verma, and V. D. Kaushik, “Big data analytics for health systems,” in 2015 Interna-
tional Conference on Green Computing and Internet of Things (ICGCIoT), pp. 253–258, October
2015.
[4] M. Wozniak, B. Cyganek, M. Grana, B. Krawczyk, A. Kasprzak, P. Porwik, and K. Walkowiak,
“A survey of big data issues in electronic health record analysis,” Applied Artificial Intelligence,
vol. 30, pp. 497–520, 07 2016.
[5] H.-J. Kong, “Managing unstructured big data in healthcare system,” Healthcare Informatics Re-
search, vol. 25, p. 1, 01 2019.
[6] C. Kruse, R. Goswamy, Y. Raval, and S. Marawi, “Challenges and opportunities of big data in
health care: A systematic review,” JMIR Medical Informatics, vol. 4, p. e38, 11 2016.
[7] S. Meystre, G. Savova, K. Kipper-Schuler, and J. Hurdle, “Extracting information from textual doc-
uments in the electronic health record: A review of recent research,” Yearb Med Inform, pp. 128–
144, 11 2007.
[8] C. Martin-Gill and R. C. Reiser, “Risk factors for 72-hour admission to the ed,” The American
Journal of Emergency Medicine, vol. 22, no. 6, pp. 448 – 453, 2004.
[9] C.-L. Wu, F.-T. Wang, Y.-C. Chiang, Y.-F. Chiu, T.-G. Lin, L.-F. Fu, and T.-L. Tsai, “Unplanned
emergency department revisits within 72 hours to a secondary teaching referral hospital in taiwan,”
The Journal of emergency medicine, vol. 38, pp. 512–7, 11 2008.
[10] S. Verelst, S. Pierloot, D. Desruelles, J.-B. Gillet, and J. Bergs, “Short-term unscheduled return
visits of adult patients to the emergency department,” The Journal of Emergency Medicine, vol. 47,
pp. 131–139, 08 2014.
[11] S.-Y. Cheng, H.-T. Wang, C.-W. Lee, T.-C. Tsai, C.-W. Hung, and K.-H. Wu, “The characteristics
and prognostic predictors of unplanned hospital admission within 72 hours after ed discharge,”
The American journal of emergency medicine, vol. 31, 09 2013.
[12] J. A. Gordon, L. C. An, R. A. Hayward, and B. C. Williams, “Initial emergency department diagnosis
and return visits: Risk versus perception,” Annals of Emergency Medicine, vol. 32, no. 5, pp. 569
– 573, 1998.
[13] B. Graham, R. Bond, M. Quinn, and M. Mulvenna, “Using data mining to predict hospital admis-
sions from the emergency department,” IEEE Access, vol. 6, pp. 10458–10469, 2018.
[14] S.-C. Hu, “Analysis of patient revisits to the emergency department,” The American Journal of
Emergency Medicine, vol. 10, no. 4, pp. 366 – 370, 1992.
[15] E. B. Kulstad, R. Sikka, R. T. Sweis, K. M. Kelley, and K. H. Rzechula, “Ed overcrowding is as-
sociated with an increased frequency of medication errors,” The American Journal of Emergency
Medicine, vol. 28, no. 3, pp. 304 – 309, 2010.
[16] C. van Walraven, I. A Dhalla, C. Bell, E. Etchells, I. G Stiell, K. Zarnke, P. Austin, and A. J Forster,
“Derivation and validation of an index to predict early death or unplanned readmission after dis-
charge from hospital to the community,” CMAJ : Canadian Medical Association journal = journal
de l’Association medicale canadienne, vol. 182, pp. 551–7, 03 2010.
[17] F. Ferreira, “Serial evaluation of the sofa score to predict outcome in critically ill patients,” JAMA,
vol. 286, p. 1754, 10 2001.
[18] Z. Obermeyer and E. J. Emanuel, “Predicting the future — big data, machine learning, and clinical
medicine,” The New England journal of medicine, vol. 375, pp. 1216–1219, 09 2016.
[19] E. K. Lee, F. Yuan, D. A. Hirsh, M. D. Mallory, and H. K. Simon, “A clinical decision tool for predict-
ing patient care characteristics: patients returning within 72 hours in the emergency department,”
AMIA Annu Symp Proc, vol. 2012, pp. 495–504, 2012.
[20] D. Hooijenga, R. Phan, V. Augusto, X. Xie, and A. Redjaline, “Discriminant analysis and feature
selection for emergency department readmission prediction,” pp. 836–842, 11 2018.
[21] F. Meng, K. L. Teow, K. Wee Sheng Teo, C. Kheong Ooi, and S.-Y. Tay, “Predicting 72-hour reatten-
dance in emergency departments using discriminant analysis via mixed integer programming with
electronic medical records,” Journal of Industrial & Management Optimization, vol. 15, pp. 947–
962, 04 2019.
[22] G. Pellerin, K. Gao, and L. Kaminsky, “Predicting 72-hour emergency department revisits,” The
American Journal of Emergency Medicine, vol. 36, no. 3, pp. 420 – 424, 2018.
[23] W. S. Hong, A. D. Haimovich, and R. A. Taylor, “Predicting hospital admission at emergency
department triage using machine learning,” in PloS one, 2018.
[24] O. M. Araz, D. Olson, and A. Ramirez-Nafarrate, “Predictive analytics for hospital admissions
from the emergency department using triage information,” International Journal of Production Eco-
nomics, vol. 208, pp. 199 – 207, 2019.
[25] X. Zhang, J. Kim, R. Patzer, S. Pitts, A. Patzer, and J. Schrager, “Prediction of emergency depart-
ment hospital admission based on natural language processing and neural networks*,” Methods
of Information in Medicine, vol. 56, 08 2017.
[26] F. Lucini, F. Fogliatto, G. da Silveira, J. Neyeloff, M. Anzanello, R. Kuchenbecker, and B. D. Schaan,
“Text mining approach to predict hospital admissions using early medical records from the emer-
gency department,” International Journal of Medical Informatics, vol. 100, 01 2017.
[27] D. Teubner, J. Considine, P. Hakendorf, S. Kim, and A. D Bersten, “Model to predict inpatient
mortality from information gathered at presentation to an emergency department: The triage in-
formation mortality model (timm),” Emergency medicine Australasia : EMA, vol. 27, 07 2015.
[28] R. A. Taylor, J. R. Pare, A. K. Venkatesh, H. Mowafi, E. R. Melnick, W. Fleischman, and M. K. Hall,
“Prediction of in-hospital mortality in emergency department patients with sepsis: A local big data-
driven, machine learning approach,” Academic Emergency Medicine, vol. 23, no. 3, pp. 269–278,
2016.
[29] W. Chapman, J. Dowling, and M. M. Wagner, “Classification of emergency department chief
complaints into 7 syndromes: A retrospective analysis of 527,228 patients,” Annals of emergency
medicine, vol. 46, pp. 445–55, 12 2005.
[30] L. Christensen, P. Haug, and M. Fiszman, “Mplus: a probabilistic medical language understanding
system,” in Proceedings of the ACL-02 Workshop on Natural Language Processing in the
Biomedical Domain, (Philadelphia, Pennsylvania, USA), pp. 29–36, Association for Computational
Linguistics, July 2002.
[31] D. Thompson, D. Eitel, C. Fernandes, J. Pines, J. Amsterdam, and S. J. Davidson, “Coded chief
complaints-automated analysis of free-text complaints,” Academic emergency medicine : official
journal of the Society for Academic Emergency Medicine, vol. 13, pp. 774–82, 08 2006.
[32] Y. Jernite and Y. Halpern, “Predicting chief complaints at triage time in the emergency department,”
2013.
[33] W. Chapman, W. Bridewell, P. Hanbury, G. F. Cooper, and B. Buchanan, “A simple algorithm for
identifying negated findings and diseases in discharge summaries,” Journal of Biomedical Infor-
matics, vol. 34, pp. 301–310, 11 2001.
[34] D. A. Travers and S. W. Haas, “Using nurses’ natural language entries to build a concept-oriented
terminology for patients’ chief complaints in the emergency department,” Journal of Biomedical
Informatics, vol. 36, no. 4, pp. 260 – 270, 2003. Building Nursing Knowledge through Informatics:
From Concept Representation to Data Mining.
[35] O. Bodenreider, “The Unified Medical Language System (UMLS): integrating biomedical terminol-
ogy,” Nucleic Acids Research, vol. 32, pp. D267–D270, 01 2004.
[36] J. McCarthy and E. A. Feigenbaum, “In memoriam: Arthur Samuel - pioneer in machine learning,”
AI Magazine, vol. 11, no. 3, pp. 10–11, 1990.
[37] A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM Journal of
Research and Development, vol. 3, pp. 210–229, July 1959.
[38] T. M. Mitchell, Machine learning. McGraw Hill series in computer science, McGraw-Hill, 1997.
[39] M. Khanam, T. Mahboob, W. Imtiaz, H. Abdul Ghafoor, and R. Sehar, “A survey on unsupervised
machine learning algorithms for automation, classification and maintenance,” International Journal
of Computer Applications, vol. 119, pp. 34–39, 06 2015.
[40] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. P. Patel, A. Tiwari, M. J. Er, W. Ding, and C.-T. Lin,
“A review of clustering techniques and developments,” Neurocomputing, vol. 267, pp. 664 – 681,
2017.
[41] J. Uthayakumar, T. Vengattaraman, and P. Dhavachelvan, “A survey on data compression tech-
niques: From the perspective of data quality, coding schemes, data type and applications,” Journal
of King Saud University - Computer and Information Sciences, 2018.
[42] S. Agrawal and J. Agrawal, “Survey on anomaly detection using data mining techniques,” Procedia
Computer Science, vol. 60, pp. 708 – 713, 2015. Knowledge-Based and Intelligent Information &
Engineering Systems 19th Annual Conference, KES-2015, Singapore, September 2015 Proceed-
ings.
[43] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” CoRR,
vol. cs.AI/9605103, 1996.
[44] R. J. Mcfarlane, “A survey of exploration strategies in reinforcement learning,” 2003.
[45] A. Y. Ng and M. I. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic
regression and naive bayes,” in Advances in Neural Information Processing Systems 14 (T. G.
Dietterich, S. Becker, and Z. Ghahramani, eds.), pp. 841–848, MIT Press, 2002.
[46] V. Vapnik, “Principles of risk minimization for learning theory,” in Advances in Neural Information
Processing Systems 4 (J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds.), pp. 831–838,
Morgan-Kaufmann, 1992.
[47] H. Zhang, “The optimality of naive bayes,” vol. 2, 01 2004.
[48] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. New York, NY,
USA: Cambridge University Press, 2008.
[49] G. H. John and P. Langley, “Estimating continuous distributions in bayesian classifiers,” CoRR,
vol. abs/1302.4964, 2013.
[50] J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger, “Tackling the poor assumptions of naive
bayes text classifiers,” in Proceedings of the Twentieth International Conference on
Machine Learning, ICML’03, pp. 616–623, AAAI Press, 2003.
[51] B. Boser, I. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifier,” Proceed-
ings of the Fifth Annual ACM Workshop on Computational Learning Theory, vol. 5, 08 1996.
[52] V. Vapnik, “Pattern recognition using generalized portrait method,” Automation and Remote
Control, vol. 24, pp. 774–780, 1963.
[53] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in
Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, (New
York, NY, USA), pp. 144–152, ACM, 1992.
[54] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, pp. 273–297, Sept.
1995.
[55] T. Wen and A. Edelman, “Support vector machine lagrange multipliers and simplex volume de-
compositions,” 10 2000.
[56] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “Liblinear: A library for large linear
classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, June 2008.
[57] S. Pan, J. Wu, X. Zhu, G. Long, and C. Zhang, “Finding the best not the most: regularized loss
minimization subgraph selection for graph classification,” Pattern Recognition, vol. 48, no. 11,
pp. 3783 – 3796, 2015.
[58] D. Kansagara, H. Englander, A. Salanitro, D. Kagen, C. Theobald, M. Freeman, and S. Kripalani,
“Risk prediction models for hospital readmission a systematic review,” JAMA : the journal of the
American Medical Association, vol. 306, pp. 1688–98, 10 2011.
[59] E. Wallace, E. Stuart, N. Vaughan, K. Bennett, T. Fahey, and S. Smith, “Risk prediction models to
predict emergency hospital admission in community-dwelling adults a systematic review,” Medical
care, vol. 52, pp. 751–65, 08 2014.
[60] A. Turkman, “Statistical intervals: A guide for practitioners and researchers, second edition, by
william q. meeker, gerald j. hahn, and louis a. escobar. wiley series in probability and statistics,
published by john wiley & sons, 2017. total number of pages: 35+592. isbn: 978-0-4716-8717-7,”
Journal of Time Series Analysis, vol. 39, 02 2018.
[61] P. R. Cohen, Empirical Methods for Artificial Intelligence. Cambridge, MA, USA: MIT Press, 1995.
[62] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. No. 57 in Monographs on Statistics
and Applied Probability, Boca Raton, Florida, USA: Chapman & Hall/CRC, 1993.
[63] J. H. Zar, Biostatistical analysis. Upper Saddle River, N.J. : Prentice-Hall, 1999.
[64] B. Efron, “Nonparametric standard errors and confidence intervals,” The Canadian Journal of
Statistics / La Revue Canadienne de Statistique, vol. 9, no. 2, pp. 139–158, 1981.
[65] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning
algorithms,” Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[66] C. W. Morris, Writings on the General Theory of Signs. The Hague: Mouton, 1971.
[67] V. Raskin, “Linguistics and natural language processing,” in Machine Translation: Theoretical and
Methodological Issues, pp. 42–58, University Press, 1987.
[68] R. M. Kempson and A. Cormack, “Ambiguity and quantification,” Linguistics and Philosophy, vol. 4,
pp. 259–309, Jun 1981.
[69] R. Kittredge and J. Lehrberger, Sublanguage: Studies of language in restricted semantic domains.
04 2015.
[70] R. Grishman, “Adaptive information extraction and sublanguage analysis,” 2001.
[71] S. Wolff, “The use of morphosemantic regularities in the medical vocabulary for automatic lexical
coding,” Methods of information in medicine, vol. 23, pp. 195–203, 11 1984.
[72] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural
language processing [review article],” IEEE Computational Intelligence Magazine, vol. 13, pp. 55–
75, 08 2018.
[73] R. Nallapati, B. Zhou, C. N. dos Santos, Caglar Gulcehre, and B. Xiang, “Abstractive text summa-
rization using sequence-to-sequence rnns and beyond,” in CoNLL, 2016.
[74] J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, vol. 349,
no. 6245, pp. 261–266, 2015.
[75] M. A. Attia, “Arabic tokenization system,” in Proceedings of the 2007 Workshop on Computational
Approaches to Semitic Languages: Common Issues and Resources, Semitic ’07, (Stroudsburg,
PA, USA), pp. 65–72, Association for Computational Linguistics, 2007.
[76] J. J. Webster and C. Kit, “Tokenization as the initial phase in nlp,” in Proceedings of the 14th Con-
ference on Computational Linguistics - Volume 4, COLING ’92, (Stroudsburg, PA, USA), pp. 1106–
1110, Association for Computational Linguistics, 1992.
[77] B. Allison, D. Guthrie, and L. Guthrie, “Another look at the data sparsity problem,” in Text, Speech
and Dialogue (P. Sojka, I. Kopecek, and K. Pala, eds.), (Berlin, Heidelberg), pp. 327–334, Springer
Berlin Heidelberg, 2006.
[78] N. Barrett and J. Weber, “Building a biomedical tokenizer using the token lattice design pattern
and the adapted viterbi algorithm,” BMC bioinformatics, vol. 12 Suppl 3, p. S1, 06 2011.
[79] J. Grana, M. A. Alonso, and M. Vilares, “A common solution for tokenization and part-of-speech
tagging,” in Text, Speech and Dialogue (P. Sojka, I. Kopecek, and K. Pala, eds.), (Berlin, Heidel-
berg), pp. 3–10, Springer Berlin Heidelberg, 2002.
[80] B. Jurish and K.-M. Wurzner, “Word and sentence tokenization with hidden markov models,” JLCL,
vol. 28, pp. 61–83, 01 2013.
[81] F. A. Shamsi and A. Guessoum, “A hidden markov model -based pos tagger for arabic,” 2006.
[82] T. Mizumoto and R. Nagata, “Analyzing the impact of spelling errors on pos-tagging and chunking
in learner english,” in NLP-TEA@IJCNLP, 2017.
[83] L. La, Q. Guo, D. Yang, and Q. Cao, “Improved viterbi algorithm-based hmm2 for chinese words
segmentation,” in 2012 International Conference on Computer Science and Electronics Engineer-
ing, vol. 1, pp. 266–269, March 2012.
[84] K. Min, W. Wilson, and Y.-J. Moon, “Typographical and orthographical spelling error correction,”
04 2019.
[85] L. Boytsov, “Indexing methods for approximate dictionary searching: Comparative analysis,” J.
Exp. Algorithmics, vol. 16, pp. 1.1:1.1–1.1:1.91, May 2011.
[86] J. P. Carvalho and L. Coheur, “Introducing uws - a fuzzy based word similarity function with good
discrimination capability: Preliminary results,” in 2013 IEEE International Conference on Fuzzy
Systems (FUZZ-IEEE), pp. 1–8, July 2013.
[87] W. Winkler, “String comparator metrics and enhanced decision rules in the fellegi-sunter model of
record linkage,” Proceedings of the Section on Survey Research Methods, 01 1990.
[88] E. H. Porter and W. E. Winkler, “Approximate string comparison
and its effect on an advanced record linkage system,” in Advanced Record Linkage System. U.S.
Bureau of the Census, Research Report, pp. 190–199, 1997.
[89] Y. Wang, J. Qin, and W. Wang, “Efficient approximate entity matching using jaro-winkler distance,”
in Web Information Systems Engineering – WISE 2017 (A. Bouguettaya, Y. Gao, A. Klimenko,
L. Chen, X. Zhang, F. Dzerzhinskiy, W. Jia, S. V. Klimenko, and Q. Li, eds.), (Cham), pp. 231–239,
Springer International Publishing, 2017.
[90] P. Christen, “A comparison of personal name matching: Techniques and practical issues,” in The
Second International Workshop on Mining Complex Data (MCD’06), 12 2006.
[91] K. Dreßler and A.-C. Ngonga Ngomo, “On the efficient execution of bounded jaro-winkler dis-
tances,” 09 2015.
[92] M. Madkour, D. Benhaddou, and C. Tao, “Temporal data representation, normalization, extraction,
and reasoning: A review from clinical domain,” Computer Methods and Programs in Biomedicine,
vol. 128, pp. 52 – 68, 2016.
[93] V. M. Orengo and C. Huyck, “A stemming algorithm for the portuguese language,” in Proceedings
Eighth Symposium on String Processing and Information Retrieval, pp. 186–193, Nov 2001.
[94] A. Kulmizev, B. Blankers, J. Bjerva, M. Nissim, G. van Noord, B. Plank, and M. Wieling, “The
power of character n-grams in native language identification,” in The 12th Workshop on Innovative
Use of NLP for Building Educational Applications, pp. 382–389, Association for Computational
Linguistics (ACL), 2017.
[95] B. Gencosman, H. Ozmutlu, and S. Ozmutlu, “Character n-gram application for automatic new
topic identification,” Information Processing & Management, vol. 50, p. 821–856, 11 2014.
[96] A. M. Robertson and P. Willett, “Applications of n-grams in textual information systems,” Journal of
Documentation, vol. 54, pp. 48–67, 01 1998.
[97] MedlinePlus. https://medlineplus.gov/vitalsigns.html. [Online; accessed 06-May-2018].
[98] Medscape. https://emedicine.medscape.com/article/2172054-overview. [Online; accessed
06-May-2018].
[99] V. Salmasi, K. Maheshwari, D. Yang, E. J. Mascha, A. Singh, D. I. Sessler, and A. Kurz, “Relation-
ship between intraoperative hypotension, defined by either reduction from baseline or absolute
thresholds, and acute kidney and myocardial injury after noncardiac surgery a retrospective co-
hort analysis,” Anesthesiology: The Journal of the American Society of Anesthesiologists, vol. 126,
no. 1, pp. 47–65, 2017.
[100] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn.
Res., vol. 13, pp. 281–305, Feb. 2012.
[101] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “Liblinear: A library for large linear
classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, June 2008.
[102] W. Youden, “Index for rating diagnostic tests,” Cancer, vol. 3, pp. 32–35, January 1950.
[103] M. Knapman and A. Bonner, “Overcrowding in medium-volume emergency
departments: Effects of aged patients in emergency departments on wait times for non-emergent
triage-level patients,” International Journal of Nursing Practice, vol. 16, pp. 310 – 317, 06 2010.
[104] A. Guttmann, M. J. Schull, M. J. Vermeulen, and T. A. Stukel, “Association between waiting times
and short term mortality and hospital admission after departure from emergency department:
population based cohort study from ontario, canada,” BMJ, vol. 342, 2011.
[105] S. Vilpert, S. Monod, H. J. Ruedin, J. Maurer, L. Trueb, B. Yersin, and C. J. Bula, “Differences
in triage category, priority level and hospitalization rate between young-old and old-old patients
visiting the emergency department,” in BMC health services research, 2018.
[106] T. Mikolov, G. Corrado, K. Chen, and J. Dean, “Efficient estimation of word representations in
vector space,” pp. 1–12, 01 2013.
[107] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” CoRR,
vol. abs/1405.4053, 2014.
[108] N. Homem and J. P. Carvalho, “Authorship identification and author fuzzy “fingerprints”,” in 2011
Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1–6, March
2011.
Appendix A
Results
A.1 Baseline Results
Table A.1: Results for the machine learning models in test using baseline numerical features with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|
| LR | C = 0.01; Regularization: L2 | 0.609 [0.603, 0.615] | 0.544 [0.537, 0.572] | 0.615 [0.607, 0.659] | 0.146 [0.141, 0.152] | 0.0837 [0.0801, 0.0884] | 0.0517 [0.0475, 0.0603] |
| LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.001; Regularization: L2; Number of Iterations: 10^7 | 0.609 [0.603, 0.614] | 0.567 [0.545, 0.578] | 0.63 [0.62, 0.659] | 0.146 [0.142, 0.15] | 0.0838 [0.081, 0.0866] | 0.0524 [0.0486, 0.0581] |
| MNB | α = 0.001 | 0.583 [0.574, 0.588] | 0.588 [0.574, 0.592] | 0.546 [0.502, 0.551] | 0.129 [0.125, 0.131] | 0.072 [0.0701, 0.0736] | 0.0305 [0.0278, 0.0327] |
| CNB | α = 0.001 | 0.583 [0.574, 0.588] | 0.588 [0.574, 0.592] | 0.546 [0.502, 0.551] | 0.129 [0.125, 0.131] | 0.072 [0.0701, 0.0736] | 0.0305 [0.0278, 0.0327] |
| SVM | C = 0.01; Regularization: L2 | 0.608 [0.603, 0.614] | 0.574 [0.546, 0.587] | 0.617 [0.607, 0.648] | 0.144 [0.14, 0.15] | 0.0825 [0.0799, 0.087] | 0.0502 [0.0468, 0.0572] |
| SVM SGD | α = 0.01; Adaptive Learning Rate; η0 = 0.1; Regularization: L2; Number of Iterations: 10^7; Number iterations no change: 10 | 0.522 [0.515, 0.529] | 0.552 [0.541, 0.562] | 0.479 [0.476, 0.482] | 0.108 [0.105, 0.11] | 0.0597 [0.058, 0.0613] | 0.00634 [0.00405, 0.00838] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic
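The bracketed values in Table A.1 are 95% bootstrap confidence intervals. As an illustration only, the following Python sketch computes a percentile bootstrap interval for a binary classification metric. The percentile method, the number of resamples, the toy labels, and the function names (`confusion_counts`, `sensitivity`, `cohen_kappa`, `bootstrap_ci`) are assumptions made for this example and are not taken from the thesis, which may use a different bootstrap variant.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    """True positive rate; returns 0.0 if the sample has no positives."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_obs - p_exp) / (1.0 - p_exp) if p_exp != 1.0 else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap: resample cases with replacement, recompute the
    metric, and read the interval off the sorted resampled values."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(round((1.0 - level) / 2.0 * n_resamples))]
    hi = stats[int(round((1.0 + level) / 2.0 * n_resamples)) - 1]
    return metric(y_true, y_pred), lo, hi


# Toy data (hypothetical): 50 revisits and 50 non-revisits.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 45 + [1] * 5
point, lo, hi = bootstrap_ci(y_true, y_pred, sensitivity)
print(f"sensitivity = {point:.3f} [{lo:.3f}, {hi:.3f}]")
```

Any of the tabulated metrics can be passed as `metric`. Ranking metrics such as AUC-ROC would need predicted scores instead of hard labels.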
A.2 All Numerical Results
Table A.2: Results for the machine learning models in test using all of the numerical features with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|
| LR | C = 0.01; Regularization: L2 | 0.782 [0.778, 0.785] | 0.731 [0.72, 0.742] | 0.674 [0.663, 0.678] | 0.204 [0.199, 0.207] | 0.118 [0.115, 0.121] | 0.118 [0.114, 0.121] |
| LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.0001; Regularization: L1; Number of Iterations: 10^7 | 0.779 [0.775, 0.782] | 0.733 [0.72, 0.757] | 0.664 [0.647, 0.675] | 0.200 [0.195, 0.203] | 0.116 [0.112, 0.118] | 0.113 [0.108, 0.117] |
| MNB | α = 0.001 | 0.727 [0.723, 0.731] | 0.704 [0.678, 0.716] | 0.605 [0.601, 0.625] | 0.170 [0.166, 0.174] | 0.0969 [0.094, 0.0995] | 0.0780 [0.0754, 0.0822] |
| CNB | α = 0.001 | 0.727 [0.723, 0.731] | 0.704 [0.678, 0.716] | 0.605 [0.601, 0.625] | 0.170 [0.166, 0.174] | 0.0969 [0.094, 0.0995] | 0.0786 [0.0754, 0.0822] |
| SVM | C = 0.0001; Regularization: L2 | 0.782 [0.778, 0.785] | 0.734 [0.723, 0.768] | 0.672 [0.644, 0.678] | 0.203 [0.197, 0.207] | 0.118 [0.114, 0.121] | 0.117 [0.111, 0.121] |
| SVM SGD | α = 10^-5; Constant Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 10^7; Number iterations no change: 10 | 0.763 [0.759, 0.768] | 0.696 [0.68, 0.721] | 0.683 [0.659, 0.701] | 0.2 [0.194, 0.204] | 0.117 [0.112, 0.12] | 0.114 [0.107, 0.119] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic
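The low Precision and F1-Score values in Table A.2, despite sensitivity and specificity around 0.7, follow largely from the low revisit rate. The sketch below derives precision and F1 from sensitivity, specificity, and prevalence using Bayes' rule. It plugs in the LR row of Table A.2 together with the 5.7% average revisit rate quoted in the abstract; using that average as the test-set prevalence is an assumption, and the function names are hypothetical.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity and
    the fraction of positive cases (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def f1_from_rates(sensitivity, specificity, prevalence):
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    precision = precision_from_rates(sensitivity, specificity, prevalence)
    return 2.0 * precision * sensitivity / (precision + sensitivity)


# LR row of Table A.2; prevalence assumed equal to the 5.7% average revisit rate.
p = precision_from_rates(0.731, 0.674, 0.057)
f1 = f1_from_rates(0.731, 0.674, 0.057)
print(round(p, 3), round(f1, 3))  # roughly 0.119 and 0.205
```

These values are close to the tabulated 0.118 and 0.204, which suggests the small precision is mostly a consequence of class imbalance and not of a poorly separating model.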
A.3 Textual Results
Table A.3: Results for the machine learning models in test using only the main chief complaint with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Text Representation | Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|---|
| Bag of N-Grams | LR | C = 0.01; Regularization: L2; N-Gram Range: (1, 2); max features: All Vocabulary | 0.786 [0.78, 0.791] | 0.698 [0.675, 0.715] | 0.725 [0.708, 0.749] | 0.223 [0.215, 0.235] | 0.133 [0.127, 0.141] | 0.141 [0.132, 0.155] |
| Bag of N-Grams | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.0001; Regularization: L1; Number of Iterations: 10^7; N-Gram Range: (1, 3); max features: All Vocabulary | 0.783 [0.777, 0.788] | 0.685 [0.668, 0.701] | 0.739 [0.722, 0.754] | 0.227 [0.22, 0.236] | 0.136 [0.131, 0.143] | 0.147 [0.138, 0.156] |
| Bag of N-Grams | MNB | α = 0.5; N-Gram Range: (1, 2); max features: 29000 | 0.762 [0.756, 0.767] | 0.673 [0.661, 0.684] | 0.735 [0.724, 0.739] | 0.221 [0.214, 0.225] | 0.132 [0.127, 0.135] | 0.139 [0.133, 0.143] |
| Bag of N-Grams | CNB | α = 0.5; N-Gram Range: (1, 2); max features: 25000 | 0.764 [0.758, 0.769] | 0.687 [0.673, 0.702] | 0.713 [0.703, 0.729] | 0.213 [0.207, 0.22] | 0.126 [0.122, 0.131] | 0.13 [0.124, 0.137] |
| Bag of N-Grams | SVM | C = 0.01; Regularization: L2; N-Gram Range: (1, 2); max features: All Vocabulary | 0.786 [0.78, 0.791] | 0.692 [0.678, 0.709] | 0.73 [0.717, 0.744] | 0.225 [0.218, 0.232] | 0.134 [0.129, 0.14] | 0.143 [0.136, 0.152] |
| Bag of N-Grams | SVM SGD | α = 10^-5; Constant Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 10^7; Number iterations no change: 10; N-Gram Range: (1, 3); max features: All Vocabulary | 0.727 [0.719, 0.733] | 0.63 [0.605, 0.645] | 0.701 [0.685, 0.743] | 0.192 [0.186, 0.206] | 0.113 [0.109, 0.124] | 0.106 [0.0996, 0.123] |
| TF-IDF | LR | C = 0.01; Regularization: L2; N-Gram Range: (1, 2); max features: 29000 | 0.789 [0.783, 0.794] | 0.688 [0.677, 0.703] | 0.741 [0.729, 0.75] | 0.23 [0.223, 0.235] | 0.138 [0.133, 0.142] | 0.149 [0.142, 0.155] |
| TF-IDF | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.0001; Regularization: L1; Number of Iterations: 10^7; N-Gram Range: (1, 2); max features: All Vocabulary | 0.789 [0.783, 0.794] | 0.69 [0.659, 0.712] | 0.733 [0.718, 0.771] | 0.228 [0.218, 0.244] | 0.136 [0.13, 0.149] | 0.147 [0.137, 0.166] |
| TF-IDF | MNB | α = 0.25; N-Gram Range: (1, 2); max features: 29000 | 0.774 [0.768, 0.779] | 0.693 [0.67, 0.704] | 0.715 [0.711, 0.745] | 0.216 [0.211, 0.227] | 0.128 [0.124, 0.136] | 0.133 [0.128, 0.147] |
| TF-IDF | CNB | α = 0.25; N-Gram Range: (1, 2); max features: 29000 | 0.774 [0.768, 0.779] | 0.69 [0.667, 0.703] | 0.722 [0.706, 0.743] | 0.218 [0.211, 0.227] | 0.129 [0.124, 0.137] | 0.135 [0.128, 0.146] |
| TF-IDF | SVM | C = 0.0001; Regularization: L2; N-Gram Range: (1, 2); max features: 25000 | 0.788 [0.782, 0.794] | 0.686 [0.668, 0.702] | 0.748 [0.73, 0.762] | 0.231 [0.222, 0.241] | 0.139 [0.132, 0.147] | 0.151 [0.142, 0.163] |
| TF-IDF | SVM SGD | α = 10^-5; Constant Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 10^7; Number iterations no change: 10; N-Gram Range: (1, 3); max features: All Vocabulary | 0.742 [0.735, 0.748] | 0.655 [0.641, 0.669] | 0.689 [0.674, 0.695] | 0.191 [0.185, 0.196] | 0.111 [0.108, 0.115] | 0.104 [0.0987, 0.109] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic; TF-IDF - Term Frequency-Inverse Document Frequency
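To make the "N-Gram Range" and "max features" hyper-parameters of Table A.3 concrete, the following pure-Python sketch builds a bag of word n-grams and a TF-IDF weighting for a few toy complaints. It is a simplified illustration, not the thesis implementation: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, the English example texts, and all function names are assumptions.

```python
import math
from collections import Counter


def word_ngrams(text, ngram_range=(1, 2)):
    """All word n-grams of the text with n in [lo, hi], e.g. (1, 2) gives
    unigrams and bigrams."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, ngram_range=(1, 2), max_features=None):
    """Map each kept n-gram to a column index. With max_features set, only
    the most frequent n-grams over the corpus are kept."""
    totals = Counter()
    for doc in docs:
        totals.update(word_ngrams(doc, ngram_range))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_features is not None:
        ranked = ranked[:max_features]
    terms = sorted(term for term, _ in ranked)
    return {term: j for j, term in enumerate(terms)}


def bag_of_ngrams(docs, vocab, ngram_range=(1, 2)):
    """Sparse count vectors: one {column index: count} dict per document."""
    rows = []
    for doc in docs:
        counts = Counter(g for g in word_ngrams(doc, ngram_range) if g in vocab)
        rows.append({vocab[g]: c for g, c in counts.items()})
    return rows


def tfidf(rows):
    """Reweight counts by a smoothed inverse document frequency (assumed
    variant) and scale each document vector to unit L2 norm."""
    n_docs = len(rows)
    df = Counter(j for row in rows for j in row)
    idf = {j: math.log((1 + n_docs) / (1 + df[j])) + 1.0 for j in df}
    out = []
    for row in rows:
        weighted = {j: c * idf[j] for j, c in row.items()}
        norm = math.sqrt(sum(v * v for v in weighted.values())) or 1.0
        out.append({j: v / norm for j, v in weighted.items()})
    return out


# Hypothetical chief complaints, for illustration only.
docs = ["chest pain", "abdominal pain", "chest pain and fever"]
vocab = build_vocabulary(docs, ngram_range=(1, 2))
counts = bag_of_ngrams(docs, vocab, ngram_range=(1, 2))
weights = tfidf(counts)
print(len(vocab), counts[0], weights[0])
```

Setting `max_features` trades vocabulary coverage for a smaller feature matrix, which is the role the 25000 and 29000 limits play in the table; "All Vocabulary" corresponds to `max_features=None` in this sketch.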
A.4 Textual and Baseline Results
Table A.4: Results for the machine learning models in test using the main chief complaint and baseline variables with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Text Representation | Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|---|
| Bag of N-Grams | LR | C = 0.01; Regularization: L2; N-Gram Range: (1, 2); max features: All Vocabulary | 0.789 [0.783, 0.795] | 0.684 [0.673, 0.715] | 0.747 [0.717, 0.754] | 0.232 [0.218, 0.238] | 0.139 [0.129, 0.144] | 0.152 [0.136, 0.159] |
| Bag of N-Grams | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.5; Regularization: L2; Number of Iterations: 10^7; N-Gram Range: (1, 3); max features: 9500 | 0.783 [0.777, 0.788] | 0.69 [0.671, 0.705] | 0.725 [0.711, 0.751] | 0.22 [0.213, 0.232] | 0.131 [0.126, 0.14] | 0.138 [0.13, 0.152] |
| Bag of N-Grams | MNB | α = 0.5; N-Gram Range: (1, 2); max features: 29 000 | 0.764 [0.758, 0.768] | 0.675 [0.656, 0.697] | 0.73 [0.708, 0.756] | 0.219 [0.21, 0.231] | 0.13 [0.124, 0.14] | 0.137 [0.127, 0.152] |
| Bag of N-Grams | CNB | α = 0.5; N-Gram Range: (1, 2); max features: 25 000 | 0.764 [0.757, 0.768] | 0.684 [0.662, 0.7] | 0.714 [0.696, 0.743] | 0.212 [0.205, 0.222] | 0.125 [0.121, 0.141] | 0.129 [0.118, 0.141] |
| Bag of N-Grams | SVM | C = 0.001; Regularization: L2; N-Gram Range: (1, 2); max features: All Vocabulary | 0.789 [0.783, 0.794] | 0.696 [0.675, 0.709] | 0.732 [0.718, 0.754] | 0.226 [0.22, 0.236] | 0.135 [0.131, 0.143] | 0.145 [0.138, 0.158] |
| Bag of N-Grams | SVM SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.01; Regularization: L2; Number of Iterations: 10^7; Number iterations no change: 10; N-Gram Range: (1, 3); max features: All Vocabulary | 0.733 [0.725, 0.74] | 0.625 [0.612, 0.645] | 0.722 [0.696, 0.742] | 0.2 [0.191, 0.208] | 0.119 [0.113, 0.125] | 0.116 [0.105, 0.126] |
| TF-IDF | LR | C = 0.1; Regularization: L2; N-Gram Range: (1, 2); max features: 29 000 | 0.791 [0.785, 0.797] | 0.686 [0.668, 0.7] | 0.746 [0.727, 0.761] | 0.232 [0.223, 0.242] | 0.139 [0.133, 0.147] | 0.152 [0.142, 0.163] |
| TF-IDF | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.1; Regularization: L1; Number of Iterations: 10^7; N-Gram Range: (1, 2); max features: All Vocabulary | 0.789 [0.784, 0.795] | 0.687 [0.663, 0.717] | 0.741 [0.713, 0.775] | 0.229 [0.218, 0.244] | 0.137 [0.129, 0.15] | 0.148 [0.135, 0.167] |
| TF-IDF | MNB | α = 0.1; N-Gram Range: (1, 2); max features: 29 000 | 0.772 [0.766, 0.777] | 0.685 [0.668, 0.701] | 0.718 [0.702, 0.736] | 0.214 [0.208, 0.223] | 0.127 [0.122, 0.133] | 0.132 [0.124, 0.141] |
| TF-IDF | CNB | α = 0.1; N-Gram Range: (1, 2); max features: 25 000 | 0.771 [0.765, 0.776] | 0.683 [0.686, 0.706] | 0.718 [0.692, 0.741] | 0.214 [0.206, 0.224] | 0.127 [0.121, 0.134] | 0.131 [0.121, 0.143] |
| TF-IDF | SVM | C = 0.01; Regularization: L2; N-Gram Range: (1, 2); max features: 29 000 | 0.791 [0.785, 0.796] | 0.692 [0.667, 0.702] | 0.739 [0.736, 0.763] | 0.23 [0.225, 0.242] | 0.138 [0.134, 0.147] | 0.149 [0.145, 0.164] |
| TF-IDF | SVM SGD | α = 0.5; Constant Learning Rate; η0 = 0.1; Regularization: L1; Number of Iterations: 10^7; Number iterations no change: 10; N-Gram Range: (1, 2); max features: All Vocabulary | 0.729 [0.723, 0.735] | 0.625 [0.611, 0.649] | 0.712 [0.687, 0.728] | 0.195 [0.187, 0.202] | 0.115 [0.11, 0.121] | 0.11 [0.101, 0.117] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic; TF-IDF - Term Frequency-Inverse Document Frequency