
Predicting Emergency Department Revisits using Machine Learning and Natural Language Processing

Ruben Belo Matias Heitor Mendes

Thesis to obtain the Master of Science Degree in

Mechanical Engineering

Supervisors: Prof. Susana Margarida da Silva Vieira
Prof. Joao Paulo Baptista de Carvalho

Examination Committee

Chairperson: Prof. Paulo Jorge Coelho Ramalho Oliveira

Supervisor: Prof. Susana Margarida da Silva Vieira

Member of the Committee: Prof. Ricardo Daniel Santos Faro Marques Ribeiro

June 2019

Acknowledgments

I would like to thank first Professor Susana Vieira, for her attentive advice in crucial situations.

To Eng. Marta Fernandes, who guided me through the dissertation by sharing her experience, and for being available to discuss ideas, clarify doubts, and patiently listen to my complaints.

I would also like to express my gratitude to Hospital Beatriz Angelo for providing the data, making this work possible.

Finally, a warm thank you to my family, who are the foundation of my education and values, for supporting me unconditionally. To my friends, for their constant support, honesty, and for motivating me in the most difficult moments.


Abstract

The rate of Emergency Department (ED) revisits is often considered a measure of the quality of care. The aim of this work is to develop a predictive model that identifies the risk of adult patients' ED revisits within 72 hours after discharge. The study data is from Hospital Beatriz Angelo and covers 511301 patients from 2012 to 2016, with an average of 5.7% revisits. The data consists of patient demographics, vital signs, main chief complaints, and other information available at the time of triage.

For text preprocessing, a Natural Language Processing framework is developed. For data modelling, Logistic Regression (LR), Support Vector Machine, Multinomial Naive Bayes, and Complement Naive Bayes techniques are considered. During model development, 10-fold cross-validation is used alongside a cost-sensitive learning approach. The predictive power of the model is measured by the c-statistic.

Five hypotheses regarding features are made. The first hypothesis considers baseline variables, the second hypothesis considers all numerical variables, the third hypothesis uses the main chief complaints, the fourth hypothesis uses the variables of the first and third hypotheses, and the fifth hypothesis considers the variables of the second and third hypotheses.

The best predictive model achieves a c-statistic of 0.842 (95% CI: 0.838-0.846), under the fifth hypothesis using the LR technique. The proposed solution shows that favourable predictive performance can be achieved, indicating a promising way to develop clinical decision support systems to predict patient ED revisits within 72 hours after discharge.

Keywords: Natural Language Processing, Machine Learning, Emergency Department, Triage, Prediction, Revisits.


Resumo

The rate of readmission to the Emergency Department (ED) is a measure of the quality of healthcare. The objective of this work is to develop a predictive model capable of identifying the risk of adult patients' readmissions to the ED up to 72 hours after discharge. The data come from Hospital Beatriz Angelo and cover 511301 patients, from 2012 to 2016, with an average of 5.7% readmissions. The data consist of demographic information, vital signs, main complaints, and other information available at the time of triage.

For text preprocessing, a natural language processing framework is developed. For data modelling, the following strategies were considered: Logistic Regression; Support Vector Machine; Multinomial Naive Bayes; Complement Naive Bayes. For model development, 10-fold cross-validation is used together with a cost-sensitive learning approach. The predictive power of the model is measured using the c-statistic.

Five hypotheses are made regarding the variables. The first hypothesis considers standard variables, the second hypothesis considers all numerical variables, the third hypothesis uses the main complaints, the fourth hypothesis uses the variables of the first and third hypotheses, and the fifth hypothesis considers the variables of the second and third hypotheses.

The best predictive model achieves a c-statistic of 0.842 (95% CI: 0.838-0.846), under the fifth hypothesis, using logistic regression. The proposed solution shows that favourable predictive performance can be achieved, indicating a promising path towards developing clinical decision support systems that predict ED readmissions up to 72 hours after discharge.

Keywords: Natural Language Processing, Machine Learning, Emergency Department, Triage, Prediction, Readmission


Contents

Acknowledgments
Abstract
Resumo
List of Tables
List of Figures
List of Acronyms

1 Introduction
1.1 Motivation
1.2 Predictive Models for Emergency Departments
1.3 Objectives and Contributions
1.4 Thesis Outline

2 Machine Learning
2.1 Classification Algorithms
2.1.1 Logistic Regression
2.1.2 Naive Bayes
2.1.3 Support Vector Machines
2.2 Learnability
2.3 Overfitting and Regularization
2.4 Model Assessment
2.4.1 Performance Metrics
2.4.2 Bootstrap Confidence Interval
2.4.3 McNemar’s Hypothesis test

3 Natural Language Processing
3.1 Regular Expressions
3.2 Tokenization
3.3 Word Similarity
3.4 Text Normalization
3.5 N-grams
3.6 Feature Extraction
3.6.1 Bag of Words
3.6.2 Term Frequency-Inverse Document Frequency

4 Methodology
4.1 Data Preprocessing
4.1.1 Text Normalization
4.1.2 Tokenization
4.1.3 Word Correction
4.1.4 Word Removal
4.2 Data Modeling
4.3 Model Evaluation

5 Results
5.1 Database Description
5.2 Model Comparison

6 Conclusion
6.1 Summary
6.2 Final Remarks
6.3 Future Work

Bibliography

A Results
A.1 Baseline Results
A.2 All Numerical Results
A.3 Textual Results
A.4 Textual and Baseline Results

List of Tables

2.1 Dummy example for the development of a Contingency Table.
2.2 Contingency table derived from the dummy example.

3.1 Regular Expression (RE) metacharacters description.
3.2 Dummy Example for Regular Expressions.
3.3 Special Sequences description.
3.4 Bag of Words representation for the dummy example.
3.5 Term Frequency of each unigram to its corresponding sentence for the dummy example.
3.6 Inverse Document Frequency (idf) of each unigram for the dummy example.
3.7 Term Frequency - Inverse Document Frequency of each unigram to its corresponding sentence for the dummy example.

4.1 Regular Expressions (RE) used to capture different date components when date is given by a day, a month and a year.
4.2 Regular Expressions (RE) used to capture different incomplete date components when date is given by a month and a year.
4.3 Regular Expressions (RE) used to capture different incomplete date components when date is given by a day and a month.
4.4 Outputted normalized date format for each normalization step.
4.5 Regular Expressions (RE) used to match temporal references given by months, weeks, days, hours, minutes and seconds.
4.6 Partition of the day into six hours day periods.
4.7 Hyperparameters for each learning strategy.

5.1 Predictor variables and outcome used for modelling Hospital Beatriz Angelo (HBA) ED data.
5.2 The uncertainty associated with the candidate models under the All Numeric Hypothesis.
5.3 The uncertainty associated with the candidate models under the Textual Hypothesis.
5.4 The uncertainty associated with the candidate models under the Textual and Baseline Hypothesis.
5.5 Comparison of the uncertainty between candidate models with similar error proportions under the Textual and Baseline Hypothesis.
5.6 The uncertainty associated with the candidate models under the Textual and All Numeric Hypothesis.
5.7 McNemar’s Hypothesis Test results with respect to pairs of machine learning models that failed to reject the null hypothesis given a significance level of 5% for the 5 predictor sets.
5.8 Results for the machine learning models in test using the main chief complaint and all of numerical variables with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

6.1 Best results for each hypothesis.

A.1 Results for the machine learning models in test using baseline numerical features with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].
A.2 Results for the machine learning models in test using all of the numerical features with
A.3 Results for the machine learning models in test using only the main chief complaint with
A.4 Results for the machine learning models in test using the main chief complaint and baseline variables with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

List of Figures

2.1 Different types of machine learning techniques.
2.2 Comparison between the hard-margin SVM and the soft-margin SVM for a toy set.
2.3 Empirical illustration of the fitting-stability tradeoff.
2.4 AUC-ROC Curve for a dummy example using a LR model.

3.1 Partition of a general language into specialized languages.
3.2 Flowchart for the RLSP Stemmer. Adapted from [93].

4.1 Overview of the Methodology steps.
4.2 Overview of the Natural Language Processing framework used to improve the quality of the main chief complaint data.
4.3 Normalization sequence steps when dealing with dates.
4.4 Representation of the used pipelines during the hyper-parameter optimization step with respect to the used predictors.
4.5 Representation of the k-fold cross validation approach.
4.6 Representation of the used bootstrapping strategies during validation and testing steps.

5.1 Flow diagram outlining the inclusion and exclusion criteria.
5.2 Univariate Analysis of age groups and gender with respect to the outcome. Age Groups according to [103–105].
5.3 Pearson Correlation between features, described in table 5.1, and the outcome.
5.4 Comparison between the top 20 most frequent N-Grams (N = 1, 2, 3), before stemming, for each label.
5.5 Comparison between the performance of the Machine Learning models according to each of the hypotheses.

List of Acronyms

AI Artificial Intelligence.
AUC Area Under Curve.
BoW Bag of Words.
CD Coordinate Descent.
CNB Complement Naive Bayes.
ED Emergency Department.
FN False Negative.
FP False Positive.
FPR False Positive Rate.
LASSO Least Absolute Shrinkage and Selection Operator.
LR Logistic Regression.
ML Machine Learning.
MNB Multinomial Naive Bayes.
NB Naive Bayes.
NLP Natural Language Processing.
RE Regular Expression.
RLM Regularized Loss Minimization.
ROC Receiver Operating Characteristic.
SGD Stochastic Gradient Descent.
SVM Support Vector Machine.
tf-idf Term Frequency-Inverse Document Frequency.
TN True Negative.
TP True Positive.
TPR True Positive Rate.

Chapter 1

Introduction

Nowadays, in the era of Big Data, the amount of digitalized data that is being produced and stored is growing exponentially [1]. Despite this rate of data production, Big Data has not been used to its full potential in the field of health informatics, mainly due to privacy policies on sharing private patient data, which make the availability of large clinical datasets to researchers an issue [2]. Notwithstanding the privacy issues, Big Data has begun to assume a significant role in supporting healthcare research. It can be exploited for patient care by means of risk stratification and can assist the decision on which medical treatment to administer, which may reduce costs for healthcare facilities and improve patient outcomes [3].

The generalized adoption of electronic health records has generated a massive quantity of data. These datasets comprise heterogeneous data, since they contain quantitative (laboratory results), qualitative (textual notes and demographics), and transactional (billing records) data. Even though this data is very rich in information, it is estimated that 80% of the data present in electronic health records is unstructured [4]. Despite the large proportion of text-based data, it is common for this type of data to remain untapped after being created. This unfortunate practice is mainly due to the difficulty of handling unstructured data, which leads most healthcare facilities to ignore or abandon it [5]. Recent advances in Machine Learning (ML) have provided a notable impetus for managing these datasets and, contrary to traditional statistical methods, have proven useful when analyzing unstructured data, such as the main chief complaint [6].

In the biomedical domain, one can partition textual data into two narratives: Biomedical and Clinical text. The former appears in books, articles, abstracts, and other meticulously treated textual sources that are written to convey research results. The latter is written by healthcare providers in order to document clinical events and to ease communication among healthcare providers, by describing a patient's status by means of concepts. One way to represent clinical knowledge from clinical textual sources is by means of Natural Language Processing (NLP), which encodes the clinical information present in the textual documents. Nevertheless, clinical text presents various challenges to the NLP field. These textual data sources are not meticulously treated like those of the biomedical narrative. The lack of rigor results in short, ungrammatical documents, without a standard terminology, that are overloaded with abbreviations, locally developed terms, and other shorthand textual characteristics, intentionally written for easier communication among healthcare providers [7].

1.1 Motivation

Healthcare facilities are very complex and stressful environments, with great demand, where decisions need to be taken quickly. EDs are responsible for the provision of emergency medicine and surgical care to patients arriving at the hospital in need of immediate care. Because patient attendance at the ED is unplanned, whether patients arrive by their own means or by ambulance service, the ED must provide initial treatment for a wide range of illnesses and injuries, some of which may require immediate treatment. In order to differentiate patients according to severity, patients are subject to triage when they attend the ED. Commonly, the ED operates 24 hours a day, and the staffing level of healthcare providers may vary according to the time of day or the season of the year, in an attempt to provide the best possible care according to patient volume.

Presently, ED services face a high and increasing demand, and are subject to patient revisits. This situation leads to overcrowding within the ED, which may increase patients' length of stay, resulting in stressful situations. Revisits are often considered a measure of healthcare quality; they consume medical resources and may indicate patient safety issues. By predicting the risk that a patient will visit the ED again within 72 hours after being discharged from the ED, clinical interventions may be taken to prevent revisits, since knowing that a patient belongs to a risk group may guide healthcare providers to reevaluate the patient's status before discharge from the ED.

There are several studies focused on detecting risk factors that may lead to an ED revisit, including diagnosis, main chief complaint discriminators, and patient demographics [8]. The abdominal pain and fever main chief complaint discriminators were the most common causes of patient revisits in [9]. In [10] and [11], the main reason for revisiting the ED was misdiagnosis, and [10] also concluded that fatigued healthcare providers provided inadequate medical care. [12] focused on diagnostic predictors and concluded that Dehydration, Septicemia, Abdominal pain, Seizure, Asthma, Urinary tract infection, and Pneumonia were the most useful initial diagnosis predictors of an ED revisit. Despite the number of studies dedicated to identifying risk factors related to ED revisits, very few studies on their prediction have been carried out.

Since ED revisits cause unnecessary overcrowding in the facility, and crowding within the ED has significant negative consequences for both patients and healthcare facilities [13–15], being able to predict ED revisits helps to decrease their rate. This results in: Better management of the ED resources; Overcrowding prevention; Medical care cost reduction; Patient satisfaction regarding the provided service; Fatigue management of the healthcare providers (less stressed and exhausted staff); Enhanced quality of emergency care.

Most computer-based algorithms in healthcare are rule-based expert systems that encode knowledge on a particular field, whose goal is to provide healthcare providers with conclusions about a specific clinical scenario. These systems are based on the principles of medicine. One example in healthcare is the LACE index, which provides the 30-day readmission risk of a patient. The computation of the LACE index is fairly simple: it is based on length of stay, acuity of the admission, patient comorbidities, and ED visits within the last six months [16]. Similar tools exist, such as the sequential organ failure assessment score, known as the SOFA score, which is computed using Partial Pressure of Oxygen, Fraction of inspired Oxygen, Platelets, Glasgow Coma Scale, Bilirubin, Mean arterial pressure, and Creatinine values, together with whether the patient is under mechanical ventilation. This system predicts the clinical outcomes of critically ill patients and is used to track a patient's status during the stay in an intensive care unit to determine the patient's rate of organ failure [17].
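To make the rule-based nature of such scores concrete, the following is a minimal sketch of a LACE-style computation. The point values are an assumption: they follow the commonly published LACE scoring table as recalled here, are not given in this text, and should be checked against [16] before any use. The function name and argument names are hypothetical.

```python
def lace_score(length_of_stay_days, acute_admission, charlson_index, ed_visits_6m):
    """Return a LACE-style score from its four components.

    Point values are assumed from the commonly published LACE table,
    not taken from this thesis; verify against [16].
    """
    # L: length of stay, in whole days
    if length_of_stay_days < 1:
        l_points = 0
    elif length_of_stay_days <= 3:
        l_points = int(length_of_stay_days)
    elif length_of_stay_days <= 6:
        l_points = 4
    elif length_of_stay_days <= 13:
        l_points = 5
    else:
        l_points = 7

    # A: acuity of the admission (acute/emergent admission or not)
    a_points = 3 if acute_admission else 0

    # C: comorbidities, summarized by a Charlson comorbidity index
    c_points = charlson_index if charlson_index <= 3 else 5

    # E: ED visits within the last six months, capped at 4
    e_points = min(ed_visits_6m, 4)

    return l_points + a_points + c_points + e_points


# Hypothetical patient: 5-day stay, acute admission, Charlson index 2, one prior ED visit.
print(lace_score(5, True, 2, 1))
```

Note that the first argument is only known once the stay has ended, so a score of this form cannot be evaluated at triage.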

Given the existence of such expert systems, one may argue that the development of ML models is unnecessary. However, the LACE index was developed using a cohort from 2004 to 2008 based on Ontario, Canada demographics. In order to successfully use this scoring system, one should know whether the demographics of the patients being studied closely match those of the Ontario cohort. Another issue regarding the LACE score is its computation requirements. LACE requires the length of stay of a patient, meaning that this score is only applicable after patient discharge. This is where ML excels, since it learns from patient observations the important relationships that reliably predict outcomes. This means that ML models are customized to the provided data, and the designer has control over the data used when developing risk prediction models. E.g., data can be used upon patient admission rather than discharge. ML also fills the gap in handling a huge number of predictor variables, allowing the use of new kinds of data.

Recently, the Institute of Medicine conducted a study and concluded that the frequency of diagnostic errors is alarmingly high and that interventions to reduce such errors are minimal [18]. With the increase of electronic health records, ML algorithms will suggest high-value tests and reduce the overuse of testing, leading to cost reductions in healthcare facilities. This will develop at a slow pace, since ML models need to be built and validated individually for each clinical case, and the majority of the data is unstructured, making it inaccessible to algorithms without prior preprocessing and transformation. This is where NLP comes through, making it possible to extract knowledge from high-value textual data in electronic health records, enhancing risk prediction performance.


1.2 Predictive Models for Emergency Departments

In the literature we may find some related work dedicated to the development of predictive models focused on EDs that used information available at the time of triage. The work of [19] focused on developing a clinical decision support system whose objective was the identification of discriminatory characteristics that can predict Pediatric ED revisits within 72 hours after a patient is discharged. Initially, the authors considered 96 factors, which included the corresponding discriminator of the main chief complaint, physician diagnosis, 5 factors with respect to patient demographics, 8 factors regarding the patient's mode of arrival at the Pediatric ED, 35 factors related to the hospital environment, and 44 factors with respect to medical treatment. The ML framework focused on the wrapper approach, where feature selection was performed by means of particle swarm optimization techniques. This was combined with an optimization-based Discriminant Analysis via Mixed Integer Program model, named DAMIP, to generate classification rules, based on small subsets of discriminatory factors, that can be used to predict Pediatric ED revisits. Cross-validation was used while the algorithm searched through the space of subsets of discriminatory factors. This work resulted in a model whose predictive accuracy is over 80%, and the most important factors for a Pediatric ED revisit were: Diagnosis; Discriminator; The type of Provider.

The previous work paved the way for future research regarding the ED revisits case study. The main difference between the objective of [20] and that of the previous work is that [20] focused on the general ED. The authors considered 44 factors, of which 22 features represented the diagnosis and were binary, indicating whether a diagnosis in a certain category was given. The remaining ones described in the work included age, gender, arrival mode, urgency level, and length of stay. The ML framework also focused on the wrapper approach, where three feature selection techniques were explored: Particle swarm optimization (as in the previous work); Tabu search; Random search. The feature selection technique was combined with ML algorithms, where the authors also considered the DAMIP approach for the generation of classification rules. For comparison purposes they also considered linear discriminant analysis, NB, SVM, LR, nearest shrunken centroid, and neural network algorithms. During model training, under-sampling of the majority class (No Revisit) was performed to improve the prediction of the minority class (Revisit). This research concluded that, among the different classification strategies, only the NB and nearest shrunken centroid were capable of classifying some patients from the minority class. The sensitivity, specificity, precision, and F1-Score were 28.7%, 86.4%, 9.9%, and 14.8% for the NB model, and 67.5%, 48.4%, 6.4%, and 11.7% for the nearest shrunken centroid model. Given the low scores, the authors suggest that these models should not be considered by healthcare providers as a clinical decision support system. Among the three feature selection approaches, the one that showed better results was the metaheuristic tabu search coupled with the DAMIP model. From the 44 initial features only 19 were chosen as predictors, but the authors did not specify which ones were selected. The DAMIP model with Tabu search feature selection achieved a sensitivity, specificity, precision, and F1-Score of 69.0%, 62.7%, 8.9%, and 15.7%, respectively.
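The F1-Scores quoted for [20] can be related to the quoted precision and sensitivity, since the F1-Score is the harmonic mean of the two. The sketch below recomputes it from the percentages given above; the function name is hypothetical, and small differences from the reported values are expected because the inputs are already rounded.

```python
def f1_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# (precision, sensitivity, reported F1-Score) as quoted in the text for [20].
reported = {
    "NB": (0.099, 0.287, 0.148),
    "Nearest shrunken centroid": (0.064, 0.675, 0.117),
    "DAMIP with tabu search": (0.089, 0.690, 0.157),
}

for name, (precision, sensitivity, reported_f1) in reported.items():
    recomputed = f1_score(precision, sensitivity)
    print(f"{name}: recomputed {recomputed:.3f}, reported {reported_f1:.3f}")
```

Because the harmonic mean is dominated by the smaller of its two inputs, the F1-Score stays low in all three cases even when sensitivity approaches 70%: precision below 10% means that most patients flagged as Revisit did not in fact return.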


Similarly to the previous studies, [21] also focused on developing a clinical decision support system

whose objective was the identification of discriminatory characteristics that can predict adult ED revisits,

within 72 hours after a patient being discharged. The authors considered about 140 factors including:

Patient demographics; Chief complaints; International Classification of Diseases, 9Th Revision, Clinical

Modification descriptions; Medical history; Laboratory tests; Handover; Vital signs; Language; Arrival

mode; Admission time; Admission date; Acuity category; Social issues. The patient demographics in-

cluded age, gender, ethnicity, nationality, and residential status. The chief complaints were classified in

18 categories, i.e. the discriminator associated to the main chief complaint was used. The acuity cate-

gory was characterized using four levels. The number of handovers only considered doctors. Medical

history consisted on all known medical conditions a patient had. Social issues considered generalized

weakness, severe social and financial issues, homeless, family violence, among others not described

in the study. In this study, a pre-selection of risk factors was performed using univariate analysis in

order to derive statistically significant factors. Afterwards, feature selection using filter methods was

performed, on the set derived from univariate analysis, to generate sets of risk factors. The authors

explored different filter methods such as joint mutual information, mutual information feature selection,

conditional mutual information maximization, (max-relevance min-redundancy, interation capping, con-

ditional infomax feature extraction, double input symmetrical relevance, conditional mutual information,

and conditional redundancy. For each filter method, the top 15 factors were considered from which 9

sets of selected factors were obtained in order to obtain the 10 most important factors. As in the two

previous works, DAMIP was considered to generate classification rules and for comparison reasons LR

was used, where 10-fold cross validation was used during model training. This study resulted in a DAMIP

model with an overall accuracy and sensitivity of 72.6% and about 40%, respectively. Regarding the LR

model, the ROC-AUC score was 66.0% and several results of sensitivity and specificity were presented,

according to several cut-off probabilities. When the 5% probability cut-off was considered, the sensitivity and specificity were 51.0% and 70.8%, respectively, resulting in an overall accuracy of 69.9%. For a probability threshold of 70%, the sensitivity and specificity were 1.0% and 99.9%, respectively, resulting in an overall accuracy of 95.5%. Among the initial 140 factors the 10

most important factors in this study, sorted from most to least important, were: Social issues; Chronic

obstructive pulmonary disease; Substance misuse - Chief Complaint Discriminator; Neurotic disorders;

Open wound lower limb; Handover; Respiratory - Chief Complaint Discriminator; Open wound upper

limb; Fracture upper limb; Open wound of head, neck, and/or trunk.
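The overall accuracies reported for the two cut-offs follow from the sensitivity and specificity once the share of revisits is fixed, since accuracy is their prevalence-weighted average. The sketch below is only an illustration: the 4.5% revisit share is an assumption chosen because it is consistent with the reported figures, not a value stated in the cited study, and the function name is hypothetical.

```python
def overall_accuracy(sensitivity, specificity, prevalence):
    """Accuracy as the prevalence-weighted average of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)

# Assumed share of revisits (NOT stated in the cited study).
ASSUMED_PREVALENCE = 0.045

acc_low_cutoff = overall_accuracy(0.510, 0.708, ASSUMED_PREVALENCE)   # 5% cut-off
acc_high_cutoff = overall_accuracy(0.010, 0.999, ASSUMED_PREVALENCE)  # 70% cut-off
print(round(acc_low_cutoff, 3), round(acc_high_cutoff, 3))
```

Under this assumption, the high cut-off reaches a high accuracy while detecting almost no revisits, which suggests why accuracy alone is a poor summary for such an imbalanced outcome.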

In the work of [22], the objective was the development of a clinical decision support system to identify patients in risk groups for revisiting the ED within 72 hours after discharge. The data came from the Veterans Healthcare Network Upstate New York and consisted of: Demographics - age, gender, marital status, race, period of military service, and disability rating; Socioeconomic - income, homelessness, insurance status; Prior ED utilization - ED revisit within 72 hours, ED revisit within 30 days, number of ED

visits, number of primary care visits, number of tele-health encounters, total outpatient visits, number of

hospitalizations, total cost; Comorbidities - 285 clinically homogeneous groups. The adopted modelling

algorithm was multivariate LR, and the final model was developed with stepwise methods, retaining variables with p-values less than 0.05 to avoid overfitting. For model validation the authors used the split-sample method, where 2/3 of the data was used for model development and the remainder for validation. The

authors considered three sets of predictors to demonstrate model predictive power according to each

considered hypothesis regarding independent variables. The first hypothesis considered demographic

and socioeconomic variables. The second hypothesis used the variables of the first hypothesis and the

Prior ED utilization variables. The third, and final, hypothesis considered all variables. Model predic-

tive power was measured using ROC-AUC whose values, in validation, for the first, second, and third

hypothesis are 0.54, 0.70, and 0.73, respectively.

Although few works have been published regarding ED revisits, there are several studies that consider the ED and the data produced in this environment. The works of [23] and [24] used triage

and patient demographics information to develop ML models to predict hospital admissions at the time

of ED triage. Both studies considered LR and Extreme gradient boosting algorithms. The former also

considered deep neural networks, the latter also took into account artificial neural networks, decision

trees, random forests, and SVM (linear kernel) algorithms. Still focused on prediction models to predict

hospital admission at the time of triage, the works of [25] and [26] differ from the two previous ones since

these works considered unstructured data and developed NLP frameworks to preprocess the textual

data in order to be used as a predictor for model development. Both works considered the LR algorithm.

The former also used multilayer neural network models, whereas the latter considered the decision trees,

random forests, extremely randomized tree, AdaBoost, MNB, SVM (linear kernel), and Nu-SVM (linear

kernel) models. The works of [27] and [28] had as objective the development of ML prediction models of

patient mortality using information available at ED triage. The former used multivariate LR and the latter

considered the random forest and classification and regression tree algorithms.

Several works are dedicated to textual data coming from the ED. Resorting to NLP techniques applied to the main chief complaint textual data, the work of [29] aimed at classifying the free-text

present in the main chief complaint into seven syndromic categories: Respiratory; Botulinic; Gastroin-

testinal; Neurologic; Rash; Constitutional; Hemorrhagic. The model was based on the M+ system

which is a robust chart-based semantic model with a Bayesian network based semantic model for extracting information from narrative patient records [30]. The developed system was able, for most syndromes, to classify about half of the patients' main chief complaints into syndromic categories, with specificities higher than 90% for all syndromic categories and sensitivities ranging from 12% to 44%. The work

of [31] focused on coding the main chief complaints according to 228 possibilities, i.e. the corresponding

discriminator related with the main chief complaint. The authors developed the Coded Chief Complaints

for Emergency Department Systems classification schema which is a text-parsing algorithm that reads

the main chief complaint and classifies it into 1 of the 228 coded possibilities. In [32] a system capable

of predicting the main chief complaint of a patient according to patient status, i.e. vital signs and state at

ED arrival, was developed. The system consists of a linear SVM trained using the BoW approach on triage notes, where negation was taken into account (NegEx [33]; a NegEx-like system to which some rules were added; a perceptron classifier) so that the model did not consider negated complaints as candidates for the main chief complaint. In [34] the focus was on building a concept-oriented terminology for nursing language entries, since a standard vocabulary does not exist for the chief main complaint. The authors used NLP techniques and the Unified Medical Language System [35]. The NLP techniques used consisted of normalization, tokenization, stemming, word sense disambiguation, word lookup, spelling correction, and abbreviation expansion.
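To make three of the listed steps concrete (normalization, tokenization, and abbreviation expansion), a minimal sketch is given below. It is not the pipeline of [34]: the abbreviation table, the function names, and the example complaint are all hypothetical, and a real system would rely on a curated clinical lexicon.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "sob": "shortness of breath",
    "abd": "abdominal",
    "c/o": "complains of",
}

def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def tokenize(text):
    """Split on spaces and strip surrounding punctuation (inner '/' is kept)."""
    tokens = [tok.strip(".,;:!?()") for tok in text.split(" ")]
    return [tok for tok in tokens if tok]

def expand_abbreviations(tokens):
    """Replace known abbreviations by their expanded, re-tokenized form."""
    expanded = []
    for tok in tokens:
        expanded.extend(ABBREVIATIONS.get(tok, tok).split(" "))
    return expanded

def preprocess(text):
    return expand_abbreviations(tokenize(normalize(text)))

tokens = preprocess("C/O  SOB, abd pain.")
print(tokens)
```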

1.3 Objectives and Contributions

The first goal of this study is to advance knowledge within the healthcare area by applying ML meth-

ods to healthcare data, structured (demographics, vital signs, dummies) and unstructured (chief main

complaint). This data is used to develop a predictive model that identifies the risk of adult patients’ ED revisits within 72 hours after discharge. By identifying this type of patient, decisions regarding their care can be taken in a timely manner to reduce the risk of patient revisit, improve patient safety, and reduce the healthcare facility cost. Given this, five hypotheses regarding the predictors

were made. The first consists of using Baseline triage variables. The second uses the variables of the first hypothesis with some additional variables (named All Numeric). The third hypothesis, Textual, only considers the chief main complaint. The fourth uses variables from the first and third hypotheses and is named Textual and Baseline. The fifth and final hypothesis, Textual and All Numeric, considers all the variables, i.e. variables from the second and third hypotheses. For a full description of the variables the reader is referred to table 5.1.

The second goal is to determine whether the chief main complaint contains relevant information indicating that a patient belongs to a risk group, since, to gain knowledge about the causes of a patient's ED revisit, it is important that the predictive model identifies the factors that are critical for prediction.

The third goal is to find the best pairing of ML model and textual feature extraction technique for predicting patient ED revisits.

To cover all aspects, three research questions (RQ) were formulated:

• RQ1 - Does the textual data increase the revisits prediction power?

• RQ2 - Which ML model and textual feature extraction technique are most suitable for predicting

those risk-group patients?

• RQ3 - Which features best describe the risk of ED revisits?

Regarding the contributions, to the best of the author’s knowledge, this is the first study that focuses on predicting the risk of adult patients' ED revisits by applying ML techniques to ED triage data and applying NLP to the chief main complaint in order to use it as a predictor in the development of prediction models.

1.4 Thesis Outline

This document is composed of 6 chapters, including the present one. The remainder of the document is organized as follows:

Chapter 2 addresses ML algorithms and how to evaluate such ML models. Some explanation re-

garding concerns when working with ML is also provided.

Chapter 3 describes the difficulties when dealing with (sub)language and explains, with several

examples, techniques to overcome them.

Chapter 4 presents in detail the chosen methodology in this work. It starts by describing the pre-

processing of textual data, the developed NLP framework, and at the end an example using real data

is presented. The chosen data modeling strategy follows and the chapter ends describing how models

were evaluated and selected.

Chapter 5 starts with the description of the database and how it was handled in order to be used.

Afterwards, a summary of the main results is provided.

Chapter 6 summarizes the methodologies implemented throughout the work, presents the main conclusions drawn from it, and finally gives recommendations for further study.

Chapter 2

Machine Learning

Arthur Lee Samuel (1901-1990), considered one of the pioneers in the field of Artificial Intelligence

(AI) [36], conceived one of the earliest definitions of Machine Learning (ML) stating, in [37], the following:

”A computer can be programmed so that it will learn to play a better game of checkers than can be

played by the person who wrote the program.”

Tom Michael Mitchell proposed a more formal and broad interpretation of learning: ”A computer

program is said to learn from experience E with respect to some class of tasks T and performance

measure P, if its performance at tasks in T, as measured by P, improves with experience E.” [38]

Combining these two definitions one can interpret ML as a computer science theory that provides

systems the ability to automatically learn and to progressively improve performance from experience

without being explicitly programmed.

The process of learning begins with observations, in order to look for patterns in data, by means of

statistical techniques, and make better decisions in the future based on the examples that we provide.

The primary aim is to allow computers to learn automatically, without human intervention, and to adjust their actions accordingly.

ML approaches are usually partitioned in three different types: Supervised, Unsupervised and Re-

inforcement Learning. Figure 2.1 illustrates those partitions and the different problems tackled by each

ML technique.

Supervised Learning methods are techniques that seek to find the relationship between input variables, also referred to as features or independent variables, and a target attribute, typically called the outcome or dependent variable, that can be either discrete or continuous. The mapping function can be generally described as y = f(x), where y is the dependent variable and x is the independent variable. Depending on the function approximation task, one can be dealing with a classification or a regression problem. The main difference between classification and regression tasks lies in the type of the outcome, i.e., whether it is discrete or continuous. If the outcome is discrete, the problem at hand is a classification problem and the outcome can be renamed as the class, where the classes are pre-defined. The mapping function f(x) predicts the class for a given set of features. Usually the prediction is given by a continuous value indicating the probability of a given observation belonging to each class. A predicted probability can be converted into a discrete class value by selecting a probability threshold that separates the class labels.
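As a small illustration of this last step, the sketch below converts predicted probabilities of class 1 into class labels for two different thresholds. The probabilities are made up, the function name is hypothetical, and treating a probability equal to the threshold as class 1 is an assumed convention.

```python
def to_class(probability, threshold=0.5):
    """Map a predicted probability of class 1 to a discrete label (0 or 1)."""
    return 1 if probability >= threshold else 0

# Made-up predicted probabilities for four observations.
probs = [0.03, 0.07, 0.40, 0.85]

labels_default = [to_class(p) for p in probs]                  # threshold 0.5
labels_low = [to_class(p, threshold=0.05) for p in probs]      # threshold 0.05
print(labels_default, labels_low)
```

Lowering the threshold assigns more observations to class 1, trading specificity for sensitivity.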

[Figure 2.1 diagram: Machine Learning branches into Supervised Learning (Classification, Regression), Unsupervised Learning (Clustering, Density Estimation), and Reinforcement Learning (Model Based, Temporal Difference, Policy).]

Figure 2.1: Different types of machine learning techniques.

If the outcome is continuous the task being handled is of the regression type and normally the

outcome describes a quantity. The mapping function predicts a real-value continuous quantity for a

given observation.

Unsupervised Learning differs from Supervised Learning since the only known data are the ob-

servations, i.e., all data is unlabelled. The goal of Unsupervised Learning is to model the underlying

distribution in the data in order to learn more about it and extract more knowledge. According to [39]

the most commonly used strategies are: Cluster Analysis [40], Data Compression [41] and Anomaly

Detection [42].

Reinforcement Learning is concerned with developing intelligent agents capable of learning to achieve a certain goal, in an uncertain and likely complex world, through a sequence of actions in order to maximize their reward [43]. The designer of such an agent sets the reward policy and settles the trade-off

between exploration and exploitation so the agent takes sub-optimal actions in order to correct the policy

[44].

2.1 Classification Algorithms

This section explains some of the most used Supervised Learning algorithms in healthcare for clas-

sification of dichotomous data. Firstly the probabilistic approaches, both discriminative and generative

models [45], will be described, followed by a similarity-based approach, a margin model.

2.1.1 Logistic Regression

LR is a discriminative log-linear probabilistic type of modelling where the outcome is a categorical

variable. It is assumed that the probability of each class can be constructed from a linear combination

of the features describing the observation.

$$P[\mathrm{Class} = 1 \mid X = x] = \frac{1}{1 + e^{-\mathbf{w}^\top \phi(x)}} \tag{2.1.1}$$

This method is named after its core function, the logistic function (equation 2.1.1), which models the probability of the binary outcome as a function of a vector of N predictor variables describing the observation, φ(x) = [φ1(x), φ2(x), . . . , φN(x)]⊤, and of the regression coefficients, sometimes referred to as weights, w = [w1, w2, . . . , wN]⊤, where each weight indicates how much the corresponding feature contributes to the prediction of the class. The weight vector can be directly optimized

using any local search method, such as Coordinate Descent (CD) and Stochastic Gradient Descent

(SGD).

The loss of this method, given in equation 2.1.2, is the negative log-likelihood.

$$l(x, y, f) = -\log\left(f(y \mid x)\right) \tag{2.1.2}$$

Given equation 2.1.1 one can easily compute the logistic function for the other outcome noting the

following:

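$$P[\mathrm{Class} = 0 \mid X = x] = 1 - P[\mathrm{Class} = 1 \mid X = x] = \frac{e^{-\mathbf{w}^\top \phi(x)}}{1 + e^{-\mathbf{w}^\top \phi(x)}} = \frac{1}{1 + e^{\mathbf{w}^\top \phi(x)}} \tag{2.1.3}$$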

With the results of equations 2.1.1 and 2.1.3, and encoding the two classes as y ∈ {−1, 1}, equation 2.1.2 can be rewritten as expressed in equation 2.1.4.

$$l(x, y, f) = \log\left(e^{-y_i \mathbf{w}^\top \phi(x_i)} + 1\right) \tag{2.1.4}$$
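The equivalence between equations 2.1.2 and 2.1.4 can be checked numerically. The sketch below is only an illustration: the linear score z = w⊤φ(x) is an arbitrary made-up value, the function names are hypothetical, and the classes are assumed to be coded as 0/1 for equation 2.1.2 and as −1/+1 for equation 2.1.4.

```python
import math

def sigmoid(z):
    """Logistic function of equation 2.1.1, with z = w^T phi(x)."""
    return 1.0 / (1.0 + math.exp(-z))

def nll_from_probability(z, label01):
    """Negative log-likelihood (equation 2.1.2) with classes coded as 0/1."""
    p1 = sigmoid(z)
    return -math.log(p1 if label01 == 1 else 1.0 - p1)

def logistic_loss(z, label_pm1):
    """Equation 2.1.4 with classes coded as -1/+1."""
    return math.log(math.exp(-label_pm1 * z) + 1.0)

z = 0.8  # arbitrary linear score
# Absolute difference between the two forms, for each of the two classes.
gaps = [
    abs(nll_from_probability(z, label01) - logistic_loss(z, label_pm1))
    for label01, label_pm1 in [(1, 1), (0, -1)]
]
print(gaps)
```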

In an LR model, the inherent inductive bias is related to the assumed form of the dependence of the output on the features. In particular, LR is a type of log-linear model, in which the log-odds of the output depend linearly on the features describing the observation.

For the LR model the decision boundary, assuming a probability threshold of 0.5, is given by the

solution to the equation P [Class = 0 | X = x] = P [Class = 1 | X = x], which yields equation 2.1.5.

$$\mathbf{w}^\top \phi(x) = 0 \tag{2.1.5}$$

If we instead consider another probability threshold, called t for generalization purposes, then the

decision boundary, given by equation 2.1.6, is the result of the equation P [Class = 1 | X = x] = t.

$$\mathbf{w}^\top \phi(x) = -\ln\left(\frac{1 - t}{t}\right), \quad 0 < t < 1 \tag{2.1.6}$$
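A short sketch of equation 2.1.6 follows, showing that thresholding the probability at t is the same as thresholding the linear score at −ln((1 − t)/t). The function names and the chosen values of t are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function of equation 2.1.1, with z = w^T phi(x)."""
    return 1.0 / (1.0 + math.exp(-z))

def score_cutoff(t):
    """Right-hand side of equation 2.1.6: the score at which P[Class=1|x] = t."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    return -math.log((1.0 - t) / t)

# Score cut-offs for a few illustrative probability thresholds.
cutoffs = {t: score_cutoff(t) for t in (0.05, 0.5, 0.7)}
print(cutoffs)
```

For t = 0.5 the cut-off is zero, recovering equation 2.1.5; thresholds below 0.5 shift the boundary towards negative scores.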

Given a training set D = {(xn, yn), n = 1, 2, ..., N}, where x represents a data point and y the

respective Class, the Empirical Risk which represents a proxy of the true risk for the LR model is given

by equation 2.1.7, corresponding to the negative log-likelihood of the observed data. According to the

Empirical Risk Minimization principle [46], it is reasonable to minimize this quantity since the underlying

distribution µD is unknown. However the Empirical Risk provides no information about how well a given

mapping function generalizes beyond the training data.

$$R_{LR} = \frac{1}{N} \sum_{i=1}^{N} \log\left(e^{-y_i \mathbf{w}^\top \phi(x_i)} + 1\right) \tag{2.1.7}$$
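The following sketch evaluates the empirical risk of equation 2.1.7 on a made-up training set and reduces it with plain batch gradient descent. This is a simplification chosen for brevity, not the CD or SGD solvers mentioned earlier; the data, the step size, the number of iterations, and all function names are assumptions.

```python
import math

def score(w, x):
    """Linear score w^T phi(x); x is assumed to already hold the feature values."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x))

def empirical_risk(w, X, y):
    """Equation 2.1.7: average logistic loss, labels coded as -1/+1."""
    return sum(math.log(math.exp(-y_i * score(w, x_i)) + 1.0)
               for x_i, y_i in zip(X, y)) / len(X)

def risk_gradient(w, X, y):
    """Gradient of the empirical risk with respect to w."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        coeff = -y_i / (1.0 + math.exp(y_i * score(w, x_i)))
        for k, x_k in enumerate(x_i):
            grad[k] += coeff * x_k / len(X)
    return grad

# Made-up training set; the leading 1.0 acts as an intercept feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [-1, -1, 1, 1]

w = [0.0, 0.0]
risk_start = empirical_risk(w, X, y)
for _ in range(200):  # plain batch gradient descent with step size 0.1
    g = risk_gradient(w, X, y)
    w = [w_k - 0.1 * g_k for w_k, g_k in zip(w, g)]
risk_end = empirical_risk(w, X, y)
print(risk_start, risk_end)
```

With all weights at zero every observation has probability 0.5, so the starting risk equals log 2; the risk then decreases as the weights are fitted to the training data.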

2.1.2 Naive Bayes

The Naive Bayes (NB) classifier is a simple probabilistic model based on applying Bayes' theorem with the strong, naive assumption of conditional independence between each pair of features. Unlike discriminative methods, this type of model implicitly estimates the underlying distribution µD, making it a generative model. Despite the fact that the independence assumption is usually false, there are some theoretical reasons for the apparently unreasonable efficacy of NB classifiers [47]. Another problem concerns the quality of the probability estimates but, as pointed out in [48], despite the probability overestimation, the decision depends only on which probability is larger in order to classify an observation as being either Class 0 or 1.

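$$P(y \mid \phi_1(x), \ldots, \phi_N(x)) = \frac{P(y)\, P(\phi_1(x), \ldots, \phi_N(x) \mid y)}{P(\phi_1(x), \ldots, \phi_N(x))} \tag{2.1.8a}$$

$$P(\phi_i(x) \mid y, \phi_1(x), \ldots, \phi_{i-1}(x), \phi_{i+1}(x), \ldots, \phi_N(x)) = P(\phi_i(x) \mid y) \tag{2.1.8b}$$

$$\Downarrow$$

$$P(y \mid \phi_1(x), \ldots, \phi_N(x)) = \frac{P(y) \prod_{i=1}^{N} P(\phi_i(x) \mid y)}{P(\phi_1(x), \ldots, \phi_N(x))} \tag{2.1.8c}$$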

According to Bayes' theorem, the relationship between the class (dependent) variable y and the feature vector φ(x) = [φ1(x), φ2(x), . . . , φN(x)]⊤ is given by equation 2.1.8a, where P(y) is the prior, ∏_{i=1}^{N} P(φi(x) | y) the likelihood (once the independence assumption is applied), and P(φ1(x), . . . , φN(x)) the evidence. As already stated, the features φi(x), i = 1, . . . , N, are assumed to be conditionally independent given the class. Replacing the conditional probability in the numerator of equation 2.1.8a with the result of equation 2.1.8b, the relationship between y and φ(x) simplifies to equation 2.1.8c. Since the evidence does not depend on the class and the values of the features are known, the evidence is constant. This term can therefore be discarded from equation 2.1.8c, resulting in equation 2.1.9a. The decision rule is thus given by equation 2.1.9b, using maximum a posteriori estimation, which chooses the most probable hypothesis.

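$$P(y \mid \phi_1(x), \ldots, \phi_N(x)) \propto P(y) \prod_{i=1}^{N} P(\phi_i(x) \mid y) \tag{2.1.9a}$$

$$\Downarrow \quad \text{(maximum a posteriori estimation)}$$

$$\hat{y} = \arg\max_{y} \left( P(y) \prod_{i=1}^{N} P(\phi_i(x) \mid y) \right) \tag{2.1.9b}$$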

A class prior may be calculated by assuming equiprobable classes, or by estimating the class probability from the training set, i.e., P(yi) = (number of samples of class i) / (total number of samples). To estimate the parameters of a feature's distribution, it is necessary to assume a distribution for the likelihood of the features, P(φi(x) | y), or to generate non-parametric models for the features from the training set [49]. This is where the NB classifiers differ, since they make different assumptions regarding the likelihood.

Multinomial Naive Bayes

Multinomial Naive Bayes (MNB) implements the NB algorithm for multinomially distributed data,

where the distribution is parametrized by vectors θy = (θy1 , . . . , θyn) for each class y, where n is the

number of features and θyi is the likelihood, P (φi(x) | y), of feature i appearing in a sample belonging

to class y.

Commonly, Maximum Likelihood Estimation is used to estimate the parameters θy, as given by equation 2.1.10, where $N_{yi} = \sum_{x \in T} \phi_i(x)$ is the number of times feature i appears in the samples of class y in the training set T, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class y.

$$\theta_{yi,\text{empirical}} = \frac{N_{yi}}{N_y} \tag{2.1.10}$$

The problem with the Maximum Likelihood Estimation estimate is that it is zero for any term-class combination that did not occur in the training data. This is due to sparseness, since the training data is never large enough to represent the frequency of rare events adequately.

One way to solve the sparseness problem is through smoothing techniques, where a pseudo-count α is added to the empirical likelihood estimate. This factor controls the smoothing strength, accounts for features not present in the learning samples, and prevents zero probabilities in further computations. Given this, instead of using equation 2.1.10 to compute the feature likelihood, equation 2.1.11 is used.

$$\theta_{yi,\alpha\text{-smoothed}} = \frac{N_{yi} + \alpha}{N_y + \alpha n}, \quad \alpha \ge 0 \tag{2.1.11}$$

Recalling equation (2.1.9a) the decision rule for the MNB classifier is given by equation 2.1.12.

$$\hat{y} = \arg\max_{y} \left( P(y) \prod_{i=1}^{N} \theta_{yi,\alpha\text{-smoothed}} \right) \tag{2.1.12}$$
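A minimal sketch of the MNB estimates and decision rule is given below, under stated assumptions. The toy count matrix and the function names are made up. The scores are computed in log space, and each likelihood is weighted by the feature count in the observation, which is the usual multinomial form and goes beyond equation 2.1.12 as printed.

```python
import math

def train_mnb(X, y, alpha=1.0):
    """Class priors and alpha-smoothed likelihoods (equation 2.1.11)."""
    n_features = len(X[0])
    priors, thetas = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        counts = [sum(row[i] for row in rows) for i in range(n_features)]  # N_yi
        total = sum(counts)                                                # N_y
        thetas[c] = [(n_yi + alpha) / (total + alpha * n_features)
                     for n_yi in counts]
    return priors, thetas

def predict_mnb(x, priors, thetas):
    """Class with the highest log-posterior score (assumed count-weighted form)."""
    scores = {
        c: math.log(priors[c])
        + sum(count * math.log(theta) for count, theta in zip(x, thetas[c]))
        for c in priors
    }
    return max(scores, key=scores.get)

# Made-up term counts for a hypothetical 3-term vocabulary.
X = [[2, 1, 0], [1, 2, 0], [0, 0, 3], [0, 1, 2]]
y = [1, 1, 0, 0]

priors, thetas = train_mnb(X, y, alpha=1.0)
pred_a = predict_mnb([1, 1, 0], priors, thetas)
pred_b = predict_mnb([0, 0, 2], priors, thetas)
print(thetas, pred_a, pred_b)
```

The third term never occurs in class 1, yet its smoothed likelihood is 1/9 rather than zero, which is the effect the pseudo-count is meant to achieve.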

Complement Naive Bayes

Complement Naive Bayes (CNB) is an adaptation of the MNB algorithm and is suited to deal with imbalanced datasets in text classification tasks [50]. In particular, CNB uses statistics from the complement

of each class to compute the θy vectors.

$$\theta_{yi} = \frac{\alpha + \sum_{j : y_j \neq y} d_{ij}}{\alpha n + \sum_{j : y_j \neq y} \sum_{k} d_{kj}} \tag{2.1.13}$$

The likelihood is thus computed as shown in equation 2.1.13, where the summations are over all data points xj that do not belong to class y, hence the name of the algorithm, and dij is either the count or the Term Frequency-Inverse Document Frequency value of term i in document j, explained in sections 3.6.1 and 3.6.2, respectively. If feature φi(x) does not come from textual data, e.g., a physiological variable, dij is the value of feature φi(x) for data point xj. α is a smoothing hyper-parameter like the one described before.

The classification rule is given by equation 2.1.14, i.e., an observation is assigned to the class with the lowest score, that is, the poorest complement match. The normalization $w_{yi} = \frac{\log \theta_{yi}}{\sum_{j} |\log \theta_{yj}|}$ addresses the tendency for longer documents, when dealing with textual data, to dominate parameter estimates in MNB.

$$\hat{y} = \arg\min_{y} \left( P(y) \sum_{i} w_{yi} \right) \tag{2.1.14}$$
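The complement estimates and the normalized weights can be sketched as follows. The toy data and function names are made up. As an assumption that differs from equation 2.1.14 as printed, the score uses the common formulation in which each weight is multiplied by the feature value of the observation, and the prior factor is omitted.

```python
import math

def train_cnb(X, y, alpha=1.0):
    """Normalized complement weights w_yi built from equation 2.1.13."""
    n_features = len(X[0])
    weights = {}
    for c in sorted(set(y)):
        # Statistics come from the documents that do NOT belong to class c.
        complement = [x for x, label in zip(X, y) if label != c]
        counts = [sum(row[i] for row in complement) for i in range(n_features)]
        total = sum(counts)
        theta = [(alpha + d) / (alpha * n_features + total) for d in counts]
        logs = [math.log(t) for t in theta]
        norm = sum(abs(v) for v in logs)
        weights[c] = [v / norm for v in logs]
    return weights

def predict_cnb(x, weights):
    """Class with the lowest weighted score, i.e. the poorest complement match."""
    scores = {c: sum(t_i * w_i for t_i, w_i in zip(x, w_c))
              for c, w_c in weights.items()}
    return min(scores, key=scores.get)

# Made-up term counts for a hypothetical 3-term vocabulary.
X = [[2, 1, 0], [1, 2, 0], [0, 0, 3], [0, 1, 2]]
y = [1, 1, 0, 0]

weights = train_cnb(X, y, alpha=1.0)
pred_a = predict_cnb([1, 1, 0], weights)
pred_b = predict_cnb([0, 0, 2], weights)
print(weights, pred_a, pred_b)
```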

2.1.3 Support Vector Machines

The two previous approaches assume that the training data adheres to a probabilistic structure, i.e.,

the relation between the input and the output can be described using a probabilistic framework. The

goal of such learning algorithms is thus to uncover such structure from the data by building probabilistic

models using the training data and relying on such model to make predictions about unseen data,

assuming that the model was capable to generalize beyond the training data.

Support Vector Machine (SVM) is an alternative kind of approach in which the data is its own model, i.e., the assumption of this algorithm is that the geometry of the observations is central in deciding the outcome associated with each observation. The aim of the SVM classifier is to find the hyper-plane that maximizes the geometric margin, i.e., the distance between the decision boundary and the closest examples in the training set. In this type of classifier, the distance between the decision boundary and the closest observations of both classes is exactly the same.

Since this method maps the observations directly to their respective classes without modeling any probability distribution or structure of the data, just like the LR approach, SVM is a discriminative model. The original SVM algorithm, developed by [51], was based on the work of Vladimir N. Vapnik and Alexey Ya. Chervonenkis on pattern recognition [52]. The main disadvantage of the original SVM algorithm is that it only works when the data is linearly separable. Another unfavourable condition is that, even if there exists a hyper-plane capable of linearly separating the data, there may be errors, e.g., outliers, present in the data. This may lead to a small margin, which is undesirable since, in general, the larger the margin the lower the generalization error of the classifier.

In order to overcome such unfavourable conditions, the standard SVM algorithm was modified to accommodate non-linear transformations. This is known as the kernel trick and is achieved by replacing every dot product by a non-linear kernel function [53]. Another adjustment was to adopt a less strict objective function, achieved by adding slack variables to the optimization problem that penalize observations falling inside the margin [54]. This is known as soft-margin SVM.

To better understand the differences between these algorithms, figure 2.2 serves as a comparison

of the soft and hard margin algorithms given a toy set with two features φ1(x) and φ2(x), and an outlier

in one of the classes. It is illustrated in subfigure (2.2a) the effect of noise on the decision boundary

and the model’s margin when using the original SVM approach. The model has a very small margin

and the decision boundary is not capable of generalization. On the other hand, as shown in subfigure (2.2b), the relaxation provided by the slack variables results in a bigger margin and the algorithm generalizes better than the one in subfigure (2.2a).

Like most ML methods, the present algorithm also has a set of hyper-parameters where the main

one is the kernel. It maps the observations into some feature space in order to make them more easily

separable after transformation. There are many types of kernel functions, since they can be defined as the designer sees fit, but the standard ones are the linear, radial basis function, polynomial, and sigmoid kernels.
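The four standard kernels can be written down in a few lines. The sketch below follows a common parametrisation; the parameter names gamma, degree, and coef0, their default values, and the function names are assumptions rather than definitions taken from this work.

```python
import math

def dot(u, v):
    """Ordinary dot product between two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel: decays with the squared distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v, degree=2),
      rbf_kernel(u, v, gamma=0.5), sigmoid_kernel(u, v, gamma=0.01))
```

Each function returns a similarity between two observations, which is what replaces the dot product when the kernel trick is applied.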

[Figure 2.2: two scatter plots of the toy set with axes φ1(x) and φ2(x) and the support vectors marked. (a) Hard-margin SVM. (b) Soft-margin SVM.]

Figure 2.2: Comparison between the hard-margin SVM and the soft-margin SVM for a toy set.

Given a classification problem with a training set that consists of N instances X = {x1, x2, . . . , xN}, a pre-defined set of K features φ = [φ1, φ2, . . . , φK] and an outcome y ∈ {−1, 1}, the (soft-margin) SVM classifier is the solution to the constrained optimization problem depicted in equation 2.1.15.

The hyper-parameter common to all kernel functions, C, strikes a tradeoff between the width of the margin and how many "violations" of the margin are admissible. The vector w = [w1, w2, . . . , wK] is the normal vector to the hyper-plane, b is a constant, Φ is the transformation function, and ξi are the penalty terms corresponding to the distance from the margin, which in the optimization problem serve as slack variables that allow "violations" of the constraint yi(w⊤Φ(xi) + b) ≥ 1 − ξi.

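$$\min_{\mathbf{w}, b, \xi} \left\{ \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^{N} \xi_i \right\} \quad \text{subject to} \quad y_i(\mathbf{w}^\top \Phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N \tag{2.1.15}$$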

The formulation described in equation 2.1.15 is known as the primal formulation. Applying the Lagrange Multiplier method to this formulation, and the Karush–Kuhn–Tucker conditions to the corresponding Lagrangian, the optimization problem becomes the one formulated in equation 2.1.16. For a full mathematical description the reader is referred to [55]. This formulation is known as the dual problem, where α are the Lagrange multipliers, e is the vector of all ones, and Q is an N by N positive semidefinite matrix with Qij ≡ yiyjK(xi, xj), where K(xi, xj) = Φ(xi)⊤Φ(xj) is the kernel.

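$$\min_{\alpha} \left\{ \frac{1}{2} \alpha^\top Q \alpha - e^\top \alpha \right\} \quad \text{subject to} \quad y^\top \alpha = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N \tag{2.1.16}$$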

The SVM classifier can be determined by solving the associated quadratic optimization problem,

either in its primal or dual form (equation 2.1.15 and equation 2.1.16 respectively). Early approaches

relied on specialized methods for quadratic programming, but recent methods instead rely on local search approaches, such as CD and SGD, since the performance of such approaches is very competitive when compared to other more complex optimization algorithms and they are suitable when

dealing with large training sets [56]. To that purpose, and recalling the Empirical Risk Minimization

principle described in [46], we rewrite the problem formulated in equation 2.1.15 as the equivalent one

described in equation 2.1.17.

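$$\min_{\mathbf{w}, b} \left\{ \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^{N} \max\left\{0, 1 - y_i(\mathbf{w}^\top \Phi(x_i) + b)\right\} \right\} \tag{2.1.17}$$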

The Empirical Risk for the soft-margin SVM classifier is given by equation 2.1.18, associated to the

loss function equation 2.1.19, also known as hinge loss.

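$$R_{SVM} = \frac{1}{N} \sum_{i=1}^{N} \max\left\{0, 1 - y_i(\mathbf{w}^\top \Phi(x_i) + b)\right\} \tag{2.1.18}$$

$$l(x, y, f) = \max\left\{0, 1 - y_i(\mathbf{w}^\top \Phi(x_i) + b)\right\} \tag{2.1.19}$$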

The decision function for this approach is expressed in equation 2.1.20, where w⊤Φ(x) + b = 0 is the equation of the separating hyperplane.

$$f(x) = \operatorname{sgn}\left(\mathbf{w}^\top \Phi(x) + b\right) \tag{2.1.20}$$
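The hinge loss, the empirical risk, and the decision function can be illustrated for a linear kernel, where Φ(x) = x. In the sketch below the weights, the intercept, and the three observations are made up rather than the result of training, the function names are hypothetical, and assigning a score of exactly zero to class +1 is an assumed convention.

```python
def decision_value(w, b, x):
    """Signed score w^T x + b for a linear kernel (Phi(x) = x)."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """Equation 2.1.19 for one observation with label y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))

def empirical_risk(w, b, X, Y):
    """Equation 2.1.18: average hinge loss over the training set."""
    return sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y)) / len(X)

def predict(w, b, x):
    """Equation 2.1.20, with a score of zero assigned to class +1."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Made-up hyper-plane and observations.
w, b = [1.0, -1.0], 0.0
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.2]]
Y = [1, -1, 1]

losses = [hinge_loss(w, b, x, y) for x, y in zip(X, Y)]
predictions = [predict(w, b, x) for x in X]
print(losses, empirical_risk(w, b, X, Y), predictions)
```

The third observation is classified correctly but lies inside the margin, so it still incurs a positive hinge loss; this is the kind of "violation" that the slack variables account for.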

2.2 Learnability

In subsections 2.1.1 and 2.1.3 the term risk was used without prior definition. Let l(x, y, f) represent the loss incurred by a mapping function f given observation x when the desired outcome is y. The value of l(x, y, f) can be expressed in terms of a simpler function, d, that measures the "distance" between two outcomes. In particular, d(ŷ, y) denotes the observation-independent loss incurred by choosing outcome ŷ when the desired outcome is actually y.

Then, given d, l can be expressed as equation 2.2.1, where ŷ is a random variable denoting the outcome prescribed by f given observation x. In the case of a dichotomous outcome, d(ŷ, y) = I[ŷ ≠ y], yielding equation 2.2.2.

$$l(x, y, f) = \mathbb{E}_f\left[d(\hat{y}, y)\right] \overset{\text{def}}{=} \sum_{\hat{y} \in \mathcal{Y}} d(\hat{y}, y)\, f(\hat{y} \mid x) \tag{2.2.1}$$

$$l(x, y, f) = 1 - f(y \mid x) \tag{2.2.2}$$
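The step from equation 2.2.1 to equation 2.2.2 can be spelled out as follows. The only assumption is that f(· | x) is a probability distribution over the outcomes, so that its values sum to one; the hat on ŷ denotes the outcome chosen by f.

```latex
\begin{aligned}
l(x, y, f)
  &= \sum_{\hat{y}} \mathbb{I}[\hat{y} \neq y]\, f(\hat{y} \mid x)
     && \text{(equation 2.2.1 with } d(\hat{y}, y) = \mathbb{I}[\hat{y} \neq y]\text{)} \\
  &= \sum_{\hat{y} \neq y} f(\hat{y} \mid x)
     && \text{(only the wrong outcomes contribute)} \\
  &= 1 - f(y \mid x)
     && \text{(since } \textstyle\sum_{\hat{y}} f(\hat{y} \mid x) = 1\text{)}
\end{aligned}
```

In words, under the 0-1 distance the loss is the probability that f assigns to any outcome other than the desired one.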

The expected loss, already mentioned as risk, of a mapping function f is given by equation 2.2.3,

where µD is the unknown underlying distribution given by equation (2.2.4), with µ0 = P[X = x].

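$$L(f) = \mathbb{E}_{\mu_D}\left[l(x, y, f)\right] \overset{\text{def}}{=} \sum_{\text{obser.}} \sum_{y} l(x, y, f)\, \mu_D(x, y) \tag{2.2.3}$$

$$\mu_D(x, y) = P[Y = y \mid X = x]\, P[X = x] = f^*(y \mid x)\, \mu_0(x) \tag{2.2.4}$$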

In practice, prior knowledge usually takes the form of assumptions regarding the mapping function,

which introduces some biases. We refer to inductive bias as being the bias resulting from assumptions

that the learner uses to predict outputs given inputs that it has not encountered, and the set of mapping

functions considered by a learning agent as its hypothesis space, H.

The performance of a learning algorithm will depend on the hypothesis space considered and its

relation to the target mapping function f∗. For any learning algorithm A the following applies:

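$$\mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f)\right] = L^* + \left(\min_{f \in H} L(f) - L^*\right) + \left(\mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f)\right] - \min_{f \in H} L(f)\right) \tag{2.2.5}$$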

The first term of equation 2.2.5 corresponds to the unavoidable risk induced by the randomness of

the target mapping function. This term is usually referred to as the Bayes error rate and it does not depend on any particular design choice. The second term compares the risks of the best hypothesis in H and

the Bayes optimal policy. In a sense, this term accounts for the impact of the inductive bias in terms

of performance where richer hypothesis spaces tend to imply smaller inductive bias. It is, however,

independent of the learning algorithm and depends only on the choice of H. The third term measures

how accurately the learning algorithm A is able to identify the best hypothesis in H. It depends both on

A and the choice of H.

2.3 Overfitting and Regularization

Given a training set D and an arbitrary mapping function f , the risk, L(f), is expressed in equation

2.3.1. The decomposition comprises two components contributing to the risk incurred by f . The first

term corresponds to the Empirical Risk associated with f , measuring how ”well fit” f is to the training

data in D. The second term measures how well the empirical risk estimates the true risk of f .

$$L(f) = L_N(f) + \left(L(f) - L_N(f)\right) \tag{2.3.1}$$

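More generally, given a learning algorithm A, the expected risk can be decomposed as in equation 2.3.2.

$$\mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f)\right] = \mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L_N(f)\right] + \mathbb{E}_{D \sim \mu_D, f \sim A(D)}\left[L(f) - L_N(f)\right] \tag{2.3.2}$$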

A dummy example that reports the average loss incurred as a parameter, named C, varies from 0 to 7 is presented in figure 2.3. The dashed line corresponds to the empirical risk, obtained on the training set, while the solid line corresponds to the true risk, estimated from data never seen by the learning algorithm, usually referred to as the test set. The plot showcases the tradeoff between fitting and stability. For small values of C the system incurs a large loss both on the training and on the test sets. This situation occurs when the loss incurred by the algorithm is dominated by the term $\mathbb{E}_{D \sim \mu_D,\, f \sim A(D)}[L_N(f)]$, a situation known as underfitting.

Conversely, for large values of C, the term $\mathbb{E}_{D \sim \mu_D,\, f \sim A(D)}[L(f) - L_N(f)]$ dominates the loss incurred by the algorithm. In fact, as C increases, the empirical risk goes to zero, but the difference between the empirical and true risks keeps increasing, indicating that the former becomes a bad estimate of the latter. As a consequence, the learning algorithm becomes too specialized in the training data and its ability to generalize increasingly deteriorates, as illustrated in figure 2.3. This phenomenon is known as overfitting.

Overfitting also implies that small changes to the training set may translate into significant changes

to the mapping function resulting from the learning algorithm. Therefore, in selecting the value for C,

there must exist a balance between the ability of the learning algorithm to fit the training data and its

stability to changes in such data.

The overfitting phenomenon results from the fact that µD is unknown and, therefore, one must rely

on the empirical risk as an estimate of L(f). Unfortunately, the empirical risk LN (f) underestimates the

true risk L(f), and the difference between the two increases as the complexity of the hypothesis space

increases. Hence, the selection of the hypothesis space involves a tradeoff between its ability to fit the

data and the stability of the learning algorithm. To some extent, the fitting-stability tradeoff in equation

2.3.2 is similar to the bias-complexity tradeoff in equation 2.2.5, and can be addressed by controlling the

hypothesis space.

Figure 2.3: Empirical illustration of the fitting-stability tradeoff (loss versus C, for C from 0 to 7). The solid line (test error) corresponds to the estimated true risk, while the dashed line (training error) corresponds to the empirical risk. The plot highlights the two terms in equation 2.3.1, L_N(f) and L(f) − L_N(f).

An alternative approach to controlling the complexity of the hypothesis space is to use regularization.

As already stated in 2.1.1, the Empirical Risk Minimization principle suggests that the Empirical Risk can

be used as a surrogate for the true risk associated with a mapping function. Therefore, in terms of Empirical

Risk Minimization, learning consists in solving the optimization problem expressed in equation 2.3.3.

\[
\min_{f \in H} L_N(f) \tag{2.3.3}
\]

The use of regularization offers an alternative to the Empirical Risk Minimization principle, known as

the Regularized Loss Minimization (RLM) principle [57]. In terms of RLM, learning consists in solving

the optimization problem in equation 2.3.4, where R(f) is a regularization term associated with the

mapping function f. Typically, R(f) should be higher for more complex hypotheses and lower for simpler

hypotheses, thus favoring a more parsimonious model. The regularization term thus acts as a "stabilizer",

indicating it is admissible to perform a little worse in the training set if that means using a simpler (and

potentially more stable) hypothesis.

\[
\min_{f \in H} \big(L_N(f) + R(f)\big) \tag{2.3.4}
\]

The L1-regularization method, also called Least Absolute Shrinkage and Selection Operator (LASSO) regression, adds the sum of the absolute values of the coefficients, represented by ‖W‖_1 in equation 2.3.5, as a penalty term to the loss function. LASSO shrinks the coefficients of the less important features to zero, i.e., some of the features are completely neglected in the evaluation of the outcome. LASSO therefore helps to reduce the model complexity.

Equation 2.3.5 expresses the LASSO regression regularization function, where λ is the tuning parameter that decides how strongly the flexibility of the model is penalized. An increase in the flexibility of a model is reflected in an increase in its coefficients; hence, to minimize equation 2.3.4, these coefficients need to be small.

\[
R_{\mathrm{LASSO}}(f) = \lambda \|W\|_1 \tag{2.3.5}
\]

In the L2-regularization method, also called Ridge regression, the cost function is altered by adding a penalty equivalent to the square of the magnitude of the coefficients, represented by the squared L2 norm of W in equation 2.3.6. The tuning parameter λ regularizes the coefficients such that, if the coefficients take large values, the optimization function is penalized. Ridge regression therefore shrinks the coefficients.

\[
R_{\mathrm{Ridge}}(f) = \lambda \|W\|_2^2 \tag{2.3.6}
\]
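The regularized objective of equation 2.3.4 and the two penalties of equations 2.3.5 and 2.3.6 can be sketched in a few lines. This is only an illustration: the linear model without intercept, the squared-error empirical risk, the toy data, and the function names (l1_penalty, l2_penalty, empirical_risk, regularized_loss) are assumptions made for the example and are not taken from this work.

```python
def l1_penalty(weights, lam):
    """LASSO penalty (eq. 2.3.5): lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge penalty (eq. 2.3.6): lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def empirical_risk(weights, xs, ys):
    """Mean squared error of a linear model without intercept (illustrative loss)."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(w * xi for w, xi in zip(weights, x))
        total += (prediction - y) ** 2
    return total / len(xs)


def regularized_loss(weights, xs, ys, lam, penalty):
    """Objective of eq. 2.3.4: empirical risk plus a regularization term."""
    return empirical_risk(weights, xs, ys) + penalty(weights, lam)


# Toy data (hypothetical): the target depends only on the first feature.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, 0.0, 2.0]
weights = [2.0, 0.0]

print(regularized_loss(weights, xs, ys, lam=0.1, penalty=l1_penalty))
print(regularized_loss(weights, xs, ys, lam=0.1, penalty=l2_penalty))
```

With these weights the empirical risk is zero, so the printed values are the penalties alone; the Ridge penalty grows quadratically with the coefficient, while the LASSO penalty grows linearly.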

2.4 Model Assessment

In order to measure the risk of a model according to a hypothesis, and the ability of the given model to generalize beyond the training set, it is necessary to choose a metric. According to [58, 59], the most common metric used in healthcare is the ROC-AUC. It should be taken into account that, when dealing with statistical models, there is always some variability and uncertainty associated with them, since they depend on the population under study. In order to quantify this uncertainty one resorts to confidence intervals, applied to the performance metrics, and to McNemar's hypothesis test to compare ML algorithms.

2.4.1 Performance Metrics

The ROC-AUC, sometimes referred to as the c-statistic, is a performance measurement for classification problems at various threshold settings. It tells how capable the model is of distinguishing between classes. The higher the ROC-AUC score, the better the model is at predicting the outcome.

An excellent model has a ROC-AUC score close to 1, which means it has a good measure of separability. When the ROC-AUC score is 0.5, the model has no class separation capacity whatsoever, i.e., it is no better than a random guess.

ROC plots the False Positive Rate (FPR), on the x-axis, versus the True Positive Rate (TPR), on the y-axis, for a number of different candidate probability threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate.

Figure 2.4: AUC-ROC Curve for a dummy example using a LR model (FPR on the x-axis, TPR on the y-axis; the ROC curve is shown together with the curve of a model with no skill).

Figure 2.4 illustrates a ROC curve for a dummy set using a LR model. As one can note, each probability threshold value corresponds to a (FPR, TPR) pair of values. The ROC curve can be used not only as a performance measure but also as a tool to choose the threshold that best discriminates between classes.
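The construction of the ROC curve described above can be sketched as follows: every distinct score is used as a candidate threshold, each threshold yields one (FPR, TPR) pair, and the area is obtained with the trapezoidal rule. The labels, scores, and function names (roc_points, auc) are hypothetical and serve only as an illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above the highest score, where nothing is predicted positive.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for threshold in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a curve given as ordered (x, y) points (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
print(curve)
print(auc(curve))
```

The sketch assumes that both classes are present in y_true; otherwise one of the rates is undefined.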

The TPR, also called sensitivity, describes how good the model is at predicting the positive class

when the actual outcome is positive. This quantity is calculated as expressed in 2.4.1, where True

Positive (TP) is an outcome where the model correctly predicts the positive class and False Negative

(FN) is an outcome in which the model predicts the negative class, when in reality it is positive.

\[
\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2.4.1}
\]

The FPR is the proportion of all negatives that still yield positive test outcomes, i.e., the conditional probability of a positive test result given that the event was not present. Equation 2.4.2 shows how to compute the FPR, where False Positive (FP) is an outcome in which the model predicts the positive class when in reality it is negative.

\[
\text{FPR} = \frac{FP}{FP + TN} \tag{2.4.2}
\]

The FPR is also referred to as the inverted specificity, i.e. the complement of the specificity, where specificity measures the proportion of actual negatives that are correctly identified as such. The relation between specificity and FPR is given by equation 2.4.3, where True Negative (TN) is an outcome where the model correctly predicts the negative class.

\[
\text{Specificity} = 1 - \text{FPR} = \frac{TN}{TN + FP} \tag{2.4.3}
\]

Moving away from the ROC curve and the performance measures associated with it, precision is another way to assess a model's performance. Precision, presented in equation 2.4.4, measures the proportion of real positives among all of the instances that were identified as being positive.

\[
\text{Precision} = \frac{TP}{TP + FP} \tag{2.4.4}
\]

Another way to present the values of precision and sensitivity is by using the F1-Score. The F1-Score is the harmonic mean of precision and sensitivity, given by equation 2.4.5. If either precision or sensitivity is very small, the F1-Score is closer to the smaller of the two values than to the larger one, giving the model an appropriate score.

\[
\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{2.4.5}
\]
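Equations 2.4.1 to 2.4.5 can be evaluated directly from the four confusion-matrix counts. The sketch below uses hypothetical counts and a hypothetical function name (classification_metrics); it assumes that no denominator is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of equations 2.4.1 to 2.4.5 from confusion counts."""
    sensitivity = tp / (tp + fn)          # eq. 2.4.1 (TPR)
    fpr = fp / (fp + tn)                  # eq. 2.4.2
    specificity = tn / (tn + fp)          # eq. 2.4.3, equal to 1 - FPR
    precision = tp / (tp + fp)            # eq. 2.4.4
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # eq. 2.4.5
    return {
        "sensitivity": sensitivity,
        "fpr": fpr,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a test set with 100 instances.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```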

In order to measure the interrater reliability one resorts to the Cohen’s Kappa Statistic, κ. This

coefficient is statistically robust when dealing with unbalanced classes given that this statistic favors

correct classifications of the minority class over the majority one. This statistic is defined as expressed

in equation 2.4.6, where po is the empirical probability of agreement on the label assigned to any sample,

and pe is the expected agreement when both annotators assign labels randomly.

\[
\kappa = \frac{p_o - p_e}{1 - p_e} \tag{2.4.6}
\]
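As a minimal sketch of equation 2.4.6, the two "annotators" can be taken to be the true labels and the predictions of a binary classifier, with the expected agreement computed from the marginal label frequencies of each. This interpretation, the counts, and the function name (cohen_kappa) are assumptions made for illustration.

```python
def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa (eq. 2.4.6) for a binary confusion matrix."""
    n = tp + fp + tn + fn
    # Observed agreement: fraction of instances on which both labels coincide.
    p_o = (tp + tn) / n
    # Expected agreement by chance, from the marginal frequencies of each label.
    p_positive = ((tp + fp) / n) * ((tp + fn) / n)
    p_negative = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_positive + p_negative
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts for a test set with 100 instances.
kappa = cohen_kappa(tp=40, fp=10, tn=45, fn=5)
print(kappa)
```

Here the observed agreement is 0.85 and the expected agreement is 0.5, so the statistic discounts the part of the accuracy that chance alone would explain.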

2.4.2 Bootstrap Confidence Interval

A confidence interval is a pair of bounds on the estimate of a population variable, e.g. a performance metric. It is an interval statistic used to quantify the uncertainty of an estimate [60]. In binary classification problems, it is common practice to assume that the predicted variable can be described as a succession of independent events that either succeed or fail, i.e. Bernoulli trials, and that, when the sample is large enough, the predicted variable is approximately normally distributed, so that a parametric confidence interval can be computed. Despite their common use, the assumptions that underlie parametric confidence intervals are often violated: the predicted variable is not always normally distributed and, even when it is, either the variance of the normal distribution is not equal at all levels of the predictor variable or the assumption that each sample is independent does not hold [61]. Since the underlying distribution is unknown, one resorts to non-parametric estimation of confidence intervals. In these cases, the bootstrap resampling method can be used as a non-parametric method for estimating confidence intervals of a parameter for a given population. The bootstrap is a Monte Carlo simulation method in which samples are drawn with replacement from a fixed and finite dataset and a parameter, e.g. a performance metric, is estimated on each sample. This procedure leads to a robust estimate of the true population parameter via sampling.

When bootstrapping in the field of ML, it is common to make the resampled population the same size as the original one, but other proportions might be considered, e.g. 80%, depending on the time complexity of the models, the computational power, and the available time. This resampled population is used for training a ML model. When resampling is finished, some samples will not have been drawn; these are called out-of-bag samples and are used to test the model performance. In order to get reliable estimates it is recommended to draw 50 to 200 bootstrap samples [62]. For each bootstrap round the population variable is computed and stored, until the pre-defined number of bootstrap rounds is reached. Given the stored values, one can compute confidence intervals for a given degree of confidence α. Confidence intervals are commonly computed at the 95% confidence level [63]. A 95% confidence interval means that, when repeating the sampling, one should expect about one interval out of twenty not to include the true value of the population variable.

One way to compute the bounds of the confidence interval is by using the percentile method [64], where the lower bound and the upper bound are the α1 and α2 percentiles of the bootstrapped distribution, respectively. Given a degree of confidence α, these are α1 = α and α2 = 1 − α, which yields the 100 × (1 − 2 × α)% confidence interval.
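The percentile method can be sketched as below. For brevity, this illustration departs from the procedure described above: instead of retraining a model on each resample and testing it on the out-of-bag samples, it resamples a fixed list of per-instance outcomes (1 for a correct prediction, 0 otherwise) and recomputes the accuracy. The data, the nearest-rank percentile rule, and the function names (bootstrap_percentile_ci, mean) are assumptions.

```python
import random


def bootstrap_percentile_ci(values, statistic, n_rounds=200, alpha=0.025, seed=0):
    """Percentile bootstrap interval with coverage 100 * (1 - 2 * alpha) percent."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_rounds):
        # Draw n items with replacement from the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    # Nearest-rank percentiles of the bootstrapped distribution.
    lower = estimates[int(alpha * (n_rounds - 1))]
    upper = estimates[int((1 - alpha) * (n_rounds - 1))]
    return lower, upper


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical outcomes: 85 correct and 15 incorrect predictions.
correct = [1] * 85 + [0] * 15
low, high = bootstrap_percentile_ci(correct, mean)
print(f"95% CI for accuracy: [{low:.3f}, {high:.3f}]")
```

With alpha = 0.025 the interval spans the 2.5th to the 97.5th percentile, i.e. the 95% confidence interval in the notation of the text.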

2.4.3 McNemar’s Hypothesis test

Hypothesis testing is a statistical method that is used in making statistical decisions using experi-

mental data. Hypothesis testing is an essential procedure in statistics since it evaluates two mutually

exclusive statements about a population to determine which statement is best supported by the sample

data. The parameters associated with hypothesis testing are:

• Null Hypothesis (H0) - General statement that there is no relationship between two measured

phenomena, or no association among groups;

• Alternative Hypothesis (H1) - Hypothesis used in hypothesis testing that is contrary to the null

hypothesis;

• Level of significance (α) - Threshold on the p-value at which one rejects or fails to reject the

null hypothesis;

• P-value (p) - Probability of obtaining results at least as extreme as the observed ones when the null

hypothesis of a study question is true.


As with confidence intervals, there are some complications when choosing the right approach to hypothesis testing. There are also parametric and non-parametric strategies, where the parametric ones assume normally distributed data. Since this assumption is commonly violated, one can resort to non-parametric, i.e. distribution-free, hypothesis tests. One such test is McNemar's test, whose further advantage is its computational efficiency: it is suggested when it is expensive to train multiple copies of classifier models on big datasets [65]. McNemar's hypothesis test assesses whether a statistically significant change in proportions has occurred on a dichotomous trait. The test operates upon a 2 × 2 contingency table, which counts the joint outcomes of two binary variables, here whether each of two classifiers is correct on each instance of the test set, as illustrated in tables 2.1 and 2.2.

Table 2.1: Dummy example for the development of a Contingency Table.

Instance    Classifier1 Correct    Classifier2 Correct
1           Yes                    No
2           No                     No
3           No                     Yes
4           No                     No
5           Yes                    Yes
6           Yes                    Yes
7           Yes                    Yes
8           No                     No
9           Yes                    No
10          Yes                    Yes

Table 2.2: Contingency table derived from the dummy example.

                         Classifier1 Correct    Classifier1 Incorrect
Classifier2 Correct      4 (Yes/Yes)            1 (No/Yes)
Classifier2 Incorrect    2 (Yes/No)             3 (No/No)

McNemar's test statistic is calculated as shown in equation 2.4.7, where Yes/No is the count of test instances that Classifier1 classified correctly and Classifier2 classified incorrectly, and No/Yes is the count of test instances that Classifier1 classified incorrectly and Classifier2 classified correctly. As can be seen, the statistic reports on the differing correct or incorrect predictions between the two models, not on their accuracy or error rates. In terms of comparing two binary classification algorithms, the test comments on whether or not the two models disagree in the same way, not on whether one model is more or less accurate than the other.

The two terms used in the calculation of McNemar's test capture the errors made by both models. The test checks whether there is a significant difference between the counts in these two cells. If these cells have similar counts, both models make errors in approximately the same proportion, only on different instances of the test set. In this case, the result of the test would not be significant and the null hypothesis would not be rejected, given a significance level, as expressed in equation 2.4.8a, where p is the p-value. If the Yes/No and No/Yes counts are not similar, both models not only make different errors, but in fact have a different relative proportion of errors on the test set. In this case, the result of the test would be significant and one would reject the null hypothesis, given a significance level, as summarized in equation 2.4.8b.

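\[
\text{statistic} = \frac{(\text{Yes/No} - \text{No/Yes})^2}{\text{Yes/No} + \text{No/Yes}} \tag{2.4.7}
\]

\[
p > \alpha \implies \text{fail to reject } H_0 \tag{2.4.8a}
\]

\[
p \le \alpha \implies \text{reject } H_0 \tag{2.4.8b}
\]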
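The test can be illustrated on the dummy example of tables 2.1 and 2.2. The sketch below applies equation 2.4.7 as written, without continuity correction, and obtains the p-value from the chi-square distribution with one degree of freedom, which is the usual large-sample reference distribution for this statistic. With counts as small as in the dummy example this approximation is not reliable, so the numbers are purely illustrative. The function names (contingency_counts, mcnemar) and the significance level of 0.05 are assumptions.

```python
import math


def contingency_counts(correct_1, correct_2):
    """Count the discordant cells (Yes/No and No/Yes) of the contingency table."""
    yes_no = sum(1 for a, b in zip(correct_1, correct_2) if a and not b)
    no_yes = sum(1 for a, b in zip(correct_1, correct_2) if not a and b)
    return yes_no, no_yes


def mcnemar(yes_no, no_yes):
    """Statistic of eq. 2.4.7 and its chi-square (1 degree of freedom) p-value."""
    statistic = (yes_no - no_yes) ** 2 / (yes_no + no_yes)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Table 2.1: whether each classifier is correct on instances 1 to 10.
classifier_1 = [True, False, False, False, True, True, True, False, True, True]
classifier_2 = [False, False, True, False, True, True, True, False, False, True]

yes_no, no_yes = contingency_counts(classifier_1, classifier_2)
statistic, p_value = mcnemar(yes_no, no_yes)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(yes_no, no_yes, statistic, p_value, decision)
```

For the dummy example the discordant counts are 2 and 1, so the statistic is small and the null hypothesis is not rejected.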


Chapter 3

Natural Language Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that deals with the interaction between computers and humans using natural language. The objective of NLP is to enable computers to read, decipher, understand, and process human languages, bringing computers closer to a human-level understanding of language. Computers do not have the same intuitive understanding of natural language that humans do: they cannot grasp what the language is actually trying to say.

Language is a form of communication between humans. One person emits a message, expressed as a particular combination of acoustic or graphic signs, to another person who shares some common-sense knowledge with the sender, which should enable the receiver to comprehend the message.

It was the philosopher Charles Morris who introduced the triplet "syntax-semantics-pragmatics" [66] and stated that the study of pragmatics encompasses the complete situation of an individual who speaks or hears. This includes the study of semantics, which concerns the relationship of expressions to their meaning. The study of syntax looks at the properties and structure of a language. The lexicon and morphology are sublevels of the syntactic dimension and concern the analysis of words and word formation (inflection, derivation and compounding).

Each dimension can create ambiguities, some of which can only be resolved by knowing the actual situation that the sentence describes. A complete NLP framework ought to be able to deal with all of these dimensions (to a limited degree) [67]. Some propose a cascaded architecture where the ambiguities of the lower levels are resolved successively by the subsequent levels [68].

Currently, in the field of linguistics, it is considered valuable to limit research to well-characterized sublanguages [69]. A sublanguage is a specialized language used by the different actors in a specialized field to convey specific messages. A specialized language exhibits a few attributes that separate it from the general language [70]. An obvious point of contrast is the vocabulary. In each sublanguage, ordinary words can be found with their standard meaning. However, many general language words can take a more restricted and specific meaning in the context of a sublanguage [69]. Other words have a general meaning alongside a more specialized one, of which only the latter is used in the context of the sublanguage. Finally, for every specialized domain, there exists a set of very specific vocabulary that is used almost exclusively in that domain [71]. An example of a general language partitioned into sublanguages, with their respective vocabulary, is depicted in figure 3.1.

Figure 3.1: Partition of a general language into specialized languages. (The General Language contains the Engineering Sublanguage: Thermodynamics, Mechanics, Control; the Healthcare Sublanguage: Emergency Medicine, Intensive Care, Triage; the Philosophy Sublanguage: Ethics, Metaphysics, Existentialism; and the Social media Sublanguage: Slang, Abbreviation.)

Computers are great at working with standardized and structured data, since they are able to process such data much faster than humans can. Humans, however, do not communicate in structured data; as already stated, they resort to language, a form of unstructured data. When programming computers, the developer is essentially giving the computer a set of rules by which it should operate. With unstructured data, these rules are quite abstract and challenging to define concretely, due to the ambiguities arising from the "syntax-semantics-pragmatics" triplet.

Humans have been recording text data over the years. Over that time, our minds have accumulated an enormous amount of experience in understanding natural language. When we read something, we comprehend what it really means because we know its context.

That said, recent advances in Machine Learning have enabled computers to perform a considerable number of valuable tasks with natural language [72], such as machine translation, speech recognition, speech synthesis, semantic understanding, and text summarization [73,74].

Given this overview of NLP, one may ask: what makes NLP difficult? The following list summarizes some answers to this question:

• Number of natural languages - each one has distinctive linguistic rules;

• Ambiguity - the meaning of a sentence or word depends on its context;

• Idiomatic expressions - each natural language has expressions that gain new connotative meanings beyond their literal meanings;

• Grammar - homonyms, homophones, homographs, paronyms, and synonyms;

• Negative structure - a sentence can be in the negative form.

The following subchapters provide a theoretical explanation of the techniques used to process the

text in order to deal with such complex data.

3.1 Regular Expressions

Formally, a Regular Expression (RE) is an algebraic notation for characterizing a set of strings, i.e., it can be seen as an arrangement of one or more character literals, operators, or constructs that define a search pattern. This strategy is typically used by string search algorithms in tasks such as 'find' or 'find and replace' on a string. Given a search pattern, the RE will scan through the corpus of texts, returning all textual instances that match the pattern. Taking this into account, one can say that this technique is very useful for extracting information from any text.

Like any language, RE has its own vocabulary and syntax that one must know. One of the operations within the domain of RE is concatenation. Given two RE A and B, they can be concatenated to form a new RE AB. In general, if a string s1 matches A and another string s2 matches B, the string s1s2 will match AB. Suppose that the document being analyzed is 'The production management exam was very difficult!'. The RE A = exam will match the string s1 = 'exam' and the RE B = difficult will match the string s2 = 'difficult'. If A and B are concatenated into AB, the resulting RE matches only the string s1s2 = 'examdifficult', which does not occur in this document, so no match is returned. This example shows one of the most common uses of RE, matching characters: most letters and characters simply match themselves. The exceptions are the so-called metacharacters, which do not match themselves. Rather, they signal that something out of the ordinary should be matched, or they affect other parts of the RE by repeating them or changing their meaning. Table 3.1 summarizes these characters and their respective roles.

In order to better understand how these metacharacters work, a set of examples within the healthcare scope is built on the data in table 3.2. Suppose that the managers of the healthcare facility want to obtain the information of patients less than 10 years old. One simple solution uses the . character: applied to the Age field, a search pattern consisting of a single . anchored to the whole field (e.g. ^.$) matches only one-character ages, yielding the last row of table 3.2. Another solution is to define a set of characters by means of the [ ] characters, as [0-9]; this pattern matches any single digit between 0 and 9, so it must likewise cover the whole Age field. If the objective is now to extract all rows where the patient temperature is 40 °C or higher, one can resort to the metacharacter ^, with the search pattern ^4, which returns a match for every temperature whose value starts with a 4.

If the healthcare managers want to know how the number of visits has been evolving over the years, it is necessary to extract the year from the Date field. One way to solve this task is to concatenate special sequences, introduced by the metacharacter \, for example \d\d\d\d, which matches any run of four digits, from 0000 to 9999. The same RE can be rewritten with the { } characters as \d{4}, which searches for a digit in the range 0-9 repeated four times. Suppose now that the goal is to create a new table with all patients that have non-urgent and standard priorities, coded as 5 and 4 respectively. A RE that extracts this information is 4|5, which matches if either the digit 4 or the digit 5 appears. If the healthcare managers need to extract the patients whose discriminators are either Head Injury or Chest Pain, one can use the * character and write the RE he*, which (ignoring case) matches any string containing an h, whether it is followed by no e, one e, or several. Alternative solutions use + or ? instead of *, which impose that the character e appears at least once or at most once, respectively.

In the previous set of examples the metacharacter \ was used in order to search for the year of a date. This was achieved using the \d special sequence. Table 3.3 summarizes the special sequences in the RE language with a short description. Despite the simplicity of the dummy example, it is possible to see the versatility and potential of RE.

Table 3.1: Regular Expression (RE) metacharacters description.

Metacharacters

Character    Description

.            Matches any character except a newline

^            Matches the start of the string

$            Matches the end of the string

*            Causes the resulting RE to match 0 or more repetitions of the preceding RE

+            Causes the resulting RE to match 1 or more repetitions of the preceding RE

?            Causes the resulting RE to match 0 or 1 repetitions of the preceding RE

{ }          Causes the resulting RE to match exactly the specified number of repetitions of the preceding RE

[ ]          Used to indicate a set of characters

\            Either escapes special characters, or signals a special sequence

|            Either-or operator (alternation)

( )          Captures and groups whatever RE is inside the parentheses

Table 3.2: Dummy Example for Regular Expressions.

ID    Temperature [°C]    Gender    Age    Date                Priority    Discriminator
1     37.5                M         37     21-03-2017 09:42    5           Dyspnea
2     42                  F         28     01-11-2018 15:22    3           Fever
3     37                  F         42     09-01-2019 08:09    5           Head Injury
4     38                  M         58     14-10-2018 04:17    4           Chest Pain
5     36.8                F         6      15-04-2019 04:17    5           Head Injury


Table 3.3: Special Sequences description.

Backslash Metacharacter

Sequence    Description

\A          Matches only at the start of the string

\B          Matches the empty string, but only when it is not at the beginning or end of a word

\b          Matches the empty string, but only at the beginning or end of a word

\D          Matches any non-digit character

\d          Matches any decimal digit

\S          Matches any non-whitespace character

\s          Matches any whitespace character

\W          Matches any non-alphanumeric character

\w          Matches any alphanumeric character and the underscore

\Z          Matches only at the end of the string
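The examples of this section can be reproduced with Python's re module, whose metacharacters and special sequences correspond to those in tables 3.1 and 3.3. The following is a minimal sketch, assuming that the rows of table 3.2 are held as dictionaries of strings; the variable names, and the use of re.fullmatch and re.IGNORECASE to anchor patterns and ignore case, are illustrative choices rather than the implementation used in this work.

```python
import re

# Rows of table 3.2, with every field kept as a string.
rows = [
    {"id": 1, "temperature": "37.5", "age": "37", "date": "21-03-2017 09:42",
     "priority": "5", "discriminator": "Dyspnea"},
    {"id": 2, "temperature": "42", "age": "28", "date": "01-11-2018 15:22",
     "priority": "3", "discriminator": "Fever"},
    {"id": 3, "temperature": "37", "age": "42", "date": "09-01-2019 08:09",
     "priority": "5", "discriminator": "Head Injury"},
    {"id": 4, "temperature": "38", "age": "58", "date": "14-10-2018 04:17",
     "priority": "4", "discriminator": "Chest Pain"},
    {"id": 5, "temperature": "36.8", "age": "6", "date": "15-04-2019 04:17",
     "priority": "5", "discriminator": "Head Injury"},
]

# Patients less than 10 years old: the whole Age field is a single digit.
under_10 = [r["id"] for r in rows if re.fullmatch(r"[0-9]", r["age"])]

# Temperatures whose value starts with a 4.
high_temperature = [r["id"] for r in rows if re.search(r"^4", r["temperature"])]

# Year of each visit: the first run of four digits in the Date field.
years = [re.search(r"\d{4}", r["date"]).group() for r in rows]

# Standard (4) or non-urgent (5) priorities.
low_priority = [r["id"] for r in rows if re.fullmatch(r"4|5", r["priority"])]

# Discriminators containing an h followed by at least one e, ignoring case.
head_or_chest = [
    r["id"] for r in rows
    if re.search(r"he+", r["discriminator"], re.IGNORECASE)
]

print(under_10, high_temperature, years, low_priority, head_or_chest)
```

Without re.IGNORECASE the last pattern would miss 'Head Injury', since its only h is uppercase.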

3.2 Tokenization

Tokenization can be described as the task of segmenting texts into tokens, the smallest units of information, for further processing. This can be seen as one of the first steps when pre-processing textual data. Tokenization strategies are not trivial to implement, since they depend on the language (or sublanguage), on the goal, and on the quality of the data [75,76].

First it is necessary to specify what is the smallest unit of information. Depending on the goal, this

can be anything from a character to compounds. In this work the smallest unit of information, tokens

from now on, is the word. Secondly, the way that the tokenizer is built is subject to the domain where it

is going to be applied.

In the Portuguese language one simple strategy to build a tokenizer is to consider all spaces as delimiters between tokens. Applying this tokenizer to the sentence 'Hoje o dia esta soalheiro' ('Today it is sunny'), the tokenizer will output ['Hoje','o','dia','esta','soalheiro']. Adding a full stop to the same example yields the sentence 'Hoje o dia esta soalheiro.', which for us humans is the same. Applying the same tokenizer will result in ['Hoje','o','dia','esta','soalheiro.']. Here lies a problem: the full stop is attached to the word 'soalheiro' ('sunny'). Humans can interpret 'soalheiro.' as being 'soalheiro', but for a computer these are two different tokens, resulting in a problem very common in NLP [77], known as data sparseness. Another approach is to assume that both spaces and punctuation are delimiters between tokens; the resulting tokenizer will output ['Hoje','o','dia','esta','soalheiro','.'], which is the desired result. The drawback of this tokenizer appears when dealing with the sentence 'Esqueci-me do guarda-chuva' ('I forgot the umbrella'), since it will output ['Esqueci','-','me','do','guarda','-','chuva'] instead of ['Esqueci-me','do','guarda-chuva'], and thus the original semantic meaning is lost.
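The two tokenizers discussed above, and one possible way around the hyphenation problem, can be sketched with regular expressions. The function names are hypothetical, and the third tokenizer, which keeps words joined by internal hyphens together, is an illustrative alternative rather than the strategy adopted in this work.

```python
import re


def whitespace_tokenize(text):
    """Treat spaces as the only delimiters between tokens."""
    return text.split()


def punctuation_tokenize(text):
    """Treat spaces and punctuation as delimiters; punctuation becomes a token."""
    return re.findall(r"\w+|[^\w\s]", text)


def hyphen_aware_tokenize(text):
    """As above, but keep words joined by internal hyphens as a single token."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


print(whitespace_tokenize("Hoje o dia esta soalheiro."))
print(punctuation_tokenize("Hoje o dia esta soalheiro."))
print(punctuation_tokenize("Esqueci-me do guarda-chuva"))
print(hyphen_aware_tokenize("Esqueci-me do guarda-chuva"))
```

The hyphen-aware pattern tries the hyphenated alternative first, so a hyphen only becomes a separate token when it is not surrounded by word characters.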

There are other types of tokenization methodologies that deal with the complications described above, resorting to approaches such as the Viterbi algorithm, Hidden Markov Models, and part-of-speech tagging [78–81]. Another difficulty during the tokenization process, when using these types of statistical methods, is dealing with misspellings. When doing part-of-speech tagging, the process of categorizing words based on their definition and their relationship with adjacent words, it is necessary to have words spelled correctly in order to obtain a word's part-of-speech [82]. The Viterbi algorithm is a dynamic programming algorithm whose goal is to find the most likely sequence of hidden states that explains a sequence of observations for a given stochastic model. One way to use it for tokenization is to compute the probabilities of the words in a sentence, based on their characters, and choose the most likely sequence of characters [83]. Computing each probability requires a word frequency list that best describes the (sub)language. This is far from trivial, since it is extremely complicated, not to say impossible, to obtain a corpus that covers all situations of the (sub)language. Even assuming that such a fully developed corpus exists, the problem associated with misspellings persists, since a misspelling will break the most likely sequence of characters, e.g. 'amnorreia' (the word 'amenorreia' ('amenorrhea') with a misspelling) may yield ['amnor','reia'], depending on the corpus used. One way to counter this is by smoothing strategies, already described in subsection (2.1.2).
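To make the idea concrete, the following is a simplified Viterbi-style sketch that segments an unspaced string into the most probable sequence of words under a unigram model. It is a stand-in for the approach cited above, not a reproduction of it: the word probabilities are invented, unknown substrings receive a fixed per-character penalty instead of a proper smoothing strategy, and the names (viterbi_segment, word_probs) are hypothetical.

```python
import math


def viterbi_segment(text, word_probs, max_word_len=20, unknown_prob=1e-10):
    """Most likely segmentation of text into words under a unigram model."""
    n = len(text)
    # best_logp[i] is the best log-probability of any segmentation of text[:i].
    best_logp = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)  # back[i] is the start of the last word ending at i
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_probs:
                logp = math.log(word_probs[word])
            else:
                # Crude penalty for unknown substrings, growing with their length.
                logp = len(word) * math.log(unknown_prob)
            score = best_logp[start] + logp
            if score > best_logp[end]:
                best_logp[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Invented unigram probabilities for a tiny vocabulary (illustration only).
word_probs = {"dor": 0.3, "de": 0.4, "cabeca": 0.2, "cabe": 0.05, "ca": 0.05}

print(viterbi_segment("dordecabeca", word_probs))
```

Because unknown substrings are heavily penalized, a misspelled word is likely to be split around whatever known fragments it contains, which mirrors the failure mode described above.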

In order to achieve good results with these techniques it is necessary to have a rich annotated corpus that best describes the (sub)language; however, to the best of the author's knowledge, there are no free and open-source resources for the Portuguese language focused on the healthcare domain.

3.3 Word Similarity

One of the problems that concerns a NLP system, as stated before, is the detection of misspellings. Most typographical errors, typos from now on, can be described by delete, transpose, replace, or insert operations on a small number of characters. A special form of this kind of typo is the atomic typo. An atomic typo is a misspelling where the misspelled word is itself an existing word, e.g., "I did not understand the massage you sent me", where "massage" is a typo of "message". In order to deal with such typos it is necessary to resort to advanced techniques that take into account the context of the word. Another kind of typo involves phonetically similar words, e.g. "If the resultant force on an object is zero, a stationery object stays stationery", where "stationery" is a typo for "stationary".

A word can be compared to another by means of its orthographic form or its phonetic value, in

this work only the orthographic form was considered. When comparing two strings one can partition

the string similarity functions into two groups. A group devoted to compare strings by measuring their

distance, where a smaller distance indicates similar strings and another group that calculates a ratio,

usually ranging from 0 to 1, where 0 indicates no similarity between the strings. This last group of

similarity functions is known as approximate string matching. Immense work is devoted to spellchecking

and a panoply of methods exist, such as dictionary-based spelling correction, N-gram methods, word

frequencies, edit distance, phonetic algorithms, among others [84–86].

In the present work approximate string matching was used. An efficient and conceptually simple method for measuring string similarity is the Jaro–Winkler distance, given by equation 3.3.1c. The Jaro–Winkler distance is an extension of the Jaro similarity, given by equation 3.3.1a, where m is the number of matching characters between strings s1 and s2, t is half the number of transposition operations, and |s1|, |s2| are the lengths of strings s1 and s2, respectively. It scores the number of common characters in the correct order and assumes that differences near the start of the string are more significant than differences near the end of the string [87,88]. This is expressed in equation 3.3.1b, where l is the length of the common prefix, up to a maximum of four characters, and p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes; empirical works recommend a value of p = 0.1 [89]. The distance ranges from 0 to 1, and a value of 0 means that the strings are identical. This string similarity function was chosen due to its computational speed and scalability, because it usually performs well in short string comparisons, and because it is extensively used for duplicate detection in the record linkage area [90,91].

As an example, let s1 = jon and s2 = john. The matching characters between these strings are ’jon’, yielding m = 3. No transpositions are needed since the matching characters are already in the same order, giving t = 0, and the lengths of the strings are |s1| = 3 and |s2| = 4. Replacing these values in equation 3.3.1a yields sim_j(s1, s2) = 0.917. The longest common prefix is ’jo’, resulting in l = 2. Replacing the values obtained for sim_j and l, with p = 0.1, in equation 3.3.1b gives a similarity of sim_jw(s1, s2) = 0.933 and hence a Jaro–Winkler distance of d_jw(s1, s2) = 0.067, meaning that the strings are very similar.

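\[
\mathrm{sim}_{j}(s_1, s_2) =
\begin{cases}
0, & \text{if } m = 0 \\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right), & \text{otherwise}
\end{cases}
\tag{3.3.1a}
\]

\[
\mathrm{sim}_{jw}(s_1, s_2) = \mathrm{sim}_{j}(s_1, s_2) + l\,p\,\bigl(1 - \mathrm{sim}_{j}(s_1, s_2)\bigr)
\tag{3.3.1b}
\]

\[
d_{jw}(s_1, s_2) = 1 - \mathrm{sim}_{jw}(s_1, s_2)
\tag{3.3.1c}
\]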
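Equations 3.3.1a to 3.3.1c can be sketched in a few lines of Python. The sketch assumes the usual definition of a "matching" character, namely equal characters no further apart than half the length of the longer string minus one; this window is not stated in the text above and is taken here as an assumption. The function names are illustrative.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity (equation 3.3.1a)."""
    if not s1 or not s2:
        return 0.0
    # Assumed matching window: characters may match only within this offset.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that appear in a different order in s2.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # t is half the number of transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler_distance(s1, s2, p=0.1):
    """Jaro-Winkler distance (equations 3.3.1b and 3.3.1c)."""
    sim_j = jaro_similarity(s1, s2)
    prefix = 0  # l: length of the common prefix, capped at four characters
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    sim_jw = sim_j + prefix * p * (1 - sim_j)
    return 1 - sim_jw


print(round(jaro_similarity("jon", "john"), 3))        # 0.917
print(round(jaro_winkler_distance("jon", "john"), 3))  # 0.067
```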

3.4 Text Normalization

Imagine that you are presented with these strings: ’12-2-2019’, ’12/2/2019’, ’12 February 2019’, ’12fev19’. The reader easily detects that they all represent the same date. But how does a computer interpret these strings? For a computer, each string is different from the others, despite the meaning being the same. Text normalization can be seen as the task of reducing all textual data into a canonical form in order to reduce data sparsity.

Usually, one of the first steps in text normalization is to lowercase, or uppercase, the textual data. Depending on the application, this might not be a good idea, since ’US’ (United States) becomes ’us’ after lowercasing and some form of disambiguation needs to exist in the NLP system in order to know if ’us’ represents the noun or the pronoun. Another aspect to take into consideration is abbreviations, since many abbreviations may represent the same thing, e.g., the abbreviations ’msd’, ’msupd’, ’membsupdir’, ’msdireito’ have the same meaning and stand for ’membro superior direito’ (’right upper limb’). Other textual characteristics that need to be handled are the ones related to temporal representation. Extracting temporal information is a challenging task and immense work has been dedicated to it [92]. Suppose that the sentence ’I had paracetamol one hour ago’ is given as chief main complaint to the healthcare provider during triage. Resorting only to this information, the healthcare provider can infer that the patient is complaining about a fever or some sort of pain and knows that the patient cannot take more substances which contain paracetamol. Temporal expressions are important since they enrich data with more information regarding events and the time or day they happened, with a certain degree of uncertainty.

Suppose you are provided with a text document that contains the following sentence: ’Ana studies a lot. She has been studying for seven days.’. The reader directly understands that Ana dedicates her time to study, given the tokens ’studies’ and ’studying’ and their context. It is normal to encounter different forms of a word in a document, since grammatical rules exist and some words are derivationally related to other words with similar meanings, e.g., democracy, democratic, and democratization. One way to reduce the level of sparsity is to reduce the inflectional forms, and at times the derivationally related forms, of a word to a common base form, which can be accomplished by using a stemmer. Stemming techniques work by removing the end or the start of the word, by means of a crude heuristic process, considering a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting is successful on some occasions, but not always. To the best of the author’s knowledge, only a few stemmers exist for the Portuguese language, namely the Porter and the RLSP stemmers. In this work the RLSP stemmer was used since it makes fewer understemming and overstemming errors than the Portuguese version of Porter’s algorithm [93].

The RLSP algorithm, illustrated in figure 3.2, is composed of eight steps, each with its own set of rules. The first step, plural reduction, has eleven stemming rules, since not all plural forms in Portuguese end in ’-s’ and not all words ending in ’-s’ correspond to a plural, e.g. ’óculos’ (’glasses’). The feminine reduction step is composed of fifteen stemming rules and consists in transforming words in the feminine form to their corresponding masculine form. This step only considers words ending in ’-a’. The third step, adverb reduction, is the simplest one since it only deals with adverbs that have the suffix ’-mente’, although not every word with this suffix is an adverb, and when two or more such adverbs are used in a sentence only the last one carries the suffix. E.g., in ’O João apresentou-se pobre, triste e humildemente, por causa do que tinha acontecido.’ (’John showed himself poor, sad and humble because of what had happened.’), ’pobre’ and ’triste’ correspond to adverbs. The augmentative reduction step comprises 23 rules and is responsible for removing the suffixes of nouns and adjectives in the augmentative, diminutive, or superlative form. The fifth step, noun reduction, tests words against 84 noun endings; if a suffix is removed, the stemming process is finished and the remaining steps are not executed. The verb reduction step analyzes the form of the verb: a regular verb has over fifty forms, each with its own suffix. Given the number of possibilities, the verbal forms are reduced to their root form using 101 rules. The vowel removal step consists in removing the last vowel of the word. Lastly, the accents removal step, as the name states, consists in removing accents, for which there are eleven stemming rules. As an example, consider the following words: ’medicado’, ’medicada’ (both stand for ’medicated’), ’medicamento’ (’medicine’), ’medicamentos’ (’medicines’), ’medicação’ (’medication’), and ’medicações’ (’medications’). These are six different words, but after applying the RLSP Stemmer they all become ’medic’, showing the importance of this task and its ability to reduce sparsity.

3.5 N-grams

N-grams can be briefly defined as contiguous sequences of items taken from textual data. The items can be characters, words, phonemes, syllables, etc. Since the smallest unit of information in this work is the word, an N-gram here always refers to a sequence of N contiguous words. Therefore, a unigram is a sequence of one word, a bigram is a sequence of two words, a trigram is a sequence of three words, and so on. N-grams are widely applied in language identification, spelling error detection and correction, query expansion, information retrieval with serial, inverted and signature files, dictionary look-up, missing phoneme guessing, machine translation, spam filtering, topic spotting, and text compression [94–96].

As an example, consider the sentence ’Once upon a time’. With a unigram model, the result is [’Once’, ’upon’, ’a’, ’time’]. With a bigram model, the result is [’Once upon’, ’upon a’, ’a time’], and with trigrams it is [’Once upon a’, ’upon a time’]. Different N-gram models can also be combined, for example by considering all N-grams with N ranging from 1 to 3, which outputs [’Once’, ’upon’, ’a’, ’time’, ’Once upon’, ’upon a’, ’a time’, ’Once upon a’, ’upon a time’].
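The example above can be reproduced with a short sketch. Whitespace splitting stands in for the tokenizer here, which is a simplifying assumption, and the function names are illustrative.

```python
def ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_range(tokens, n_min, n_max):
    """Combine all n-grams with n_min <= n <= n_max, shortest first."""
    combined = []
    for n in range(n_min, n_max + 1):
        combined.extend(ngrams(tokens, n))
    return combined


tokens = "Once upon a time".split()
print(ngrams(tokens, 2))          # ['Once upon', 'upon a', 'a time']
print(ngram_range(tokens, 1, 3))
```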

3.6 Feature Extraction

In order to use textual data in a ML algorithm, it is necessary to have some sort of numerical repre-

sentation of it. The objective of this section is to provide the reader with a theoretical overview of two

techniques used to extract knowledge from textual data.

3.6.1 Bag of Words

Bag of Words (BoW) is an approach in NLP that represents a document as the multiset of N-grams that appear in it. This creates a simplified vector representation of the text, where the (frequency of) occurrence of each N-gram is later used as a feature for an ML model. In this simple model, the syntax and even the order of words are discarded; the representation only tells whether a word is present, or not, in a document.

Figure 3.2: Flowchart for the RLSP Stemmer. Adapted from [93].

Let’s assume that the set of text documents is given by [s1, s2] where s1 =’John likes to program

Vanessa also likes to program’ and s2 =’John also likes to go to concerts’. Using the unigram model,

the list of unique words, also described as the vocabulary, is given by [’John’, ’likes’, ’to’, ’program’,

’Vanessa’, ’also’, ’go’, ’concerts’]. Assuming the same ordering, the vectorized documents are given by

table 3.4.

Table 3.4: Bag of Words representation for the dummy example.

      ’John’  ’likes’  ’to’  ’program’  ’Vanessa’  ’also’  ’go’  ’concerts’
s1    1       2        2     2          1          1       0     0
s2    1       1        2     0          0          1       1     1

As can be noted, the vectorized representation of each document does not preserve the original word order, losing its syntactic and semantic meaning, and even in this simple example the zero entries are noticeable. As the size of the vocabulary increases, the sparseness of the feature vectors also increases. Similarly, the BoW model can consist of the count frequencies of bigrams, trigrams, combinations of unigrams and bigrams, etc.
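The count vectors of table 3.4 can be obtained with a minimal sketch such as the one below. It assumes whitespace tokenization and a vocabulary ordered by first appearance, which matches the ordering used in the example; the function name is illustrative.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the vocabulary (in order of first appearance) and one count vector per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


s1 = "John likes to program Vanessa also likes to program"
s2 = "John also likes to go to concerts"
vocabulary, vectors = bag_of_words([s1, s2])
print(vocabulary)
print(vectors)
```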

3.6.2 Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (tf-idf) is a class of techniques to represent a document in a vectorized form, and it is useful in identifying the signature N-grams of a document. The Term Frequency, expressed in equation 3.6.1, is related to the output of the BoW model since it is defined as how frequently the N-gram appears in the document, measuring its local importance.

\[
\mathrm{tf}(N\text{-gram}) = \frac{\text{Number of times the } N\text{-gram appears in the document}}{\text{Number of } N\text{-grams in the document}}
\tag{3.6.1}
\]

The inverse document frequency, expressed in equation 3.6.2, is the key factor in identifying the

signature N-gram. It is based on the fact that less frequent N-grams are more informative and important.

For an N-gram to be considered a signature N-gram of a document, it should not appear that often in

the other documents. Thus, a signature N-gram’s document frequency must be low, meaning its inverse

document frequency must be high.

\[
\mathrm{idf}(N\text{-gram}) = \log_{10}\left(\frac{\text{Number of documents}}{\text{Number of documents containing the } N\text{-gram}}\right)
\tag{3.6.2}
\]

The tf-idf is the product of these two quantities. For an N-gram to have a high tf-idf in a document, it must appear several times in that document and be absent from the other documents, thus being a signature N-gram of the document.

Resuming the previous example, the term frequency for each document is given in table 3.5, as expected. Computing the idf terms yields the results shown in table 3.6. With the values of both elements, the tf-idf weights are obtained and presented in table 3.7. From these we can conclude that the unigrams that best describe the first sentence are ’program’ and ’Vanessa’. For the second sentence, there is a tie for the signature unigram between ’go’ and ’concerts’. Another consideration concerns the sparsity level of the feature vectors: the vectors generated by either the BoW or the tf-idf technique are sparse.

Table 3.5: Term Frequency of each unigram to its corresponding sentence for the dummy example.

      ’John’  ’likes’  ’to’   ’program’  ’Vanessa’  ’also’  ’go’   ’concerts’
s1    0.11    0.22     0.22   0.22       0.11       0.11    0      0
s2    0.14    0.14     0.29   0          0          0.14    0.14   0.14

Table 3.6: Inverse Document Frequency (idf) of each unigram for the dummy example.

      ’John’  ’likes’  ’to’   ’program’  ’Vanessa’  ’also’  ’go’   ’concerts’
idf   0       0        0      0.3        0.3        0       0.3    0.3

Table 3.7: Term Frequency - Inverse Document Frequency of each unigram to its corresponding sentence for the dummy example.

      ’John’  ’likes’  ’to’   ’program’  ’Vanessa’  ’also’  ’go’    ’concerts’
s1    0       0        0      0.066      0.033      0       0       0
s2    0       0        0      0          0          0       0.042   0.042
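Equations 3.6.1 and 3.6.2 and the weights of tables 3.5 to 3.7 can be checked with the sketch below. It assumes whitespace tokenization and unigrams only, and the function name is illustrative. Small differences in the last digit with respect to the tables may arise from rounding.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return the idf of each unigram and, per document, its tf-idf weights."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    idf = {}
    for word in vocabulary:
        containing = sum(1 for tokens in tokenized if word in tokens)
        idf[word] = math.log10(n_docs / containing)       # equation 3.6.2
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({word: counts[word] / len(tokens) * idf[word]  # tf (3.6.1) times idf
                        for word in vocabulary})
    return idf, weights


s1 = "John likes to program Vanessa also likes to program"
s2 = "John also likes to go to concerts"
idf, weights = tf_idf([s1, s2])
for doc_weights in weights:
    print({word: round(value, 3) for word, value in doc_weights.items() if value > 0})
```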


Chapter 4

Methodology

The purpose of chapters 2 and 3 was to give some insight into the methods, from a theoretical point of view, and into their importance. The goal of the present chapter is to explain to the reader how these methods were used and how they are connected. An overview of the methodology is presented in figure 4.1.

Figure 4.1: Overview of the Methodology steps. (Flowchart blocks: Raw Data, Preprocessing Module, data quality check, Sampling, Training Dataset, Testing Dataset, Numerical Data, Textual Data, Scaler, Vectorization, Vectorization Hyperparameters, Machine Learning Algorithm, Algorithm Hyperparameters, Hypothesis Space, Hyperparameter Optimization, Model Evaluation, Final Model.)

4.1 Data Preprocessing

In this work, focus was given to textual data preprocessing. However, a preprocessed dataset with numerical data was also used. The numerical data consists of vital signs and demographic information collected during triage. In this dataset, outliers were treated based on the vital signs ranges [97–99], and missing data was imputed with the mean value of the triaged population for each variable.

The textual data consists of the main chief complaint, which notes the reason the patient was seen. This clinical note is written by health care professionals to communicate the status of a single patient to other health care professionals or to themselves. Because of the particularities of clinical notes, a few issues were detected that must be tackled in order to improve data quality:

• The clinical notes do not have the structure of a typical written text, lacking syntactic and semantic structure;

• Each clinical note is given by short sentences;

• The notes have several technical and non-technical abbreviations. E.g. ’msd’ meaning ’membro superior direito’ (’right upper limb’) and ’2af’ meaning ’segunda-feira’ (’Monday’);

• These notes are filled with specialized technical terms;

• Context is very important for the meaning of a word;

• There are many numerical values with respect to vital signs;

• There are several ways of representing the same information;

• There are many misspelled words;

• There is a huge amount of joined words. E.g., ’textualdata’.

Given the amount of noise sources, it is extremely important to develop a Natural Language Pro-

cessing (NLP) system capable of filtering the noise so that data quality improves leading to a better

data mining process. The flowchart of the chosen sequence of preprocessing steps is depicted in figure

4.2, where the colour purple represents the initial and final data, the decision blocks are in grey, the red

corresponds to the tokenization process, in yellow is depicted the word correction step and the blocks in

blue illustrate the different text normalization steps.

Figure 4.2: Overview of the Natural Language Processing framework used to improve the quality of the main chief complaint data. (Flowchart blocks: Raw Text, Lowercase, Temporal References Normalization, Tokenization, Abbreviation Expansion, Word Correction, Word Removal, Stemming, Processed Text.)

4.1.1 Text Normalization

As explained in section 3.4, this step is of great importance in order to reduce sparseness in data, since reducing data sparsity results in less complex models.

The normalization process starts with lowercasing all main chief complaints, since lowercasing does not affect the healthcare sublanguage. The lowercasing step is followed by the normalization of temporal references. This step consists of detecting dates and references to months, weeks, days, hours, minutes and seconds. Starting with the normalization of dates, there are some aspects to be considered:

• A date can be given by the day, month and year;

• A date can be given by the day and month;

• A date can be given by the month and year;

• A date can have spaces between its components and the respective delimiter. E.g., ’02 -mar-17’ instead of ’02-mar-17’;

• The day can have a ’0’ before the day number (only when the day number has one digit);

• The month can be represented by its month number (month numbers with only one digit occasionally have a preceding ’0’), its month name, or an abbreviation of the month name. E.g., ’8’, ’08’, ’august’, ’aug’;

• The year can be represented by either a four-digit or a two-digit number. E.g., ’2019’, ’19’.

Given the existing variability in the representation of a simple date, some assumptions were made:

• A healthcare provider always uses the same delimiter within the same date. E.g., ’14-5-2017’, ’14/5/2017’, ’14.5.2017’;

• A healthcare provider may change the type of delimiter between dates during the typing process. E.g., ’patient operated on clavicle on 12-3-2015 and refers removing tonsils on 4.7.17’;

• The date follows the format dd/mm/yyyy (the ’/’ delimiter is used just as an example);

• Only years from ’2000’ to ’2016’ are considered;

• Only the ’-’, ’/’ and ’.’ characters are considered as delimiters.

Date matches were found using regular expressions (RE). For simplicity, only the RE developed for the ’-’ delimiter is illustrated, since the other cases are obtained by replacing it with one of the other delimiters. An overview of the different steps when normalizing dates is presented in figure 4.3.

Figure 4.3: Normalization sequence steps when dealing with dates. (Flowchart blocks: Chief Main Complaint, complete date match check, Complete Date Normalization, month-year date match check, Month Year Date Normalization, day-month date match check, Day Month Date Normalization.)

Given that spaces may be present in a date, it is necessary to handle them, since the number of possible combinations between date components is high, e.g., ’02- 02-2015’, ’02- 02 -2015’, ’02- 02- 2015’, ’02- 02 - 2015’, etc. To deal with this characteristic when evaluating the existence of a date in a string, the RE RE_spacing = (?:(\s)*) was developed. The framework starts with the normalization of complete dates, where the RE used is given by:

RE_Day RE_spacing \- RE_spacing RE_Month RE_spacing \- RE_spacing RE_Year

The expressions for RE_Day, RE_Month and RE_Year are provided in table 4.1. The RE developed for the days is a non-capturing group, characterized by ’?:’, with three alternatives. The first one, ’0?[1-9]’, matches a single digit between 1 and 9, ’[1-9]’, whether or not a zero precedes it, ’0?’. The second alternative, ’[12][0-9]’, matches days that start with a one or a two followed by any digit in the range of 0 to 9, i.e., every day between 10 and 29. The third alternative, ’[3][01]’, matches every day that starts with a three and is followed by a zero or a one, i.e., days 30 and 31. The same strategy was used to deal with the months and years. The RE for the months is also a non-capturing group, with 27 alternatives: the first two, ’0?[1-9]’ and ’1[0-2]’, capture months given by their numerical value, and the remaining 25 alternatives deal with months expressed by either their month name or an abbreviation of it. The RE for the years is also a non-capturing group, with four alternatives: the first two, ’200[0-9]’ and ’201[0-6]’, deal with years represented as a four-digit number, and the remaining two, ’1[0-6]’ and ’0?[1-9]’, deal with years given by two digits.

Table 4.1: Regular Expressions (RE) used to capture different date components when the date is given by a day, a month and a year.

Component   Regular Expression (RE)
RE_Day      ’(?:0?[1-9]|[12][0-9]|[3][01])’
RE_Month    ’(?:0?[1-9]|1[0-2]|janeiro|jan|fevereiro|fev|marco|mar|abril|maio|junho|julho|jun|jul|julh|agosto|ago|setembro|set|setem|outubro|out|novembro|nov|dezembro|dez)’
RE_Year     ’(?:200[0-9]|201[0-6]|1[0-6]|0?[1-9])’
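The complete-date expression can be illustrated with Python's `re` module. The sketch below is a simplified assumption rather than the exact implementation used in this work: named capturing groups replace the non-capturing ones so that the components can be read back, month abbreviations are omitted for brevity, two-digit years are expanded by adding 2000, and the helper names are hypothetical.

```python
import re

MONTH_NAMES = ["janeiro", "fevereiro", "marco", "abril", "maio", "junho",
               "julho", "agosto", "setembro", "outubro", "novembro", "dezembro"]

RE_SPACING = r"(?:\s*)"
RE_DAY = r"(?P<day>0?[1-9]|[12][0-9]|3[01])"
# Abbreviations of month names are left out of this sketch.
RE_MONTH = r"(?P<month>0?[1-9]|1[0-2]|" + "|".join(MONTH_NAMES) + r")"
RE_YEAR = r"(?P<year>200[0-9]|201[0-6]|1[0-6]|0?[1-9])"


def complete_date_pattern(delimiter="-"):
    """Day, month and year separated by the same delimiter, with optional spaces."""
    sep = RE_SPACING + re.escape(delimiter) + RE_SPACING
    return re.compile(RE_DAY + sep + RE_MONTH + sep + RE_YEAR)


def normalize_complete_dates(text, delimiter="-"):
    """Rewrite complete dates as 'dd' + month name + 'yyyy', surrounded by spaces."""
    def replace(match):
        month = match.group("month")
        if month.isdigit():
            month = MONTH_NAMES[int(month) - 1]
        year = match.group("year")
        if len(year) <= 2:
            year = str(2000 + int(year))  # assumed expansion of two-digit years
        return f" {match.group('day')}{month}{year} "
    return complete_date_pattern(delimiter).sub(replace, text)


print(normalize_complete_dates("today25- 4-2016is the day"))
```

Surrounding the normalized date with spaces also detaches it from neighbouring substrings that were joined to it, as in ’today25’.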

In the second step, incomplete dates given only by a month and a year are analyzed. In order to avoid wrong normalizations due to the overlap between the two types of incomplete dates, i.e., strings given by any number from 1 to 12, a delimiter, and any number between 10 and 12, the normalization process only occurs if at least one of the following scenarios happens:

• Scenario 1 - The month is given by its month name, or a corresponding abbreviation, followed by a delimiter and a number with either two digits, from 01 to 16, or four digits, from 2000 to 2016;

• Scenario 2 - The year is given either by a four-digit number, from 2000 to 2017, or by a two-digit number between 13 and 16, and it is preceded by either the month number, between 1 and 12, or the month name, or a corresponding abbreviation.

When dealing with this type of incomplete date, the RE is given by:

RE_Month RE_spacing \- RE_spacing RE_Year

The expressions for RE_Month and RE_Year are illustrated in table 4.2 for each scenario.

Let’s assume that the following sentence is being processed: ’today25- 4-2016is the day of Ana’s birthday. 25-4 is also the liberty day of Portugal’. As already mentioned, the first date is complete, has spaces between its elements, and some of those elements are joined with other substrings. The benefit of the developed RE is that it is still possible to match the date, due to the non-capturing groups strategy. Firstly, after joining the date delimiters, the string becomes ’today25-4-2016is the day of Ana’s birthday. 25-4 is also the liberty day of Portugal’. Secondly, the string proceeds to the second step, since there is a match, and the outputted string is ’today 25abril2016 is the day of Ana’s birthday. 25-4 is also the liberty day of Portugal’. Thirdly, the string fails to comply with both scenarios, so it is considered not to contain any incomplete date given by a month and a year. Finally, the string contains an incomplete date whose numerical components are a day and a month, since situation 2 holds. The output of this step is ’today 25abril2016 is the day of Ana’s birthday. 25abril is also the liberty day of Portugal’.

Table 4.4: Outputted normalized date format for each normalization step.

Date type                      Normalized Date Structure
Complete Date                  ddmonth nameyyyy
Incomplete Date - Month/Year   month nameyyyy
Incomplete Date - Day/Month    ddmonth name

After dealing with dates, the second submodule of the temporal references normalization step is executed. The goal at this stage is to normalize all references to months, weeks, days, hours, minutes and seconds. Detecting these components in a string is important because of what is associated with them, namely an event. In the healthcare sublanguage, the event can be anything, such as an operation, the duration of a convulsion, or the time when the medication was taken. There are some aspects that need to be taken into account:

• References to hours may be given only by the hour or by the hour and minute;

• Hours and minutes have different types of delimiters;

• Hours and minutes may have spaces between components;

• These temporal references may be expressed by abbreviations or words with typographical errors;

• These references may depict time intervals or the exact moment of the event. E.g., ’took the medicine at 11:00 a.m.’ and ’took the medicine 1 hour ago’ have different meanings.

Since the vast majority of these references involve hours and minutes, only these are described here; the remaining types of references follow the same approach, which consists of the several RE listed in table 4.5. The first step of the framework is the detection of references depicting time intervals. To disambiguate between this type of reference and those that represent the moment of the event, non-capturing groups were used containing some of the most commonly used words in this scenario, along with some abbreviations and misspellings of those words. Starting with references containing both hour and minute components, the RE can be summarized as the concatenation of the following RE:

RE_Disambiguation RE_spacing RE_Hours RE_spacing RE_HoursDelimiter RE_spacing RE_Minutes RE_spacing RE_MinutesDelimiter

RE_Disambiguation is responsible for detecting the most common words used in the studied sublanguage when the temporal reference describes a time interval, where ’+’ imposes that a match must appear at least once. RE_spacing deals with whitespace between the components, RE_Hours detects hours in the range [0, 24], RE_Minutes detects the minutes component from 0 to 59, RE_HoursDelimiter detects the most frequent delimiters when time is given by hours and minutes, and must match at least once, and RE_MinutesDelimiter detects the references to minutes, although it is not required to appear in the string since it is sometimes omitted, e.g., ’08h20’.

After the references given by hours and minutes have been normalized, the normalization of the remaining temporal references is performed. The general RE is given by:

RE_Disambiguation RE_spacing RE_i RE_spacing RE_iDelimiter,  i ∈ {Months, Weeks, Days, Hours, Minutes, Seconds}

Table 4.5: Regular Expressions (RE) used to match temporal references given by months, weeks, days, hours, minutes and seconds.

Component             Regular Expression (RE)
RE_Hours              (?:2[0-4]|1[0-9]|0?[0-9])
RE_Minutes            (?:[1-5][0-9]|0?[0-9])
RE_Seconds            (?:[1-5][0-9]|0?[0-9])
RE_Days               (?:3[01]|[1-2][0-9]|0?[0-9])
RE_Weeks              (?:2[0-4]|1[0-9]|0?[0-9])
RE_Months             (?:1[0-2]|0?[0-9])
RE_HoursDelimiter     (?:(:|h|h:|hs|hh|hora|horas|he|horae|horase)+)
RE_MinutesDelimiter   (?:(m|min|minuto|minutos)?)
RE_SecondsDelimiter   (?:(segundo|segundos)?)
RE_DaysDelimiter      (?:(d|dia|dias)?)
RE_WeeksDelimiter     (?:(semanas|semana|sema|sem)?)
RE_MonthsDelimiter    (?:(meses|mes|mes)?)
RE_Disambiguation     (?:(ha|ha|ha|a|a|a)+)
RE_spacing            (?:(\s)*)
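As an illustration of how the expressions of table 4.5 can be concatenated, the sketch below matches a moment given by hours and minutes, i.e. the case without the disambiguation term. Named groups are an assumption added here to read back the components, and `extract_moment` is a hypothetical helper name.

```python
import re

RE_SPACING = r"(?:\s*)"
RE_HOURS = r"(?P<hours>2[0-4]|1[0-9]|0?[0-9])"
RE_MINUTES = r"(?P<minutes>[1-5][0-9]|0?[0-9])"
RE_HOURS_DELIMITER = r"(?:(?::|h|h:|hs|hh|hora|horas|he|horae|horase)+)"
RE_MINUTES_DELIMITER = r"(?:(?:m|min|minuto|minutos)?)"

# Concatenation of the components, in the order described in the text.
MOMENT = re.compile(
    RE_HOURS + RE_SPACING + RE_HOURS_DELIMITER + RE_SPACING
    + RE_MINUTES + RE_SPACING + RE_MINUTES_DELIMITER
)


def extract_moment(text):
    """Return (hour, minute) of the first time reference found, or None."""
    match = MOMENT.search(text)
    if match is None:
        return None
    return int(match.group("hours")), int(match.group("minutes"))


print(extract_moment("tomou o medicamento as 11:00"))  # (11, 0)
print(extract_moment("08h20"))                         # (8, 20)
```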

Afterwards the goal is to normalize temporal references that describe the moment when the event occurred. The only difference between the RE developed for this scenario and the previous one is that the term RE_Disambiguation is not present.

So far, it was explained how to detect whether a string has a substring that matches a temporal reference. The normalization process consists of obtaining the time of the event, named t_Event, and proceeds differently depending on whether the information is given as a time interval or as the moment of the event. If a time interval, ∆t, is present in the extracted information, it is subtracted from the moment when the main chief complaint was created, t_Creation, as expressed in equation 4.1.1. From the resulting time, the hour of occurrence of the event is extracted and mapped to the day period it belongs to. If the moment of the event is present in the extracted information, its hour is used directly to obtain the day period when the event occurred. The partition into day periods is presented in table 4.6. Suppose that the sentence being processed is ’tomou o medicamento as 11:00’ (’took the medicine at 11:00’). This sentence fails all matches of temporal references depicting time intervals. During the next step it matches the first RE, since this reference has both hour and minute as elements. The outputted normalized sentence is thus ’tomou o medicamento as manha’ (’took the medicine at morning’). Suppose now that the sentence is modified to ’tomou o medicamento ha 1h atras’ (’took the medicine 1h ago’) and that this main chief complaint was created on ’25-4-2017 12:00’. This sentence matches during the detection of temporal references describing time intervals, and the time at which the event happened is calculated as 12 − 1 = 11, an hour within the morning range. The sentence does not match the second set of RE since it was already treated. The outputted sentence is thus ’tomou o medicamento manha’ (’took the medicine morning’).

Table 4.6: Partition of the day into six-hour day periods.

Day Period   Range of Hours
Dawn         00:00 - 05:59
Morning      06:00 - 11:59
Evening      12:00 - 17:59
Night        18:00 - 23:59

\[
t_{Event} = t_{Creation} - \Delta t
\tag{4.1.1}
\]
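Equation 4.1.1 and the partition of table 4.6 can be sketched with Python's `datetime` module. The English labels are taken from table 4.6, while the function names and the way the two cases are passed in are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Start hour of each six-hour period, as in table 4.6.
DAY_PERIODS = [(0, "Dawn"), (6, "Morning"), (12, "Evening"), (18, "Night")]


def day_period(hour):
    """Map an hour in [0, 23] to its day period."""
    label = DAY_PERIODS[0][1]
    for start, name in DAY_PERIODS:
        if hour >= start:
            label = name
    return label


def event_period(t_creation, delta=None, event_hour=None):
    """Day period of the event, from a time interval or from the moment of the event."""
    if delta is not None:
        t_event = t_creation - delta  # equation 4.1.1
        return day_period(t_event.hour)
    return day_period(event_hour)


created = datetime(2017, 4, 25, 12, 0)
print(event_period(created, delta=timedelta(hours=1)))  # Morning
print(event_period(created, event_hour=11))             # Morning
```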

The next step of the normalization process is abbreviation expansion. This step takes place after tokenization has already been done. In order to identify the abbreviations present in the data, the whole vocabulary was analyzed manually, since:

• The majority of the abbreviations are very technical. E.g., ’tce’ stands for ’traumatismo cranioencefálico’ (’cranioencephalic traumatism’);

• Different abbreviations have the same meaning. E.g., ’dro’ and ’dro’ stand for ’doutor’ (’doctor’);

• The same abbreviation may have different meanings depending on its context. E.g., ’dm2’ can stand for ’diabetes mellitus tipo 2’ (’type 2 diabetes mellitus’) or ’decímetro quadrado’ (’square decimeter’).

To solve these problems, extensive research was done to understand the meaning of those abbreviations, and the context was studied to detect patterns in the different main chief complaints that allow rules to be generated to overcome the ambiguity. These rules work by analyzing the neighbouring tokens. E.g., regarding the ’dm2’ abbreviation: if the previous token is of type integer or float then ’dm2’ stands for ’decímetro quadrado’. Otherwise it stands for ’diabetes mellitus tipo 2’.
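The neighbour-based rule for ’dm2’ can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the number pattern (digits with an optional decimal part), and the sample sentences are hypothetical and only mirror the rule described in the text.

```python
import re

# A token counts as integer or float if it is made of digits with an
# optional decimal part (assumed pattern, accepts '.' or ',').
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def expand_dm2(tokens):
    """Expand 'dm2' according to the type of the previous token."""
    expanded = []
    for position, token in enumerate(tokens):
        if token != "dm2":
            expanded.append(token)
            continue
        previous = tokens[position - 1] if position > 0 else ""
        if NUMBER.fullmatch(previous):
            meaning = "decímetro quadrado"
        else:
            meaning = "diabetes mellitus tipo 2"
        expanded.extend(meaning.split())
    return expanded


print(expand_dm2(["apresenta", "dm2"]))
print(expand_dm2(["queimadura", "3", "dm2"]))
```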

The final technique used to normalize the textual data consists of stemming each word in the vocabulary using the RSLP Stemmer.

4.1.2 Tokenization

In this work sentences were tokenized using a single RE composed of several alternatives. Since there are misspelled and joined words, some considerations were made taking into account the four operations described at the beginning of section 3.3:

• A word has a maximum of one typo, for computational purposes;

• The insert operation is not considered since there are joined words;

• The QWERTY keyboard configuration was taken into account to reduce the number of possible typos.

Given this, each word in the corpus developed for this work was subjected to deletion, transposition, and replacement operations, under the QWERTY keyboard constraints, in order to obtain all of the unique words correctly spelled and the ones with only one typo. E.g., some of the results of these operations on the word ’sol’ (’sun’) are:

• Deletion - ’ol ’, ’sl ’;

• Replacing - ’dol ’, ’sll ’;

• Transposition - ’slo’, ’osl ’.
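The three operations can be sketched as follows. This is a minimal illustration, not the implementation of this work: the neighbour map is a hypothetical excerpt covering only the letters of ’sol’, whereas a full QWERTY adjacency table would be needed in practice.

```python
# Hypothetical excerpt of a QWERTY adjacency map (letters next to each key).
QWERTY_NEIGHBOURS = {
    "s": "awedxz",
    "o": "iklp",
    "l": "kop",
}


def deletions(word):
    """All strings obtained by removing one character."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def transpositions(word):
    """All strings obtained by swapping two adjacent characters."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }


def replacements(word, neighbours=QWERTY_NEIGHBOURS):
    """All strings obtained by replacing one character with a keyboard neighbour."""
    return {
        word[:i] + substitute + word[i + 1:]
        for i, char in enumerate(word)
        for substitute in neighbours.get(char, "")
    }


def one_typo_variants(word):
    """Union of the three operations, excluding the word itself."""
    return (deletions(word) | transpositions(word) | replacements(word)) - {word}


print(sorted(one_typo_variants("sol")))
```

Restricting replacements to neighbouring keys keeps the number of generated variants small compared with substituting every letter of the alphabet.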

After obtaining all possible words with only one typographical error, the correctly spelled words from the corpus and the abbreviations were added, since some of the sentences have substrings formed by joined words. These joined words may be any combination of abbreviations, correctly spelled words, or words with typos. All of the normalized temporal references were also added so that they are kept as they are. The alternatives of the RE consist of all those words, sorted alphabetically and by length so that the sentence is broken where it greedily matches the longest word.
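The construction of such a RE can be sketched as follows. This is a minimal illustration with a hypothetical seven-word vocabulary; it assumes that sorting the alternatives from longest to shortest is what makes the match greedy, since Python's `re` module tries alternatives from left to right.

```python
import re


def build_tokenizer(vocabulary):
    """Compile one RE whose alternatives are ordered longest first."""
    alternatives = sorted(set(vocabulary), key=lambda word: (-len(word), word))
    return re.compile("|".join(re.escape(word) for word in alternatives))


def tokenize(sentence, tokenizer):
    """Split a sentence, including joined words, into known tokens."""
    return tokenizer.findall(sentence)


# Hypothetical vocabulary: correct words, abbreviations, one-typo variants.
tokenizer = build_tokenizer(["doente", "dif", "resp", "hospital", "d", "a", "luz"])
print(tokenize("difresp", tokenizer))
print(tokenize("hospitald aluz", tokenizer))
```

Characters that match no alternative are skipped silently by `findall`, so the coverage of the vocabulary determines how much of the sentence survives tokenization.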


4.1.3 Word Correction

The Portuguese language has some peculiarities that need attention. It is very common to have words with diacritical marks, e.g. acute accent, circumflex, cedilla, among other diacritics. Some of the typos are due to these marks being omitted, e.g., ’pesames’ instead of ’pêsames’ (’condolences’), which yields a Jaro-Winkler distance of djw(s1, s2) = 0.186. Given this, and in order to enhance this technique in situations where such typos happen, the word comparison function, given by equation 4.1.2, also takes into account words in their ASCII encoding representation. When the ASCII string is used, the Jaro-Winkler distance is djw(s1 = ’pesames’, s2 = ASCII(’pêsames’)) = 0.0, yielding a total Jaro-Winkler distance of 0.5 · 0.186 + 0.5 · 0 = 0.093. In equation 4.1.2, s1 is the string being evaluated and s2 is the string from the corpus it is compared against. The weights were chosen in order to maximize the minimum similarity value between strings for the threshold. The maximum Jaro-Winkler distance that is considered between two strings is 0.15. Algorithm 1 illustrates the developed spellchecker, where the Compare function is given by equation 4.1.2.

d_{jw}(s_1, s_2) = 0.5\,\bigl(1 - sim_{jw}(s_1, s_2)\bigr) + 0.5\,\bigl(1 - sim_{jw}(s_1, \mathrm{ASCII}(s_2))\bigr) \qquad (4.1.2)

Algorithm 1: SpellChecker to correct misspelled words
Input: s : String being evaluated
Input: C : List of Words
Output: s : Corrected String

Initialize Dictionary of Corrected Words, CW = dict(); previous similarity = 0
for each word ∈ C do
    if s ∈ C then
        s ← s
        Output s
    else if s ∈ CW then
        s ← CW[s]
        Output s
    else
        similarity = Compare(s, word)
        if (similarity ≤ 0.15) ∧ (similarity > previous similarity) then
            s ← word
            previous similarity ← similarity
            auxiliary ← True
        end
    end
end
if auxiliary then
    CW[s] ← s
    Output s
else
    s ← s
    Output s
end
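A runnable sketch of the spellchecker is given next. It is an illustration under stated assumptions rather than the implementation of this work: the Jaro-Winkler similarity is a plain textbook version written for this example (so exact distance values may differ from the ones quoted in the text), the ASCII representation is obtained by stripping diacritics with `unicodedata`, and the candidate with the smallest distance not exceeding 0.15 is kept, which is an assumption about the intended selection rule.

```python
import unicodedata


def jaro(s1, s2):
    """Textbook Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    matched1, matched2 = [False] * n1, [False] * n2
    matches = 0
    for i, char in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(n1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / n1 + matches / n2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted by the common prefix (at most 4 characters)."""
    similarity = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return similarity + prefix * scaling * (1 - similarity)


def to_ascii(text):
    """Strip diacritics, e.g. 'pêsames' -> 'pesames'."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def compare(s1, s2):
    """Equation 4.1.2: average distance to s2 and to its ASCII form."""
    return 0.5 * (1 - jaro_winkler(s1, s2)) + 0.5 * (1 - jaro_winkler(s1, to_ascii(s2)))


def correct(word, corpus, corrected, threshold=0.15):
    """Return the closest corpus word within the threshold, else the word itself."""
    if word in corpus:
        return word
    if word in corrected:
        return corrected[word]
    best, best_distance = word, threshold
    for candidate in corpus:
        distance = compare(word, candidate)
        if distance <= best_distance:
            best, best_distance = candidate, distance
    if best != word:
        corrected[word] = best  # cache keyed by the misspelled form
    return best


corpus = ["amenorreia", "hospital", "pêsames"]
cache = {}
print(correct("amnorreia", corpus, cache))
print(correct("pesames", corpus, cache))
```

Caching corrections under the misspelled form avoids repeating the scan over the corpus when the same typo appears in another complaint.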


4.1.4 Word Removal

Despite the lack of proper grammar in this type of data, the healthcare provider sometimes uses linkers and connectors in a sentence, which in the field of NLP are called stopwords. Since these stopwords do not provide any information regarding the patient state, they were filtered from the chief main complaint. It was also necessary to remove words with fewer than three characters due to some noise coming from the tokenization process.
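The two filters can be sketched as follows. This is a minimal illustration; the stopword set is a tiny hypothetical subset, not the list used in this work.

```python
# Tiny illustrative subset of Portuguese stopwords (assumption).
STOPWORDS = {"de", "da", "do", "que", "com", "para", "uma"}


def remove_words(tokens, stopwords=STOPWORDS, min_length=3):
    """Drop stopwords and tokens with fewer than `min_length` characters."""
    return [
        token
        for token in tokens
        if len(token) >= min_length and token not in stopwords
    ]


print(remove_words("hospital d a luz".split()))
```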

Example using the developed Natural Language Processing framework

To better understand all the steps, an example is provided. The chief main complaint that is going to be pre-processed is given by: ’DOENTE queixosa difresp dum17.04 abrilamnorreia Hospitald aluz’. The following list summarizes the output of each step, and comments are provided when necessary:

• Lowercasing - ’doente queixosa difresp dum17.04 abrilamnorreia hospitald aluz ’;

• Temporal References Normalization - ’doente queixosa difresp dum 17abril abrilamnorreia

hospitald aluz ’. It verifies condition 2 for incomplete dates expressed by a day and a month and

since whitespacing was not considered as a delimiter the substring ’04 abril ’ would never match;

• Tokenization - [’doente’, ’queixosa’, ’dif’, ’resp’, ’dum’, ’17abril’, ’abril’, ’amnorreia’, ’hospital’, ’d’, ’a’, ’luz’] → ’doente queixosa dif resp dum 17abril abril amnorreia hospital d a luz’;

• Abbreviation Expansion - ’doente queixosa dificuldade respiratoria data ultima menstruacao

17abril abril amnorreia hospital d a luz ’;

• Word Correction - ’doente queixosa dificuldade respiratoria data ultima menstruacao 17abril

abril amenorreia hospital d a luz ’;

• Word Removal - ’doente queixosa dificuldade respiratoria data ultima menstruacao 17abril

abril amenorreia hospital luz ’ (’patient complaining respiratory difficulty date last menstrual period

17april amenorrhea hospital luz ’);

• Stemming - ’doent queix dificuldad respirator dat ult menstru 17abril abril amenorre hospit luz ’.

The example shows that the developed NLP framework performs well in transforming the raw main chief complaint into a textual document with meaning. The main disadvantage of this framework is that Tokenization may introduce some noise into the data, as one can see in the previous example. This was the main reason for the creation of the Word Removal step. After pre-processing all of the main chief complaints the data quality increased, and the data is ready to be used to develop machine learning models.

4.2 Data Modeling

The dataset was first shuffled and then sampled in a stratified fashion into a training set and a testing set. Subsequently a set of pipelines, illustrated in figure 4.4, was developed in order to deal with the set of predictors used for training and to help during the hyperparameter optimization step. When the chief main complaint is one of the predictors it is necessary to extract features from the text data using either BoW or tf-idf. The associated hyperparameters are the range of N-grams, i.e. whether the vectorization takes into account only unigrams, only bigrams, a combination of unigrams and bigrams, etc., and the size of the vocabulary, which filters out some of the infrequent N-grams and results in less complex models. When vital signs were used as predictors the associated features were scaled to the range from 0 to 1 using the min-max normalization method, so that all features share a common scale. Feature scaling is suggested for the majority of the considered learning approaches; moreover, the MNB and CNB algorithms do not accept negative feature values. An input ’Grid of parameters’, followed by a Uniform Distribution, is illustrated in the different pipelines of figure 4.4. It is represented because a Random Search with 10-fold cross-validation was used to choose the hyperparameters of the considered hypothesis space for the vectorization and learning strategies, with the goal of maximizing the AUC-ROC score.
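The two preprocessing branches of the pipelines can be sketched with the standard library alone. This is a minimal illustration, not the pipeline of this work: the function names are hypothetical, the two sample complaints are invented, and BoW is implemented as raw N-gram counts with a vocabulary capped at the most frequent terms.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """All N-grams with n_min <= N <= n_max, as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def fit_vocabulary(documents, n_min, n_max, max_features):
    """Keep the `max_features` most frequent N-grams (ties broken alphabetically)."""
    counts = Counter()
    for document in documents:
        counts.update(ngrams(document.split(), n_min, n_max))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {term: index for index, (term, _) in enumerate(ranked[:max_features])}


def bag_of_words(document, vocabulary, n_min, n_max):
    """Count vector of a document over the fitted vocabulary."""
    vector = [0] * len(vocabulary)
    for term in ngrams(document.split(), n_min, n_max):
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector


def min_max_scale(values):
    """Scale a numerical feature to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


documents = ["dor abdominal", "dor lombar"]
vocabulary = fit_vocabulary(documents, n_min=1, n_max=2, max_features=3)
print(vocabulary)
print(bag_of_words("dor lombar", vocabulary, 1, 2))
print(min_max_scale([36.0, 37.0, 38.0]))
```

The N-gram range and `max_features` play the role of the vectorization hyperparameters mentioned above; lowering `max_features` discards the rarest N-grams first.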

[Figure 4.4 (diagrams): in each pipeline a Grid of Parameters is passed through a Uniform Distribution to obtain a Set of Parameters. (a) Pipeline using the Chief Main Complaint as predictor: Main Chief Complaint, Feature Extraction (Vectorization), Machine Learning Algorithm. (b) Pipeline using numerical and categorical variables as predictors: Continuous Numerical Data with a Min-Max Scaler, and Categorical Data. (c) Pipeline using the Chief Main Complaint, numerical, and categorical data as predictors: Main Chief Complaint with Feature Extraction (Vectorization), Continuous Numerical Data with a Min-Max Scaler, and Categorical Data.]

Figure 4.4: Representation of the pipelines used during the hyperparameter optimization step with respect to the predictors used.

There is a difference between a model’s parameters and its hyperparameters. Model parameters are learned during training, whereas model hyperparameters are set by the designer before training and control implementation aspects of the model. E.g., the weights learned during training of a LR model are parameters, while the type of kernel of a SVM is a hyperparameter. Hyperparameters can be thought of as model settings that need to be tuned for each problem, since the best set of hyperparameters for one particular dataset will not be the best across all datasets. Hyperparameter optimization is the process of finding the combination of hyperparameter values that enhances the performance of a machine learning model. Random search was used as the strategy to tune the hyperparameters of the algorithms since the chosen hypothesis space is considerably large and this approach usually finds an optimal or nearly optimal set of hyperparameters in few iterations [100]. Random Search chooses the set of hyperparameters by uniformly selecting the values of each hyperparameter from the specified grid.

In order to evaluate the performance of a learning model given a set of hyperparameters, k-fold cross validation was used, since with this strategy it is not necessary to separate the training set into two fixed sets, one for training and the other for validation, and the data is better used. Another advantage of this technique is that it should result in models capable of generalizing beyond the training data, since one avoids overfitting to a single and constant validation set. This approach, presented in figure 4.5, consists of splitting the training set into k equally sized folds, where k − 1 of those are used for training and the remaining one for validation. The fold used for validation rotates until every fold has been used once. Shuffling and stratified sampling were applied when building these folds.
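Random search over a grid and stratified k-fold splitting can be sketched as follows. This is a minimal standard-library illustration, not the implementation of this work; the grid values and function names are hypothetical.

```python
import random
from collections import defaultdict


def sample_hyperparameters(grid, rng):
    """Pick one value per hyperparameter, uniformly from its grid."""
    return {name: rng.choice(values) for name, values in grid.items()}


def stratified_k_fold(labels, k, rng):
    """Yield (training indices, validation indices) for k stratified folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(
            index
            for number, fold in enumerate(folds)
            if number != held_out
            for index in fold
        )
        yield training, validation


rng = random.Random(0)
grid = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}  # hypothetical grid
candidate = sample_hyperparameters(grid, rng)

labels = [0] * 16 + [1] * 4  # imbalanced toy outcome
splits = list(stratified_k_fold(labels, k=4, rng=rng))
print(candidate, len(splits))
```

Each sampled candidate would be scored on every split, and the mean validation score used to rank the candidates.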

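[Figure 4.5 (diagram): the dataset is divided into a training set and a held-out testing set; within the training set, k-fold cross validation trains on k−1 stratified splits and validates on the remaining one.]

Figure 4.5: Representation of the k-fold cross validation approach.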

When sampling the dataset, stratification with respect to the dependent variable was used. The reason is the presence of class imbalance in the dataset, as explained in section 5.1. Since the performance of some of the considered algorithms, LR and SVM, can deteriorate with class imbalance, a cost-sensitive learning strategy was applied. This approach makes the learning model aware of the imbalanced data by incorporating the weight of each class, computed as shown in equation 4.2.1, into the objective function. The weight associated with the less frequent class is larger than that of the other class, since the weights are inversely proportional to the frequency of the classes in the training data. Because the weights are incorporated into the objective function, the learning model is penalized more heavily when it misclassifies observations of the minority class.

w_{y_i} = \frac{\text{Number of Samples}}{\text{Number of Classes} \times \sum_{n=1}^{N} \mathbb{1}(y_n = y_i)} \qquad (4.2.1)
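The class weights of equation 4.2.1 can be computed as in the following sketch. It assumes that the sum in the denominator counts the training observations of the class in question; the toy labels are invented for the example.

```python
from collections import Counter


def class_weights(labels):
    """Weight of each class: n_samples / (n_classes * class count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy training labels: 8 observations of class 0 and 2 of class 1.
weights = class_weights([0] * 8 + [1] * 2)
print(weights)
```

The minority class receives the larger weight, so its misclassifications contribute more to the objective function.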

When using LR and SVM, the parameters of these learning models, i.e. their weights, are computed during training. One resorted to local-based methods, CD and SGD, in order to see the influence of the optimization algorithm on the performance of the model [101]. It is important to note that the developed SVM models used the linear kernel due to the size of the dataset and the number of features, since the time complexity of a non-linear SVM ranges between O(n_features × n_samples^2) and O(n_features × n_samples^3). The hyperparameters that were tuned for each of the learning algorithms are listed in table 4.7.

Table 4.7: Hyperparameters for each learning strategy: Logistic Regression (LR) using Coordinate Descent (CD) or Stochastic Gradient Descent (SGD), Multinomial Naive Bayes (MNB), Complement Naive Bayes (CNB), and Support Vector Machine (SVM) with either CD or SGD.

Algorithms    Hyperparameters
LR CD         C: Inverse of Regularization Strength
              Regularization function: L1 or L2
LR SGD        λ: Regularization term
              Learning Rate schedule
              Number of iterations
MNB           α: Additive smoothing
CNB           α: Additive smoothing
SVM CD        C: Inverse of Regularization Strength
SVM SGD       λ: Regularization term
              Learning Rate schedule
              Number of iterations

4.3 Model Evaluation

After a model has been trained it is necessary to know whether it can describe the data and, most importantly, whether it is capable of generalizing beyond the seen data. During both training and testing, bootstrapping was applied for each performance metric, and during testing McNemar’s Hypothesis test was added in order to compare the error rates of the machine learning models. The cut-off probability that presented the best separation between classes was also computed.

Given that the objective during the hyperparameter optimization step was to maximize the ROC-AUC score, Youden’s J index was used to compute the optimal probability threshold during the validation step, since it guarantees a compromise between sensitivity and specificity [102]. For each probability threshold value the Youden index is given by equation 4.3.1, and the optimal probability threshold, considered as the cut-off probability, is the one that maximizes J, as expressed in equation 4.3.2.

J = sensitivity + specificity − 1 (4.3.1)

max(J) (4.3.2)
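The search for the cut-off probability can be sketched as follows. This is a minimal illustration of equations 4.3.1 and 4.3.2; it assumes that an observation is predicted as a revisit when its probability is greater than or equal to the threshold, and the toy probabilities are invented.

```python
def youden_cutoff(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(y_prob)):
        true_positives = sum(
            1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold
        )
        true_negatives = sum(
            1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold
        )
        j = true_positives / positives + true_negatives / negatives - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


y_true = [0, 0, 0, 1, 1]
y_prob = [0.10, 0.20, 0.60, 0.55, 0.90]
threshold, j = youden_cutoff(y_true, y_prob)
print(threshold, round(j, 3))
```

Only the observed probabilities are tried as thresholds, since J can change only at those values.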

In order to know the range of values of each performance metric for a given learning model, bootstrapped 95% confidence intervals were computed during validation with 200 bootstrap samples, as illustrated in figure 4.6a. Given the chosen confidence level, the percentiles considered for the calculation of the lower and upper bounds are α1 = 0.025 and α2 = 0.975, respectively. During testing, bootstrapping was also used to estimate 95% confidence intervals. The model was evaluated on a resampled population in order to obtain its estimates, as shown in figure 4.6b. For the computation of these confidence intervals, 200 bootstrap samples were also used.
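The percentile bootstrap used during testing can be sketched as follows. This is a minimal illustration, not the code of this work: predictions of an already trained model are resampled with replacement, accuracy stands in for the performance metric, and the percentile is obtained by linear interpolation, which is one of several common conventions.

```python
import random


def percentile(sorted_values, q):
    """Percentile q in [0, 1] by linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def accuracy(y_true, y_pred):
    return sum(1 for y, p in zip(y_true, y_pred) if y == p) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=200, alpha=0.05, seed=0):
    """Percentile confidence interval of `metric` over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstrap):
        indices = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in indices], [y_pred[i] for i in indices]))
    scores.sort()
    return percentile(scores, alpha / 2), percentile(scores, 1 - alpha / 2)


y_true = [0, 1] * 50
y_pred = y_true[:80] + [1 - y for y in y_true[80:]]  # 80% of predictions correct
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(lower, upper, upper - lower)
```

The difference between the upper and the lower bound is the uncertainty used later to compare models.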

When one has models that both represent the data and are capable of generalizing well, the following question is raised: from the set of models, which one is the best? In this work McNemar’s Hypothesis test was used to help answer that question by first determining whether the models present statistically significantly different error proportions. This was employed only during the testing step, and the null hypothesis and the alternative hypothesis are given by:

• H0 - There is no significant difference between the models’ error rates on the test set;

• H1 - The models have different error rates on the test set.
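McNemar's test on the predictions of two models can be sketched as follows. This is a minimal illustration; it assumes the chi-squared form of the test with continuity correction, which may differ from the variant used in this work. For one degree of freedom the p-value equals erfc(sqrt(statistic / 2)), so only the standard library is required.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """Return (statistic, p-value) of McNemar's test with continuity correction."""
    # Discordant pairs: exactly one of the two models is correct.
    only_a_correct = sum(
        1 for y, a, b in zip(y_true, pred_a, pred_b) if a == y and b != y
    )
    only_b_correct = sum(
        1 for y, a, b in zip(y_true, pred_a, pred_b) if a != y and b == y
    )
    discordant = only_a_correct + only_b_correct
    if discordant == 0:
        return 0.0, 1.0
    statistic = (abs(only_a_correct - only_b_correct) - 1) ** 2 / discordant
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy example: each model is the only correct one on 10 observations.
y_true = [1] * 40
pred_a = [1] * 10 + [0] * 10 + [1] * 20
pred_b = [0] * 10 + [1] * 10 + [1] * 20
statistic, p_value = mcnemar(y_true, pred_a, pred_b)
print(statistic, p_value > 0.05)
```

A p-value above the 5% significance level means that H0 is not rejected, i.e. the two models show similar error proportions.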

There may be cases where the uncertainty associated with the performance metrics of a learning model, computed by subtracting the lower bound from the upper bound of the bootstrapped 95% confidence interval, and the result of McNemar’s Hypothesis test are insufficient to discriminate between models. Given this, some precautions were taken to accommodate these limitations. In this work a model is selected by analyzing:

• Its performance;

• Its uncertainty, where less is better;

• Its generalization capability;

• Its parsimony, in order to favour less complex models.

[Figure 4.6 (flowcharts): in both strategies the population is resampled, the performance is saved, and the loop repeats until the number of bootstrap iterations is reached. (a) Bootstrapping strategy employed during the validation step: the model is trained on the training sample and tested on the out-of-bag sample. (b) Bootstrapping strategy employed during the testing step: the machine learning model is evaluated on the resampled population.]

Figure 4.6: Representation of the bootstrapping strategies used during the validation and testing steps.

Chapter 5

Results

5.1 Database Description

The original dataset comprises 850 189 entries of patients who went to the ED of Hospital Beatriz Angelo, collected from 2012 to 2016. In order to have a clean dataset for further analysis, the data was subjected to a filtering process, as illustrated in figure 5.1, since:

• Only adults, aged at least 16 years old, were considered for the study;

• Some inconsistencies were detected in the chief main complaint field, where the discriminator was used instead;

• In some cases the chief main complaint field was not filled;

• Some adult patients had pediatric discriminators;

• Some of the patients in the sample were retriaged.

As already stated at the beginning of section 4.1, an already preprocessed dataset with numerical and categorical variables was provided. Some of the provided features were engineered, e.g. the time between a patient entering the ED and going to the triage stage, the number of exams that were made, the number of outliers present in the data, etc., while others are standard features collected during triage, e.g. temperature, main chief complaint, age, among others. Given this, a total of 40 variables were considered for this study and are presented in table 5.1. This table is divided into three types of variables: Baseline variables, which comprise vital signs, demographics, and dummy variables developed from the vital signs; Text, which holds the main chief complaint; and Additional variables, consisting of the priority according to the Manchester Triage System, the arrival mode of the patient, triage discriminators, the Glasgow coma scale, and engineered variables such as the number of exams that were made, the number of outliers, the hour, weekday, and month of the visit to the ED, the mean arterial blood pressure, the time that the patient waited until being subject to triage, and the glycemia.

[Figure 5.1 (flow diagram):
Patients that went to the Emergency Department from 2012 to 2016 (n = 850189)
    Excluded (n = 238096): Patient Age < 16 years old
Adult cohort (n = 612093)
    Excluded (n = 100792):
        No chief main complaint (n = 2484)
        Chief main complaint field the same as the discriminator (n = 67156)
        Hospital Admission (n = 5328)
        ReTriageID specified (n = 8512)
        Patients with pediatric discriminators (n = 9)
        Intersect with the previously preprocessed numerical dataset (n = 17303)
Considered in the present study (n = 511301)]

Figure 5.1: Flow diagram outlining the inclusion and exclusion criteria.

Table 5.1: Predictor variables and outcome used for modelling Hospital Beatriz Angelo (HBA) ED data.

Group                  Variables                                            Type
Baseline               Age (years): 18 - 107                                Continuous
                       Vital signs:
                         Respiratory Rate (RR) (breaths/min)
                         Heart Rate (HR) (beats/min)
                         Temperature (Temp) (°C)
                         Pulse Oximetry (SpO2) (%)
                         Systolic Blood Pressure (SBP) (mmHg)
                         Diastolic Blood Pressure (DBP) (mmHg)
                       Pain Scale (PS): 0 - 10                              Categorical
                       Gender: 1 (Female), 0 (Male)
Engineered Variables   Missing Value in the RR variable (RR_in)             Categorical
                       Abnormal Values in the RR variable (RR_out)
                       Missing Value in the HR variable (HR_in)
                       Abnormal Values in the HR variable (HR_out)
                       Missing Value in the DBP variable (DBP_in)
                       Abnormal Values in the DBP variable (DBP_out)
                       Missing Value in the SBP variable (SBP_in)
                       Abnormal Values in the SBP variable (SBP_out)
                       Missing Value in the Temp variable (Temp_in)
                       Abnormal Values in the Temp variable (Temp_out)
                       Missing Value in the Gly variable (Gly_in)
                       Abnormal Values in the Gly variable (Gly_out)
                       Missing Value in the Oxi variable (Oxi_in)
                       Abnormal Values in the Oxi variable (Oxi_out)
                       Missing Value in the PS variable (PS_in)
                       Missing Value in the GCS variable (GCS_in)
Text                   Chief Main Complaint                                 Textual
Additional             Manchester Triage System (MTS):                      Categorical
                         1 (Emergent), 2 (Very urgent), 3 (Urgent),
                         4 (Standard), 5 (Non Urgent)
                       Arrival mode:
                         1 (Walk-in), 2 (Ambulance),
                         3 (Not registered in system)
                       Triage discriminators: 1 - 118
                       Glasgow Coma Scale (GCS): 3 - 15
                       Number of exams: 0 - 3 or more
                       Number of missing vitals+pain level: 0 - 7
                       Number of abnormal vital signs: 0 - 4 or more
                       Triage hour: 0 - 23
                       Triage weekday: 1 - 7
                       Triage month: 1 - 12
                       Mean Arterial Blood Pressure (MAP) (mmHg)            Continuous
                       Waiting time for triage (Admn2Tr) (min)
                       Glycemia (Gly) (mg/dL)
Outcome                ED Revisits: 1 (Revisit), 0 (No Revisit)             Categorical

With respect to the outcome, the original revisits dataset includes patients who returned to the ED within 72 hours; in some cases, patients returned 6 seconds after being discharged. Only patients who returned to the ED more than 1 hour after being discharged were considered as return visits. Patients who returned within less than an hour were patients who: were hospitalized; had a modification in some field, e.g. chief main complaint, discriminator, etc.; went into labor; among other reasons. A total of 28 973 patients satisfied the criteria for being considered as a patient who returned to the ED, which corresponds to 5.7% of all patients considered for this study.

After filtering the dataset, some exploratory data analysis was carried out by means of data visualization. Firstly, univariate analysis was performed for a limited set of numerical features, illustrated in figure 5.2, due to the number of features. The proportion of patients that revisit the ED is very similar between age groups, with the exception of age group 40-64, where the percentage of revisits is approximately 2 percentage points lower than in the remaining groups, as depicted in subfigure 5.2a. The cohort is mainly composed of female patients, but the percentage of revisits is basically the same between genders, as illustrated in subfigure 5.2b.

Secondly, one resorted to multivariate analysis by means of the Pearson Correlation between each feature and the outcome, represented by the heatmap in figure 5.3, where light blue corresponds to a negative correlation (if one value increases the other decreases), dark blue corresponds to highly correlated features, and cyan blue corresponds to uncorrelated variables. As one can see, there is no high correlation between the variables and the outcome, but some features are highly correlated with each other. E.g., the Mean Arterial Blood Pressure (MAP) is highly correlated with the Systolic Blood Pressure (SBP) and the Diastolic Blood Pressure (DBP), since MAP can be calculated from both values as MAP = (SBP + 2 × DBP) / 3. As an example of highly negatively correlated variables one has the engineered features abnormal values and number of missing vitals (n missing vitals), which seems reasonable since with the increase of missing data there are fewer abnormal values present in the data.

Focusing now on the textual data, the top 20 most frequent N-Grams for each class are compared in figure 5.4. In order to better understand the meaning of those N-Grams, the chief main complaint was used as it was before the stemming preprocessing step. There is a total of 26 different words among the unigrams presented, and it is noticeable that the top 20 unigrams for patients that did not return have more medical terms associated with them, e.g. ’cefaleia’ (’cephalgia’). Turning to the top 20 bigrams, one notices that some bigrams of patients that returned to the ED represent pregnant female patients. For the other class there are several temporal references and pain complaints. Finally, the top 20 trigrams of patients revisiting the ED agree with the result for the bigrams, since they are associated with pregnant patients. Since the female gender has greater representation within the data, a higher number of terms associated with this gender is expected. The terms associated with females are mostly related to pregnancy. There are approximately 20 555 pregnant patients in the dataset, corresponding to 4.02% of the cohort. Among these patients, 2 482 were considered as readmitted to the ED, corresponding to 8.57% of all patients belonging to class 1, i.e. readmitted patients.
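The relation between MAP, SBP, and DBP mentioned in the correlation analysis, MAP = (SBP + 2 × DBP) / 3, can be checked with a one-line function. The blood pressure values below are hypothetical and serve only as a worked example.

```python
def mean_arterial_pressure(sbp, dbp):
    """MAP in mmHg from systolic and diastolic blood pressure."""
    return (sbp + 2 * dbp) / 3


# Hypothetical reading of 120/80 mmHg.
print(round(mean_arterial_pressure(120, 80), 1))
```

Because MAP is a fixed linear combination of SBP and DBP, a high correlation with both is expected by construction.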

The dataset was partitioned into two sets using shuffling and stratified sampling with respect to the outcome, due to data imbalance. Those sets are the training and testing sets, where the training set comprises 70% of all the available data, i.e. 357 910 patients, and the testing set has the remaining 30% of the data, i.e. 153 391 patients.

[Figure 5.2 (charts), proportions of no revisit (0) and revisit (1) per group:
(a) Age groups: 18 - 39: 0 - 93.8%, 1 - 6.2%; 40 - 64: 0 - 95.4%, 1 - 4.6%; 65 - 84: 0 - 94.1%, 1 - 5.9%; 85+: 0 - 93.3%, 1 - 6.7%.
(b) Gender: Male: 0 - 94.6%, 1 - 5.4%; Female: 0 - 94.2%, 1 - 5.8%.]

(a) Distribution of patients by age groups with the respective proportions of revisits (1) and no revisit (0) for each group.

(b) Distribution of patients according to gender with the respective proportions of revisits (1) and no revisit (0) for each group.

Figure 5.2: Univariate Analysis of age groups and gender with respect to the outcome. Age Groups according to [103–105].

[Figure 5.3 (heatmap): both axes list Arrival Mode, MTS, Gender, Adm2Tr_min, Triage Hour, Triage Month, Triage Weekday, Triage Discriminators, PS, HR, DBP, SBP, Temp, Gly, GCS, Oxi, RR, MAP, PS_in, HR_in, DBP_in, SBP_in, T_in, Gly_in, Glas_in, Oxi_in, RR_in, HR_out, DBP_out, SBP_out, Temp_out, Gly_out, Oxi_out, RR_out, Number of missing vitals, Number of abnormal vitals, Number of Exams, Age, and Outcome.]

Figure 5.3: Pearson Correlation between features, described in table 5.1, and the outcome.

[Figure 5.4 (horizontal bar charts of the top 20 N-Grams against their frequency):
(a) Unigram Frequency for patients who returned to the ED.
(b) Unigram Frequency for patients who did not return to the ED.
(c) Bigram Frequency for patients who returned to the ED.
(d) Bigram Frequency for patients who did not return to the ED.
(e) Trigram Frequency for patients who returned to the ED.
(f) Trigram Frequency for patients who did not return to the ED.]

Figure 5.4: Comparison between the top 20 most frequent N-Grams (N = 1, 2, 3), before stemming, for each label.

5.2 Model Comparison

The different hypotheses of the considered hypothesis space are the following:

• Baseline - Baseline variables described in table 5.1 were considered;

• All Numeric - The Baseline and Additional variables were used;

• Textual - Only the main chief complaint was used;

• Textual and Baseline - The Baseline variables were used along with the main chief complaint;

• Textual and All Numeric - The Textual variable was used alongside the All Numeric variables.

The models developed using the Baseline variables showed low predictive power, as can be seen in table A.1, since they present low values of the AUC-ROC metric. This suggests that using only the baseline features generates models that cannot discriminate the true state of each patient in the cohort, and that one should consider more features to improve model performance. From the results of McNemar’s Hypothesis Test, given a significance level of 5%, it is concluded that the null hypothesis, H0, could not be rejected for the MNB and CNB models, as illustrated in table 5.7. This means that the error rate is similar between these models, which was expected since their results are the same. Among these models, the one that shows the best performance is the LR model using the SGD optimization technique to compute the weights, with AUC-ROC = 0.609 and values of 0.567 and 0.630 for the sensitivity and specificity measures, respectively.

When one considers the hypothesis of using all of the available continuous and categorical numerical variables, i.e. using the All Numeric predictor set, one sees an improvement in the performance of the models when compared with the ones that only considered the Baseline variables, as illustrated in table A.2. Despite the performance improvement, there is still a gap between sensitivity and specificity, since these models show a better sensitivity than specificity. As in the conclusion regarding McNemar’s Hypothesis Test for the Baseline hypothesis, the only models that showed similar error rates were the MNB and CNB models, as shown in table 5.7. This conclusion was expected given that these models presented the same predictive skill. Resorting only to the AUC-ROC metric, there is a tie for the best model between the SVM and, once again, the LR models, with AUC-ROC = 0.782. The respective sensitivity and specificity values are 0.731 and 0.674 for the LR model and 0.734 and 0.672 for the SVM model. In order to choose a model, one resorted to the uncertainty of each performance metric using the computed 95% bootstrapped confidence intervals, as illustrated in table 5.2.

Given this, the best model is the LR model, since it has a lower uncertainty regarding its prediction skill than the SVM model. Comparing the performance of the LR model under this hypothesis with the one using only the Baseline variables shows an improvement of 17.3, 18.7, and 5.9 percentage points for the AUC-ROC, Sensitivity, and Specificity, respectively.


Table 5.2: The uncertainty associated with the candidate models under the All Numeric Hypothesis.

| Candidate Models | AUC-ROC | Sensitivity | Specificity |
|---|---|---|---|
| LR | 78.5% − 77.8% = 0.7% | 74.2% − 72.0% = 2.2% | 67.8% − 66.3% = 1.5% |
| SVM | 78.5% − 77.8% = 0.7% | 76.8% − 72.3% = 4.5% | 67.8% − 64.4% = 3.4% |
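The uncertainty values in table 5.2 are the widths of the 95% bootstrapped confidence intervals (upper bound minus lower bound). A minimal sketch of how such a width could be computed for Sensitivity is given below. The percentile bootstrap, the number of resamples, and the names `sensitivity` and `bootstrap_ci` are assumptions made for illustration, and the labels are synthetic rather than taken from the study data.

```python
import random


def sensitivity(y_true, y_pred):
    """Fraction of true positives among all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else float("nan")


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; returns (lower, upper, width)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping label and prediction paired.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * n_resamples)]
    upper = stats[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper, upper - lower


# Synthetic example: 50 revisits, of which 40 are detected (sensitivity 0.8).
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 50
lower, upper, width = bootstrap_ci(y_true, y_pred, sensitivity)
print(f"95% CI: [{lower:.3f}, {upper:.3f}], width = {width:.3f}")
```

A narrower interval means the metric varies less across resamples of the test set, which is the sense in which one model is said to have lower uncertainty than another.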

Focusing now on the Textual hypothesis, there is a slight improvement in the AUC-ROC metric, but the gap between sensitivity and specificity mentioned before remains. Unlike the models developed using the All Numeric variables, these models are better at discriminating patients who did not return to the ED than patients who did, since specificity is higher than sensitivity, as can be seen in table A.3. Looking only at the AUC-ROC metric, there are five candidates for the best model: the LR and SVM with both the BoW and tf-idf feature extraction techniques, and the LR with the SGD optimization technique and tf-idf features. Taking the sensitivity and specificity values into account as well adds the MNB and CNB with tf-idf features. The results of McNemar’s Hypothesis Test, illustrated in table 5.7, identify one pair of candidate models for which H0 could not be rejected: the LR with BoW features and the SVM with tf-idf features, with a p-value of 0.459. This means that these two models have similar error proportions, even though they may misclassify different patients. Because of this result, of these two models only the LR with BoW features is considered in the uncertainty analysis, illustrated in table 5.3.

Table 5.3: The uncertainty associated with the candidate models under the Textual Hypothesis.

| Candidate Models | AUC-ROC | Sensitivity | Specificity |
|---|---|---|---|
| LR BoW | 79.1% − 78.0% = 1.1% | 71.5% − 67.5% = 4.0% | 74.9% − 70.8% = 4.1% |
| SVM BoW | 79.1% − 78.0% = 1.1% | 70.9% − 67.8% = 3.1% | 74.4% − 71.7% = 2.7% |
| LR tf-idf | 79.4% − 78.3% = 1.1% | 70.3% − 67.7% = 2.6% | 75.0% − 72.9% = 2.1% |
| LR SGD tf-idf | 79.4% − 78.3% = 1.1% | 71.2% − 65.9% = 5.3% | 77.1% − 71.8% = 5.3% |
| MNB tf-idf | 77.9% − 76.8% = 1.1% | 70.4% − 67.0% = 3.4% | 74.5% − 71.1% = 3.4% |
| CNB tf-idf | 77.9% − 76.8% = 1.1% | 70.2% − 66.8% = 3.4% | 76.2% − 73.0% = 3.2% |

Given the uncertainty analysis of the candidate models, the one with the lowest level of uncertainty is the LR model with tf-idf features, whose values for AUC-ROC, Sensitivity, and Specificity are 0.789, 0.688, and 0.741, respectively. Compared with the best model that uses all numerical variables, this LR model shows an improvement of 0.7 and 6.7 percentage points in AUC-ROC and Specificity, respectively, and a decline of 4.3 percentage points in Sensitivity.

Under the Textual and Baseline hypothesis, there are no major improvements in model performance compared with the models that used only the chief main complaint, as illustrated in table A.4. As before, judging only by the values of AUC-ROC, Sensitivity, and Specificity, the candidates are the LR, LR with SGD, CNB, and SVM models using the BoW approach to extract features from the chief main complaint, and the LR, LR with SGD, MNB, CNB, and SVM models with tf-idf features. As illustrated in table 5.7, some of the candidate models present similar error rates, since H0 could not be rejected for them. As before, because of the result of McNemar’s Hypothesis Test, only some models are considered in the uncertainty analysis, illustrated in table 5.4.

Comparing the uncertainty in Sensitivity, no conclusions can be drawn since the results are very similar between models. For Specificity, the SVM model with tf-idf features presents the least uncertainty. Since this model and the LR with SGD training and tf-idf features have similar error proportions, as shown in table 5.7, the uncertainty of these two models, illustrated in table 5.5, must be compared to conclude which is the best under this hypothesis.

Table 5.4: The uncertainty associated with the candidate models under the Textual and Baseline Hypothesis.

| Candidate Models | AUC-ROC | Sensitivity | Specificity |
|---|---|---|---|
| LR SGD BoW | 78.8% − 77.7% = 1.1% | 70.5% − 67.1% = 3.4% | 75.1% − 71.1% = 4.0% |
| SVM BoW | 79.4% − 78.3% = 1.1% | 70.9% − 67.5% = 3.4% | 75.4% − 71.8% = 3.6% |
| LR tf-idf | 79.7% − 78.5% = 1.2% | 70.0% − 66.8% = 3.2% | 76.1% − 72.7% = 3.4% |
| SVM tf-idf | 79.6% − 78.5% = 1.1% | 70.2% − 66.7% = 3.5% | 76.3% − 73.6% = 2.7% |

Table 5.5: Comparison of the uncertainty between candidate models with similar error proportions under the Textual and Baseline Hypothesis.

| Candidate Models | AUC-ROC | Sensitivity | Specificity |
|---|---|---|---|
| LR SGD tf-idf | 79.5% − 78.4% = 1.1% | 71.7% − 66.3% = 5.4% | 77.5% − 71.3% = 6.2% |
| SVM tf-idf | 79.6% − 78.5% = 1.1% | 70.2% − 66.7% = 3.5% | 76.3% − 73.6% = 2.7% |

Since the uncertainty of the LR model with SGD training and tf-idf features is higher than that of the SVM with tf-idf features, the latter is the best model under the Textual and Baseline hypothesis, with values of AUC-ROC, Sensitivity, and Specificity of 0.791, 0.692, and 0.739, respectively. Compared with the LR model with tf-idf features under the Textual hypothesis, this is an improvement of 0.2 and 0.4 percentage points in AUC-ROC and Sensitivity, respectively. Regarding Specificity, there is a decline of 0.2 percentage points (from 0.741 to 0.739). As already stated, no significant changes occurred in the performance of the developed models under the Textual and Baseline hypothesis.

Analyzing the results of the last hypothesis considered, Textual and All Numeric, as illustrated in table 5.8, the gap between Sensitivity and Specificity is much smaller and performance is considerably improved. Following the same approach as before, the candidate models are the LR and SVM with BoW features, and the LR, MNB, and SVM with tf-idf features. Among the candidates, the pair that showed no statistically significant difference in performance was the LR and MNB with tf-idf features extracted from the chief main complaint, as can be seen in the last row of table 5.7. Given this, of these two models, only the LR model is considered when analyzing the uncertainty of each model.

The uncertainty of each candidate model, illustrated in table 5.6, indicates a tie between the LR and SVM with tf-idf features. The values of AUC-ROC, Sensitivity, and Specificity are, respectively, 0.842, 0.768, and 0.731 for the LR model, and 0.855, 0.754, and 0.766 for the SVM. The major difference between these models is in Specificity, i.e. in correctly identifying patients who did not return to the ED. Looking at the hyper-parameters of each model, and remembering that each N-Gram is a feature, the most complex model is the SVM: to achieve these results it required unigrams, bigrams, and trigrams, with the training vocabulary reduced to 9 500 features, while the LR model achieved similar results using only 1 000 unigrams from the training set. Given that parsimonious models are preferable and that the LR required fewer features, it is considered the better of the two. Since there is no statistical difference between the performance of the LR and the MNB models, they also need to be compared. Regarding the hyper-parameters, the MNB model using tf-idf features is more complex than the LR model, since it uses unigrams and bigrams and its training vocabulary was only reduced to 29 000 features. Given this, the best model under the Textual and All Numeric hypothesis is the LR model, showing improvements of 5.1 and 7.6 percentage points in AUC-ROC and Sensitivity, and a decline of 0.8 percentage points in Specificity, when compared with the SVM model with tf-idf features under the Textual and Baseline hypothesis.
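Since each N-Gram is a feature, the N-Gram range and the vocabulary cap together determine how many features a model sees. The sketch below illustrates this with a hand-written vocabulary builder. The function names, the whitespace tokenizer, and the three example complaints are invented for illustration and do not come from the study data; `ngram_range` and `max_features` merely mirror the hyper-parameter names listed in table 5.8.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=1):
    """Yield every contiguous n-gram with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(docs, ngram_range=(1, 1), max_features=None):
    """Keep the max_features most frequent n-grams across the corpus."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), *ngram_range))
    return [term for term, _ in counts.most_common(max_features)]


# Invented complaints, for illustration only.
docs = [
    "chest pain since morning",
    "abdominal pain and fever",
    "chest pain and cough",
]
unigram_vocab = build_vocabulary(docs, ngram_range=(1, 1))
trigram_vocab = build_vocabulary(docs, ngram_range=(1, 3))
print(len(unigram_vocab), len(trigram_vocab))
```

Even on three short complaints, extending the range from unigrams to trigrams increases the vocabulary from 8 to 21 terms, which is why a model restricted to unigrams with a small vocabulary cap is the more parsimonious choice.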

Table 5.6: The uncertainty associated with the candidate models under the Textual and All Numeric Hypothesis.

| Candidate Models | AUC-ROC | Sensitivity | Specificity |
|---|---|---|---|
| LR BoW | 86.0% − 85.1% = 0.9% | 78.9% − 75.8% = 3.1% | 76.6% − 74.4% = 2.2% |
| SVM BoW | 86.4% − 85.6% = 0.8% | 79.9% − 74.7% = 5.2% | 78.7% − 73.7% = 5.0% |
| LR tf-idf | 86.4% − 85.6% = 0.8% | 77.9% − 75.3% = 2.6% | 74.6% − 72.3% = 2.3% |
| SVM tf-idf | 85.9% − 85.1% = 0.8% | 77.4% − 74.9% = 2.5% | 76.9% − 74.6% = 2.3% |

To better understand how the performance of each model varies across all of the considered hypotheses, figure 5.5 shows six radar plots, where each plot represents a learning algorithm and its performance under each hypothesis. The hypotheses that considered textual data had to be split in two because of the two feature extraction approaches that were used.


Table 5.7: McNemar’s Hypothesis Test results with respect to pairs of machine learning models that failed to reject the null hypothesis given a significance level of 5% for the 5 predictor sets.

| Hypothesis | Model 1 | Model 2 | P-Value |
|---|---|---|---|
| Baseline | MNB | CNB | 1.000 |
| All Numeric | MNB | CNB | 1.000 |
| Textual | LR BoN-Grams | SVM TF-IDF | 0.459 |
| Textual | LR SGD BoN-Grams | SVM TF-IDF | 0.120 |
| Textual | MNB BoN-Grams | LR SGD TF-IDF | 0.281 |
| Textual and Baseline | LR BoN-Grams | LR TF-IDF | 0.210 |
| Textual and Baseline | LR SGD BoN-Grams | MNB TF-IDF | 0.715 |
| Textual and Baseline | LR SGD BoN-Grams | CNB TF-IDF | 0.727 |
| Textual and Baseline | LR SGD TF-IDF | SVM TF-IDF | 0.149 |
| Textual and Baseline | MNB TF-IDF | CNB TF-IDF | 0.994 |
| Textual and All Numeric | MNB BoN-Grams | CNB BoN-Grams | 1.000 |
| Textual and All Numeric | SVM SGD BoN-Grams | SVM SGD TF-IDF | 0.893 |
| Textual and All Numeric | SVM SGD BoN-Grams | CNB TF-IDF | 0.227 |
| Textual and All Numeric | LR TF-IDF | MNB TF-IDF | 0.486 |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; BoN-Grams - Bag of N-Grams; TF-IDF - Term Frequency-Inverse Document Frequency


Figure 5.5: Comparison between the performance of the Machine Learning models according to each of the hypotheses. Panels: (a) Logistic Regression models; (b) Logistic Regression models using Stochastic Gradient Descent; (c) Multinomial Naive Bayes models; (d) Complement Naive Bayes models; (e) Support Vector Machine models; (f) Support Vector Machine models using Stochastic Gradient Descent. Each radar plot has the axes AUC, Sensitivity, Specificity, F1-Score, Precision, and Kappa, with one trace per predictor set: Baseline, All Numeric, Textual BoNGram, Textual tf-idf, Textual Baseline BoNGram, Textual Baseline tf-idf, Textual All Numeric BoNGram, and Textual All Numeric tf-idf.


Table 5.8: Results for the machine learning models on the test set using the main chief complaint and all numerical variables, with their respective hyper-parameters. Results for the performance metrics are shown as Performance [95% Bootstrap Confidence Interval].

| Features | Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen’s Kappa |
|---|---|---|---|---|---|---|---|---|
| Bag of N-Grams | LR | C = 0.01; Regularization: L2; N-Gram Range: (1, 3); max features: 25 000 | 0.856 [0.851, 0.860] | 0.778 [0.758, 0.789] | 0.748 [0.744, 0.766] | 0.261 [0.256, 0.270] | 0.157 [0.153, 0.164] | 0.184 [0.179, 0.194] |
| Bag of N-Grams | LR SGD | λ = 10^-5; Adaptive Learning Rate; η0 = 0.0001; Regularization: L1; Number of Iterations: 10^7; max features: All Vocabulary | 0.842 [0.838, 0.846] | 0.762 [0.740, 0.769] | 0.734 [0.730, 0.751] | 0.246 [0.242, 0.255] | 0.147 [0.144, 0.153] | 0.167 [0.163, 0.177] |
| Bag of N-Grams | MNB | α = 0.1; max features: All Vocabulary | 0.826 [0.822, 0.830] | 0.740 [0.731, 0.754] | 0.740 [0.726, 0.746] | 0.244 [0.237, 0.248] | 0.146 [0.141, 0.149] | 0.165 [0.157, 0.169] |
| Bag of N-Grams | CNB | α = 0.1; max features: All Vocabulary | 0.826 [0.822, 0.830] | 0.740 [0.731, 0.754] | 0.740 [0.726, 0.746] | 0.244 [0.237, 0.248] | 0.146 [0.141, 0.149] | 0.165 [0.157, 0.169] |
| Bag of N-Grams | SVM | C = 0.001; Regularization: L2 | 0.860 [0.856, 0.864] | 0.792 [0.747, 0.799] | 0.740 [0.737, 0.787] | 0.259 [0.253, 0.283] | 0.155 [0.151, 0.175] | 0.181 [0.176, 0.211] |
| Bag of N-Grams | SVM SGD | λ = 10^-5; Adaptive Learning Rate; η0 = 0.01; Regularization: L2; Number of Iterations: 10^7; Number iterations no change: 10; max features: All Vocabulary | 0.840 [0.836, 0.844] | 0.763 [0.730, 0.778] | 0.724 [0.712, 0.757] | 0.240 [0.238, 0.254] | 0.143 [0.123, 0.167] | 0.160 [0.125, 0.170] |
| TF-IDF | LR | C = 0.1; Regularization: L2; max features: 1000 | 0.842 [0.838, 0.846] | 0.768 [0.753, 0.779] | 0.731 [0.723, 0.746] | 0.246 [0.241, 0.254] | 0.147 [0.143, 0.152] | 0.167 [0.163, 0.176] |
| TF-IDF | LR SGD | λ = 10^-5; Adaptive Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 10^7; max features: 1000 | 0.829 [0.826, 0.833] | 0.772 [0.745, 0.782] | 0.698 [0.692, 0.723] | 0.227 [0.223, 0.236] | 0.133 [0.130, 0.140] | 0.144 [0.140, 0.154] |
| TF-IDF | MNB | α = 0.1; max features: 29000 | 0.832 [0.828, 0.836] | 0.757 [0.745, 0.771] | 0.731 [0.722, 0.740] | 0.243 [0.237, 0.249] | 0.145 [0.140, 0.149] | 0.163 [0.158, 0.170] |
| TF-IDF | CNB | α = 0.1; max features: 29000 | 0.831 [0.828, 0.835] | 0.762 [0.739, 0.777] | 0.723 [0.710, 0.740] | 0.239 [0.233, 0.246] | 0.142 [0.137, 0.148] | 0.159 [0.152, 0.167] |
| TF-IDF | SVM | C = 0.01; Regularization: L2; max features: 9500 | 0.855 [0.851, 0.859] | 0.754 [0.749, 0.774] | 0.766 [0.746, 0.769] | 0.267 [0.255, 0.272] | 0.162 [0.153, 0.165] | 0.191 [0.178, 0.196] |
| TF-IDF | SVM SGD | λ = 10^-5; Adaptive Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 10^7; Number iterations no change: 10; max features: 1000 | 0.824 [0.820, 0.827] | 0.738 [0.703, 0.766] | 0.716 [0.690, 0.746] | 0.229 [0.219, 0.241] | 0.136 [0.128, 0.145] | 0.148 [0.136, 0.162] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic; TF-IDF - Term Frequency-Inverse Document Frequency; η0 - Initial Learning Rate; α - Additive Smoothing; C - Inverse of Regularization Strength; λ - Regularization term


Chapter 6

Conclusion

6.1 Summary

In this work, ML techniques were applied to develop a clinical decision support system for predicting adult patients’ ED revisits within 72 hours after discharge. Patients’ physiological variables, demographics, and chief main complaint information, available at the time of triage, were used for modeling. To extract knowledge from the chief main complaint, an NLP framework was developed to preprocess the textual data.

During the data modeling step, several ML models were developed in order to identify the most suitable one. To understand the influence of the predictors on model performance, five hypotheses regarding the predictor sets were made. Two of those hypotheses contemplated only numerical data (Baseline and All Numeric). One hypothesis used only the chief main complaint textual information (Textual). The remaining two hypotheses considered both numerical and textual data (Textual and Baseline, and Textual and All Numeric). When the chief main complaint was used as a predictor, two feature extraction techniques (BoW and tf-idf) were considered in order to compare them. Due to the imbalance between classes, the models were developed under a cost-sensitive strategy so that the learning algorithm accounts for the imbalance.

To avoid model overfitting, the dataset was partitioned into two sets in a randomized and stratified fashion. One set was used for model development and the other for testing the model against unseen data. Regularization techniques were applied and the models were developed using 10-fold cross-validation. Randomized Search was used during the hyper-parameter tuning step to find the set of hyper-parameters that maximized the ROC-AUC score. Since many probability thresholds can be used to discriminate between classes, Youden’s Index was used to determine the cut-off probability that gives a compromise between Sensitivity and Specificity.
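Youden’s Index is J = Sensitivity + Specificity − 1, and the cut-off is the threshold that maximizes J. The following is a minimal sketch of that selection, assuming that a patient is classified as a revisit when the predicted probability is greater than or equal to the threshold. The function name `youden_threshold` and the toy scores are hypothetical, and the quadratic-time loop is chosen for readability rather than efficiency.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        # Predict "revisit" when the score is at or above the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy predicted probabilities, invented for illustration.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
threshold, j = youden_threshold(y_true, scores)
print(threshold, j)  # 0.35 0.75
```

In this toy example the selected cut-off is 0.35, where Sensitivity is 1.0 and Specificity is 0.75. Ties are resolved in favor of the lowest threshold, which is a choice of this sketch.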

During the model assessment phase, several performance metrics were considered to best describe the quality of the model. Due to the stochastic nature of these statistical techniques, 95% confidence intervals were computed using bootstrapping to account for the uncertainty regarding these models. To compare ML models, McNemar’s Hypothesis Test was used to verify whether the difference in error rate between two models is statistically significant, given a 5% significance level.

6.2 Final Remarks

As already stated, several ML algorithms and sets of predictors were considered and compared in this study.

Regarding the Baseline hypothesis, the models developed were not capable of discriminating the true state of each patient in the cohort. This is noticeable since the best model, LR with SGD, achieves a ROC-AUC, Sensitivity, Specificity, F1-Score, Precision, and Cohen’s Kappa (κ) of 0.609, 0.544, 0.615, 0.146, 0.0837, and 0.0517, respectively, as shown in table 6.1.

Under the All Numeric hypothesis, the performance of the developed models improves; the results for the best model, LR, are shown in table 6.1. The performance measures most affected were the ROC-AUC and Sensitivity, with improvements of 0.173 and 0.187, respectively. Despite the improvements, there is a gap between Sensitivity and Specificity, with Sensitivity > Specificity.

Considering the Textual hypothesis, there was a slight improvement in ROC-AUC, but there is still a gap between Sensitivity and Specificity. Unlike the previous hypothesis, under the present one Specificity > Sensitivity for the best model, LR with tf-idf features.

The relation between Sensitivity and Specificity under the two previous hypotheses suggests that using only numerical data generates models that are better at correctly identifying patients who will revisit the ED, whereas using only the chief main complaint generates models that are better at predicting patients who will not revisit the ED. This suggests that using both numerical and textual data can generate models capable of predicting both classes.

Nevertheless, the results of the best model under the Textual and Baseline hypothesis show little to no improvement in prediction capability when compared with the best model under the Textual hypothesis. As under the previous hypotheses, there is a gap between Sensitivity and Specificity under the present hypothesis.

Assessing the last hypothesis, Textual and All Numeric, there is a significant increase in ROC-AUC to 0.842, an increase of 5.1 percentage points over the previous hypothesis, together with a compromise between Sensitivity and Specificity. Despite the high values of Sensitivity and Specificity, the model’s Precision is low. This means that the model is capable of predicting the minority class (Revisit) with few False Negatives, but it assigns many patients to the Revisit class when they did not revisit the ED, i.e. the model produces many False Positives.
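The low Precision follows directly from the class imbalance. The sketch below recomputes Precision from Sensitivity, Specificity, and prevalence using Bayes’ rule, assuming that the test set has the same 5.7% revisit rate reported for the whole cohort (plausible under a stratified split, but an assumption nonetheless). The function name `precision_from_rates` is hypothetical.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision implied by sensitivity, specificity, and class prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Best model under the Textual and All Numeric hypothesis (LR with tf-idf).
precision = precision_from_rates(sensitivity=0.768, specificity=0.731, prevalence=0.057)
print(f"{precision:.3f}")  # 0.147
```

The result, approximately 0.147, agrees with the Precision reported for this model: because non-revisits outnumber revisits by roughly 16 to 1, even a false positive rate of about 27% produces far more False Positives than True Positives.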

Table 6.1: Best results for each hypothesis.

| Hypothesis | Model | ROC-AUC | Sensitivity | Specificity | F1-Score | Precision | κ |
|---|---|---|---|---|---|---|---|
| HB | LR SGD | 0.609 | 0.544 | 0.615 | 0.146 | 0.0837 | 0.0517 |
| HAN | LR | 0.782 | 0.731 | 0.674 | 0.204 | 0.118 | 0.118 |
| HT | LR tf-idf | 0.789 | 0.688 | 0.741 | 0.230 | 0.138 | 0.149 |
| HTB | SVM tf-idf | 0.791 | 0.692 | 0.739 | 0.230 | 0.138 | 0.149 |
| HTAN | LR tf-idf | 0.842 | 0.768 | 0.731 | 0.246 | 0.147 | 0.167 |

HB - Baseline; HAN - All Numeric; HT - Textual; HTB - Textual and Baseline; HTAN - Textual and All Numeric; κ - Cohen’s Kappa

Based on all the developed work and on what was stated above, the three research questions can now be answered:

• RQ1 - Does the textual data increase the revisits prediction power?

Answer: Given the obtained results, textual data enhances the prediction of adult patient ED revisits within 72 hours after discharge.

• RQ2 - Which ML model and textual feature extraction technique are most suitable for predicting those group-risk patients?

Answer: From the results, the most suitable ML algorithm is LR with tf-idf features.

• RQ3 - Which features best describe the risk of ED revisits?

Answer: The combination of features that produced the best results is that of the Textual and All Numeric hypothesis.

In accordance with the results and conclusions, the main chief complaint contained relevant information indicating that a patient belonged to a risk group. Using this data source alongside patient information gathered at the moment of triage resulted in models with good predictive performance. This work contributed to advancing knowledge within the area and indicated a promising way to develop clinical decision support systems to predict adult patient ED revisits within 72 hours after discharge.


6.3 Future Work

Since the topic of predicting ED revisits has not been extensively explored in the literature, there are still opportunities for further enhancements. The following paragraphs present some of the author’s suggestions for future work.

Regarding textual feature extraction techniques, it would be interesting to develop a Word Embedding representation to capture semantic similarity beyond what language models based on N-Grams can capture. One could resort to the Word2Vec [106] or Doc2Vec [107] approaches. These word representations can be trained on biomedical texts such as published papers and books.

Considering the low Precision (due to class imbalance), it would be worthwhile to consider ensemble methods like bagging or boosting. Ensemble methods improve the prediction capability of ML methods by combining multiple weak learners, which can produce better predictions than a single model. Another suggestion is to use Fuzzy Fingerprinting [108], given its promising results in authorship identification.

Given the recent advances in the fields of ML and AI, it is appealing to resort to Deep Learning methods like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), e.g. Long Short-Term Memory Networks (LSTMNs). CNNs are well suited to hierarchical or spatial data and to learning features directly from raw inputs such as written characters. LSTMNs are well suited to temporal or sequential data, like words in a body of text. LSTMNs are a variant of RNNs that control how much of the previously seen sequence is remembered. One could also combine both networks into a CNN+LSTMN architecture, which uses CNN layers for feature extraction on the input data combined with LSTMN layers to support sequence prediction.


Bibliography

[1] S. Ram, W. Zhang, M. Williams, and Y. Pengetnze, “Predicting asthma-related emergency department visits using big data,” IEEE Journal of Biomedical and Health Informatics, vol. 19, pp. 1216–1223, July 2015.

[2] M. Adibuzzaman, P. DeLaurentis, J. Hill, and B. Benneyworth, “Big data in healthcare - the promises, challenges and opportunities from a research perspective: A case study with a model database,” AMIA Annual Symposium proceedings. AMIA Symposium, vol. 2017, pp. 384–392, April 2018.

[3] V. Yadav, M. Verma, and V. D. Kaushik, “Big data analytics for health systems,” in 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), pp. 253–258, October 2015.

[4] M. Wozniak, B. Cyganek, M. Grana, B. Krawczyk, A. Kasprzak, P. Porwik, and K. Walkowiak, “A survey of big data issues in electronic health record analysis,” Applied Artificial Intelligence, vol. 30, pp. 497–520, 07 2016.

[5] H.-J. Kong, “Managing unstructured big data in healthcare system,” Healthcare Informatics Research, vol. 25, p. 1, 01 2019.

[6] C. Kruse, R. Goswamy, Y. Raval, and S. Marawi, “Challenges and opportunities of big data in health care: A systematic review,” JMIR Medical Informatics, vol. 4, p. e38, 11 2016.

[7] S. Meystre, G. Savova, K. Kipper-Schuler, and J. Hurdle, “Extracting information from textual documents in the electronic health record: A review of recent research,” Yearb Med Inform, pp. 128–144, 11 2007.

[8] C. Martin-Gill and R. C. Reiser, “Risk factors for 72-hour admission to the ed,” The American Journal of Emergency Medicine, vol. 22, no. 6, pp. 448–453, 2004.

[9] C.-L. Wu, F.-T. Wang, Y.-C. Chiang, Y.-F. Chiu, T.-G. Lin, L.-F. Fu, and T.-L. Tsai, “Unplanned emergency department revisits within 72 hours to a secondary teaching referral hospital in taiwan,” The Journal of emergency medicine, vol. 38, pp. 512–7, 11 2008.


[10] S. Verelst, S. Pierloot, D. Desruelles, J.-B. Gillet, and J. Bergs, “Short-term unscheduled return visits of adult patients to the emergency department,” The Journal of Emergency Medicine, vol. 47, pp. 131–139, 08 2014.

[11] S.-Y. Cheng, H.-T. Wang, C.-W. Lee, T.-C. Tsai, C.-W. Hung, and K.-H. Wu, “The characteristics and prognostic predictors of unplanned hospital admission within 72 hours after ed discharge,” The American journal of emergency medicine, vol. 31, 09 2013.

[12] J. A. Gordon, L. C. An, R. A. Hayward, and B. C. Williams, “Initial emergency department diagnosis and return visits: Risk versus perception,” Annals of Emergency Medicine, vol. 32, no. 5, pp. 569–573, 1998.

[13] B. Graham, R. Bond, M. Quinn, and M. Mulvenna, “Using data mining to predict hospital admissions from the emergency department,” IEEE Access, vol. 6, pp. 10458–10469, 2018.

[14] S.-C. Hu, “Analysis of patient revisits to the emergency department,” The American Journal of Emergency Medicine, vol. 10, no. 4, pp. 366–370, 1992.

[15] E. B. Kulstad, R. Sikka, R. T. Sweis, K. M. Kelley, and K. H. Rzechula, “Ed overcrowding is associated with an increased frequency of medication errors,” The American Journal of Emergency Medicine, vol. 28, no. 3, pp. 304–309, 2010.

[16] C. van Walraven, I. A Dhalla, C. Bell, E. Etchells, I. G Stiell, K. Zarnke, P. Austin, and A. J Forster, “Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community,” CMAJ : Canadian Medical Association journal = journal de l’Association medicale canadienne, vol. 182, pp. 551–7, 03 2010.

[17] F. Ferreira, “Serial evaluation of the sofa score to predict outcome in critically ill patients,” JAMA, vol. 286, p. 1754, 10 2001.

[18] Z. Obermeyer and E. J. Emanuel, “Predicting the future — big data, machine learning, and clinical medicine,” The New England journal of medicine, vol. 375, pp. 1216–1219, 09 2016.

[19] E. K. Lee, F. Yuan, D. A. Hirsh, M. D. Mallory, and H. K. Simon, “A clinical decision tool for predicting patient care characteristics: patients returning within 72 hours in the emergency department,” AMIA Annu Symp Proc, vol. 2012, pp. 495–504, 2012.

[20] D. Hooijenga, R. Phan, V. Augusto, X. Xie, and A. Redjaline, “Discriminant analysis and feature selection for emergency department readmission prediction,” pp. 836–842, 11 2018.

[21] F. Meng, K. L. Teow, K. Wee Sheng Teo, C. Kheong Ooi, and S.-Y. Tay, “Predicting 72-hour reattendance in emergency departments using discriminant analysis via mixed integer programming with electronic medical records,” Journal of Industrial & Management Optimization, vol. 15, pp. 947–962, 04 2019.

[22] G. Pellerin, K. Gao, and L. Kaminsky, “Predicting 72-hour emergency department revisits,” The American Journal of Emergency Medicine, vol. 36, no. 3, pp. 420–424, 2018.


[23] W. S. Hong, A. D. Haimovich, and R. A. Taylor, “Predicting hospital admission at emergency department triage using machine learning,” in PloS one, 2018.

[24] O. M. Araz, D. Olson, and A. Ramirez-Nafarrate, “Predictive analytics for hospital admissions from the emergency department using triage information,” International Journal of Production Economics, vol. 208, pp. 199–207, 2019.

[25] X. Zhang, J. Kim, R. Patzer, S. Pitts, A. Patzer, and J. Schrager, “Prediction of emergency department hospital admission based on natural language processing and neural networks*,” Methods of Information in Medicine, vol. 56, 08 2017.

[26] F. Lucini, F. Fogliatto, G. da Silveira, J. Neyeloff, M. Anzanello, R. Kuchenbecker, and B. D. Schaan, “Text mining approach to predict hospital admissions using early medical records from the emergency department,” International Journal of Medical Informatics, vol. 100, 01 2017.

[27] D. Teubner, J. Considine, P. Hakendorf, S. Kim, and A. D Bersten, “Model to predict inpatient mortality from information gathered at presentation to an emergency department: The triage information mortality model (timm),” Emergency medicine Australasia : EMA, vol. 27, 07 2015.

[28] R. A. Taylor, J. R. Pare, A. K. Venkatesh, H. Mowafi, E. R. Melnick, W. Fleischman, and M. K. Hall, “Prediction of in-hospital mortality in emergency department patients with sepsis: A local big data-driven, machine learning approach,” Academic Emergency Medicine, vol. 23, no. 3, pp. 269–278, 2016.

[29] W. Chapman, J. Dowling, and M. M Wagner, “Classification of emergency department chief complaints into 7 syndromes: A retrospective analysis of 527,228 patients,” Annals of emergency medicine, vol. 46, pp. 445–55, 12 2005.

[30] L. Christensen, P. Haug, and M. Fiszman, “Mplus: a probabilistic medical language understanding system,” in Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, (Philadelphia, Pennsylvania, USA), p. 29–36, Association for Computational Linguistics, July 2002.

[31] D. Thompson, D. Eitel, C. Fernandes, J. Pines, J. Amsterdam, and S. J Davidson, “Coded chief complaints-automated analysis of free-text complaints,” Academic emergency medicine : official journal of the Society for Academic Emergency Medicine, vol. 13, pp. 774–82, 08 2006.

[32] Y. Jernite and Y. Halpern, “Predicting chief complaints at triage time in the emergency department,” 2013.

[33] W. Chapman, W. Bridewell, P. Hanbury, G. F. Cooper, and B. Buchanan, “A simple algorithm for identifying negated findings and diseases in discharge summaries,” Journal of Biomedical Informatics, vol. 34, pp. 301–310, 11 2001.

[34] D. A. Travers and S. W. Haas, “Using nurses’ natural language entries to build a concept-oriented terminology for patients’ chief complaints in the emergency department,” Journal of Biomedical Informatics, vol. 36, no. 4, pp. 260–270, 2003. Building Nursing Knowledge through Informatics: From Concept Representation to Data Mining.

[35] O. Bodenreider, “The Unified Medical Language System (UMLS): integrating biomedical terminol-

ogy,” Nucleic Acids Research, vol. 32, pp. D267–D270, 01 2004.

[36] J. McCarthy and E. A. Feigenbaum, “In memoriam: Arthur samuel - pioneer in machine learning.,”

AI Magazine, vol. 11, no. 3, pp. 10–11, 1990.

[37] A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM Journal of

Research and Development, vol. 3, pp. 210–229, July 1959.

[38] T. M. Mitchell, Machine learning. McGraw Hill series in computer science, McGraw-Hill, 1997.

[39] M. Khanam, T. Mahboob, W. Imtiaz, H. Abdul Ghafoor, and R. Sehar, “A survey on unsupervised

machine learning algorithms for automation, classification and maintenance,” International Journal

of Computer Applications, vol. 119, pp. 34–39, 06 2015.

[40] A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. P. Patel, A. Tiwari, M. J. Er, W. Ding, and C.-T. Lin,

“A review of clustering techniques and developments,” Neurocomputing, vol. 267, pp. 664 – 681,

2017.

[41] J. Uthayakumar, T. Vengattaraman, and P. Dhavachelvan, “A survey on data compression tech-

niques: From the perspective of data quality, coding schemes, data type and applications,” Journal

of King Saud University - Computer and Information Sciences, 2018.

[42] S. Agrawal and J. Agrawal, “Survey on anomaly detection using data mining techniques,” Procedia

Computer Science, vol. 60, pp. 708 – 713, 2015. Knowledge-Based and Intelligent Information &

Engineering Systems 19th Annual Conference, KES-2015, Singapore, September 2015 Proceed-

ings.

[43] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” CoRR,

vol. cs.AI/9605103, 1996.

[44] R. J. Mcfarlane, “A survey of exploration strategies in reinforcement learning,” 2003.

[45] A. Y. Ng and M. I. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic

regression and naive bayes,” in Advances in Neural Information Processing Systems 14 (T. G.

Dietterich, S. Becker, and Z. Ghahramani, eds.), pp. 841–848, MIT Press, 2002.

[46] V. Vapnik, “Principles of risk minimization for learning theory,” in Advances in Neural Information

Processing Systems 4 (J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds.), pp. 831–838,

Morgan-Kaufmann, 1992.

[47] H. Zhang, “The optimality of naive bayes,” vol. 2, 01 2004.

[48] C. D. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. New York, NY,

USA: Cambridge University Press, 2008.

82

[49] G. H. John and P. Langley, “Estimating continuous distributions in bayesian classifiers,” CoRR,

vol. abs/1302.4964, 2013.

[50] J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger, “Tackling the poor assumptions of naive

bayes text classifiers,” in Proceedings of the Twentieth International Conference on International

Conference on Machine Learning, ICML’03, pp. 616–623, AAAI Press, 2003.

[51] B. Boser, I. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifier,” Proceed-

ings of the Fifth Annual ACM Workshop on Computational Learning Theory, vol. 5, 08 1996.

[52] V. VAPNIK, “Pattern recognition using generalized portrait method,” Automation and Remote Con-

trol, vol. 24, pp. 774–780, 1963.

[53] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in

Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, (New

York, NY, USA), pp. 144–152, ACM, 1992.

[54] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, pp. 273–297, Sept.

1995.

[55] T. Wen and A. Edelman, “Support vector machine lagrange multipliers and simplex volume de-

compositions,” 10 2000.

[56] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “Liblinear: A library for large linear

classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, June 2008.

[57] S. Pan, J. Wu, X. Zhu, G. Long, and C. Zhang, “Finding the best not the most: regularized loss

minimization subgraph selection for graph classification,” Pattern Recognition, vol. 48, no. 11,

pp. 3783 – 3796, 2015.

[58] D. Kansagara, H. Englander, A. Salanitro, D. Kagen, C. Theobald, M. Freeman, and S. Kripalani,

“Risk prediction models for hospital readmission a systematic review,” JAMA : the journal of the

American Medical Association, vol. 306, pp. 1688–98, 10 2011.

[59] E. Wallace, E. Stuart, N. Vaughan, K. Bennett, T. Fahey, and S. Smith, “Risk prediction models to

predict emergency hospital admission in community-dwelling adults a systematic review,” Medical

care, vol. 52, pp. 751–65, 08 2014.

[60] A. Turkman, “Statistical intervals: A guide for practitioners and researchers, second edition, by

william q. meeker, gerald j. hahn, and louis a. escobar. wiley series in probability and statistics,

published by john wiley & sons, 2017. total number of pages: 35+592. isbn: 978-0-4716-8717-7,”

Journal of Time Series Analysis, vol. 39, 02 2018.

[61] P. R. Cohen, Empirical Methods for Artificial Intelligence. Cambridge, MA, USA: MIT Press, 1995.

[62] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. No. 57 in Monographs on Statistics

and Applied Probability, Boca Raton, Florida, USA: Chapman & Hall/CRC, 1993.

83

[63] J. H. Zar, Biostatistical analysis. Upper Saddle River, N.J. : Prentice-Hall, 1999.

[64] B. Efron, “Nonparametric standard errors and confidence intervals,” The Canadian Journal of

Statistics / La Revue Canadienne de Statistique, vol. 9, no. 2, pp. 139–158, 1981.

[65] T. G. Dietterich, “Approximate statistical tests for comparing supervised classification learning

algorithms,” Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.

[66] C. W. Morris, Writings on the General Theory of Signs. The Hague: Mouton, 1971.

[67] V. Raskin, “Linguistics and natural language processing,” in Machine Translation: Theoretical and

Methodological Issues, pp. 42–58, University Press, 1987.

[68] R. M. Kempson and A. Cormack, “Ambiguity and quantification,” Linguistics and Philosophy, vol. 4,

pp. 259–309, Jun 1981.

[69] R. Kittredge and J. Lehrberger, Sublanguage: Studies of language in restricted semantic domains.

04 2015.

[70] R. Grishman, “Adaptive information extraction and sublanguage analysis,” 2001.

[71] S. Wolff, “The use of morphosemantic regularities in the medical vocabulary for automatic lexical

coding,” Methods of information in medicine, vol. 23, pp. 195–203, 11 1984.

[72] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural

language processing [review article],” IEEE Computational Intelligence Magazine, vol. 13, pp. 55–

75, 08 2018.

[73] R. Nallapati, B. Zhou, C. N. dos Santos, Caglar Gulcehre, and B. Xiang, “Abstractive text summa-

rization using sequence-to-sequence rnns and beyond,” in CoNLL, 2016.

[74] J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, vol. 349,

no. 6245, pp. 261–266, 2015.

[75] M. A. Attia, “Arabic tokenization system,” in Proceedings of the 2007 Workshop on Computational

Approaches to Semitic Languages: Common Issues and Resources, Semitic ’07, (Stroudsburg,

PA, USA), pp. 65–72, Association for Computational Linguistics, 2007.

[76] J. J. Webster and C. Kit, “Tokenization as the initial phase in nlp,” in Proceedings of the 14th Con-

ference on Computational Linguistics - Volume 4, COLING ’92, (Stroudsburg, PA, USA), pp. 1106–

1110, Association for Computational Linguistics, 1992.

[77] B. Allison, D. Guthrie, and L. Guthrie, “Another look at the data sparsity problem,” in Text, Speech

and Dialogue (P. Sojka, I. Kopecek, and K. Pala, eds.), (Berlin, Heidelberg), pp. 327–334, Springer

Berlin Heidelberg, 2006.

[78] N. Barrett and J. Weber, “Building a biomedical tokenizer using the token lattice design pattern

and the adapted viterbi algorithm,” BMC bioinformatics, vol. 12 Suppl 3, p. S1, 06 2011.

84

[79] J. Grana, M. A. Alonso, and M. Vilares, “A common solution for tokenization and part-of-speech

tagging,” in Text, Speech and Dialogue (P. Sojka, I. Kopecek, and K. Pala, eds.), (Berlin, Heidel-

berg), pp. 3–10, Springer Berlin Heidelberg, 2002.

[80] B. Jurish and K.-M. Wurzner, “Word and sentence tokenization with hidden markov models,” JLCL,

vol. 28, pp. 61–83, 01 2013.

[81] F. A. Shamsi and A. Guessoum, “A hidden markov model -based pos tagger for arabic,” 2006.

[82] T. Mizumoto and R. Nagata, “Analyzing the impact of spelling errors on pos-tagging and chunking

in learner english,” in NLP-TEA@IJCNLP, 2017.

[83] L. La, Q. Guo, D. Yang, and Q. Cao, “Improved viterbi algorithm-based hmm2 for chinese words

segmentation,” in 2012 International Conference on Computer Science and Electronics Engineer-

ing, vol. 1, pp. 266–269, March 2012.

[84] K. Min, W. Wilson, and Y.-J. Moon, “Typographical and orthographical spelling error correction,”

04 2019.

[85] L. Boytsov, “Indexing methods for approximate dictionary searching: Comparative analysis,” J.

Exp. Algorithmics, vol. 16, pp. 1.1:1.1–1.1:1.91, May 2011.

[86] J. P. Carvalho and L. Coheur, “Introducing uws - a fuzzy based word similarity function with good

discrimination capability: Preliminary results,” in 2013 IEEE International Conference on Fuzzy

Systems (FUZZ-IEEE), pp. 1–8, July 2013.

[87] W. Winkler, “String comparator metrics and enhanced decision rules in the fellegi-sunter model of

record linkage,” Proceedings of the Section on Survey Research Methods, 01 1990.

[88] E. H. Porter, W. E. Winkler, B. O. T. Census, and B. O. T. Census, “Approximate string comparison

and its effect on an advanced record linkage system,” in Advanced Record Linkage System. U.S.

Bureau of the Census, Research Report, pp. 190–199, 1997.

[89] Y. Wang, J. Qin, and W. Wang, “Efficient approximate entity matching using jaro-winkler distance,”

in Web Information Systems Engineering – WISE 2017 (A. Bouguettaya, Y. Gao, A. Klimenko,

L. Chen, X. Zhang, F. Dzerzhinskiy, W. Jia, S. V. Klimenko, and Q. Li, eds.), (Cham), pp. 231–239,

Springer International Publishing, 2017.

[90] P. Christen, “A comparison of personal name matching: Techniques and practical issues,” in ‘The

Second International Workshop on Mining Complex Data (MCD’06), 12 2006.

[91] K. Dreßler and A.-C. Ngonga Ngomo, “On the efficient execution of bounded jaro-winkler dis-

tances,” 09 2015.

[92] M. Madkour, D. Benhaddou, and C. Tao, “Temporal data representation, normalization, extraction,

and reasoning: A review from clinical domain,” Computer Methods and Programs in Biomedicine,

vol. 128, pp. 52 – 68, 2016.

85

[93] V. M. Orengo and C. Huyck, “A stemming algorithm for the portuguese language,” in Proceedings

Eighth Symposium on String Processing and Information Retrieval, pp. 186–193, Nov 2001.

[94] A. Kulmizev, B. Blankers, J. Bjerva, M. Nissim, G. van Noord, B. Plank, and M. Wieling, “The

power of character n-grams in native language identification,” in The 12th Workshop on Innovative

Use of NLP for Building Educational Applications, pp. 382–389, Association for Computational

Linguistics (ACL), 2017.

[95] B. Gencosman, H. Ozmutlu, and S. Ozmutlu, “Character n-gram application for automatic new

topic identification,” Information Processing & Management, vol. 50, p. 821–856, 11 2014.

[96] A. M Robertson and P. Willett, “Applications of n-grams in textual information systems,” Journal of

Documentation, vol. 54, pp. 48–67, 01 1998.

[97] M. Plus. https://medlineplus.gov/vitalsigns.html. [Online; accessed 06-May-2018].

[98] Medscape. https://emedicine.medscape.com/article/2172054-overview. [Online; accessed

06-May-2018].

[99] V. Salmasi, K. Maheshwari, D. Yang, E. J. Mascha, A. Singh, D. I. Sessler, and A. Kurz, “Relation-

ship between intraoperative hypotension, defined by either reduction from baseline or absolute

thresholds, and acute kidney and myocardial injury after noncardiac surgery a retrospective co-

hort analysis,” Anesthesiology: The Journal of the American Society of Anesthesiologists, vol. 126,

no. 1, pp. 47–65, 2017.

[100] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn.

Res., vol. 13, pp. 281–305, Feb. 2012.

[101] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “Liblinear: A library for large linear

classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, June 2008.

[102] W. YOUDEN, “Index for rating diagnostic tests,” Cancer, vol. 3, p. 32—35, January 1950.

[103] M. Knapman MN BHScN GCEd RN and A. Bonner, “Overcrowding in medium-volume emergency

departments: Effects of aged patients in emergency departments on wait times for non-emergent

triage-level patients,” International Journal of Nursing Practice, vol. 16, pp. 310 – 317, 06 2010.

[104] A. Guttmann, M. J. Schull, M. J. Vermeulen, and T. A. Stukel, “Association between waiting times

and short term mortality and hospital admission after departure from emergency department:

population based cohort study from ontario, canada,” BMJ, vol. 342, 2011.

[105] S. Vilpert, S. Monod, H. J. Ruedin, J. Maurer, L. Trueb, B. Yersin, and C. J. Bula, “Differences

in triage category, priority level and hospitalization rate between young-old and old-old patients

visiting the emergency department,” in BMC health services research, 2018.

[106] T. Mikolov, G. Corrado, K. Chen, and J. Dean, “Efficient estimation of word representations in

vector space,” pp. 1–12, 01 2013.

86

https://medlineplus.gov/vitalsigns.html

https://emedicine.medscape.com/article/2172054-overview

[107] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” CoRR,

vol. abs/1405.4053, 2014.

[108] N. Homem and J. P. Carvalho, “Authorship identification and author fuzzy “fingerprints”,” in 2011

Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1–6, March

2011.

87

Appendix A

Results

A.1 Baseline Results


Table A.1: Results for the machine learning models in test using baseline numerical features with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen's Kappa |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LR | C = 0.01; Regularization: L2 | 0.609 [0.603, 0.615] | 0.544 [0.537, 0.572] | 0.615 [0.607, 0.659] | 0.146 [0.141, 0.152] | 0.0837 [0.0801, 0.0884] | 0.0517 [0.0475, 0.0603] |
| LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.001; Regularization: L2; Number of Iterations: 107 | 0.609 [0.603, 0.614] | 0.567 [0.545, 0.578] | 0.63 [0.62, 0.659] | 0.146 [0.142, 0.15] | 0.0838 [0.081, 0.0866] | 0.0524 [0.0486, 0.0581] |
| MNB | α = 0.001 | 0.583 [0.574, 0.588] | 0.588 [0.574, 0.592] | 0.546 [0.502, 0.551] | 0.129 [0.125, 0.131] | 0.072 [0.0701, 0.0736] | 0.0305 [0.0278, 0.0327] |
| CNB | α = 0.001 | 0.583 [0.574, 0.588] | 0.588 [0.574, 0.592] | 0.546 [0.502, 0.551] | 0.129 [0.125, 0.131] | 0.072 [0.0701, 0.0736] | 0.0305 [0.0278, 0.0327] |
| SVM | C = 0.01; Regularization: L2 | 0.608 [0.603, 0.614] | 0.574 [0.546, 0.587] | 0.617 [0.607, 0.648] | 0.144 [0.14, 0.15] | 0.0825 [0.0799, 0.087] | 0.0502 [0.0468, 0.0572] |
| SVM SGD | α = 0.01; η0 = 0.1; Regularization: L2; Number iterations no change: 10 | 0.522 [0.515, 0.529] | 0.552 [0.541, 0.562] | 0.479 [0.476, 0.482] | 0.108 [0.105, 0.11] | 0.0597 [0.058, 0.0613] | 0.00634 [0.00405, 0.00838] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic
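The bracketed intervals in the tables of this appendix are 95% bootstrap confidence intervals. As a rough illustration of how such an interval can be obtained, the following sketch computes a percentile bootstrap interval for sensitivity and Cohen's kappa from binary labels and predictions. It is a minimal sketch under stated assumptions, not the procedure used in this work: the percentile method, the number of resamples (`n_resamples`), the seed, and the synthetic labels are all assumptions made for the example.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(y_true, y_pred):
    """True positive rate; 0.0 if the sample contains no positives."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample cases with replacement,
    recompute the metric each time, and read off the empirical quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_resamples)]
    upper = stats[int((1 + level) / 2 * n_resamples) - 1]
    return metric(y_true, y_pred), lower, upper


if __name__ == "__main__":
    # Synthetic, imbalanced example data (hypothetical, not study data).
    y_true = [1] * 10 + [0] * 90
    y_pred = [1] * 7 + [0] * 3 + [1] * 20 + [0] * 70
    for name, metric in [("Sensitivity", sensitivity), ("Cohen's Kappa", cohen_kappa)]:
        point, lower, upper = bootstrap_ci(y_true, y_pred, metric)
        print(f"{name}: {point:.3g} [{lower:.3g}, {upper:.3g}]")
```

The printed format mirrors the Performance [lower, upper] layout of the tables. The same `bootstrap_ci` helper accepts any function of `(y_true, y_pred)`, so specificity, precision, or F1-score could be passed in the same way.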

A.2 All Numerical Results


Table A.2: Results for the machine learning models in test using all of the numerical features with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen's Kappa |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LR | C = 0.01; Regularization: L2 | 0.782 [0.778, 0.785] | 0.731 [0.72, 0.742] | 0.674 [0.663, 0.678] | 0.204 [0.199, 0.207] | 0.118 [0.115, 0.121] | 0.118 [0.114, 0.121] |
| LR SGD | α = 10^-5; η0 = 0.0001; Regularization: L1 | 0.779 [0.775, 0.782] | 0.733 [0.72, 0.757] | 0.664 [0.647, 0.675] | 0.200 [0.195, 0.203] | 0.116 [0.112, 0.118] | 0.113 [0.108, 0.117] |
| MNB | α = 0.001 | 0.727 [0.723, 0.731] | 0.704 [0.678, 0.716] | 0.605 [0.601, 0.625] | 0.170 [0.166, 0.174] | 0.0969 [0.094, 0.0995] | 0.0780 [0.0754, 0.0822] |
| CNB | α = 0.001 | 0.727 [0.723, 0.731] | 0.704 [0.678, 0.716] | 0.605 [0.601, 0.625] | 0.170 [0.166, 0.174] | 0.0969 [0.094, 0.0995] | 0.0786 [0.0754, 0.0822] |
| SVM | C = 0.0001; Regularization: L2 | 0.782 [0.778, 0.785] | 0.734 [0.723, 0.768] | 0.672 [0.644, 0.678] | 0.203 [0.197, 0.207] | 0.118 [0.114, 0.121] | 0.117 [0.111, 0.121] |
| SVM SGD | α = 10^-5; Constant Learning Rate; η0 = 0.001; Regularization: L1; Number iterations no change: 10 | 0.763 [0.759, 0.768] | 0.696 [0.68, 0.721] | 0.683 [0.659, 0.701] | 0.2 [0.194, 0.204] | 0.117 [0.112, 0.12] | 0.114 [0.107, 0.119] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic

A.3 Textual Results


Table A.3: Results for the machine learning models in test using only the main chief complaint with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Features | Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen's Kappa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bag of N-Grams | LR | C = 0.01; Regularization: L2; max features: All Vocabulary | 0.786 [0.78, 0.791] | 0.698 [0.675, 0.715] | 0.725 [0.708, 0.749] | 0.223 [0.215, 0.235] | 0.133 [0.127, 0.141] | 0.141 [0.132, 0.155] |
| Bag of N-Grams | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.0001; Regularization: L1; Number of Iterations: 107; max features: All Vocabulary | 0.783 [0.777, 0.788] | 0.685 [0.668, 0.701] | 0.739 [0.722, 0.754] | 0.227 [0.22, 0.236] | 0.136 [0.131, 0.143] | 0.147 [0.138, 0.156] |
| Bag of N-Grams | MNB | α = 0.5; max features: 29000 | 0.762 [0.756, 0.767] | 0.673 [0.661, 0.684] | 0.735 [0.724, 0.739] | 0.221 [0.214, 0.225] | 0.132 [0.127, 0.135] | 0.139 [0.133, 0.143] |
| Bag of N-Grams | CNB | α = 0.5; max features: 25000 | 0.764 [0.758, 0.769] | 0.687 [0.673, 0.702] | 0.713 [0.703, 0.729] | 0.213 [0.207, 0.22] | 0.126 [0.122, 0.131] | 0.13 [0.124, 0.137] |
| Bag of N-Grams | SVM | C = 0.01; Regularization: L2; max features: All Vocabulary | 0.786 [0.78, 0.791] | 0.692 [0.678, 0.709] | 0.73 [0.717, 0.744] | 0.225 [0.218, 0.232] | 0.134 [0.129, 0.14] | 0.143 [0.136, 0.152] |
| Bag of N-Grams | SVM SGD | α = 10^-5; Constant Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 107; Number iterations no change: 10; max features: All Vocabulary | 0.727 [0.719, 0.733] | 0.63 [0.605, 0.645] | 0.701 [0.685, 0.743] | 0.192 [0.186, 0.206] | 0.113 [0.109, 0.124] | 0.106 [0.0996, 0.123] |
| TF-IDF | LR | C = 0.01; Regularization: L2; max features: 29000 | 0.789 [0.783, 0.794] | 0.688 [0.677, 0.703] | 0.741 [0.729, 0.75] | 0.23 [0.223, 0.235] | 0.138 [0.133, 0.142] | 0.149 [0.142, 0.155] |
| TF-IDF | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.0001; Regularization: L1; Number of Iterations: 107; max features: All Vocabulary | 0.789 [0.783, 0.794] | 0.69 [0.659, 0.712] | 0.733 [0.718, 0.771] | 0.228 [0.218, 0.244] | 0.136 [0.13, 0.149] | 0.147 [0.137, 0.166] |
| TF-IDF | MNB | α = 0.25; max features: 29000 | 0.774 [0.768, 0.779] | 0.693 [0.67, 0.704] | 0.715 [0.711, 0.745] | 0.216 [0.211, 0.227] | 0.128 [0.124, 0.136] | 0.133 [0.128, 0.147] |
| TF-IDF | CNB | α = 0.25; max features: 29000 | 0.774 [0.768, 0.779] | 0.69 [0.667, 0.703] | 0.722 [0.706, 0.743] | 0.218 [0.211, 0.227] | 0.129 [0.124, 0.137] | 0.135 [0.128, 0.146] |
| TF-IDF | SVM | C = 0.0001; Regularization: L2; max features: 25000 | 0.788 [0.782, 0.794] | 0.686 [0.668, 0.702] | 0.748 [0.73, 0.762] | 0.231 [0.222, 0.241] | 0.139 [0.132, 0.147] | 0.151 [0.142, 0.163] |
| TF-IDF | SVM SGD | α = 10^-5; Constant Learning Rate; η0 = 0.001; Regularization: L1; Number of Iterations: 107; Number iterations no change: 10; max features: All Vocabulary | 0.742 [0.735, 0.748] | 0.655 [0.641, 0.669] | 0.689 [0.674, 0.695] | 0.191 [0.185, 0.196] | 0.111 [0.108, 0.115] | 0.104 [0.0987, 0.109] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic; TF-IDF - Term Frequency-Inverse Document Frequency
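The textual results distinguish two feature representations, Bag of N-Grams and TF-IDF, each optionally capped by a "max features" limit. The sketch below illustrates the general idea on toy data: n-gram counts are collected, the vocabulary is truncated to the most frequent n-grams, and the counts are then reweighted by inverse document frequency. It is an illustration under assumptions rather than the preprocessing used in this work: word-level unigrams and bigrams, the unsmoothed weighting tf × log(N / df), the helper names, and the English example complaints are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Split a complaint into lowercase word n-grams (assumed: unigrams and bigrams)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(docs, max_features=None):
    """Map each n-gram to a column index, keeping only the most frequent
    n-grams when max_features is set (None keeps the whole vocabulary)."""
    totals = Counter()
    for doc in docs:
        totals.update(word_ngrams(doc))
    # Sort by descending corpus frequency, then alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    if max_features is not None:
        ranked = ranked[:max_features]
    return {gram: column for column, (gram, _) in enumerate(ranked)}


def bag_of_ngrams(docs, vocab):
    """Count matrix: one row per document, one column per vocabulary n-gram."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for gram in word_ngrams(doc):
            if gram in vocab:
                row[vocab[gram]] += 1
        rows.append(row)
    return rows


def tf_idf(count_rows):
    """Reweight counts by log(N / df), so n-grams found in every document get weight 0."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0]) if count_rows else 0
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in count_rows]


if __name__ == "__main__":
    # Hypothetical chief complaints, invented for the example.
    complaints = ["chest pain", "abdominal pain", "chest pain fever"]
    vocab = build_vocabulary(complaints, max_features=3)
    counts = bag_of_ngrams(complaints, vocab)
    print(vocab)
    print(counts)
    print(tf_idf(counts))
```

In this toy corpus the word "pain" appears in every complaint, so its TF-IDF weight is zero while its raw count is not. Library implementations commonly use smoothed or normalised variants of this weighting, so their values would differ from this sketch.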


A.4 Textual and Baseline Results


Table A.4: Results for the machine learning models in test using the main chief complaint and baseline variables with their respective hyper-parameters. Results for the performance metrics are illustrated as Performance [95% Bootstrap Confidence Interval].

| Features | Model | Hyper-Parameters | AUC-ROC | Sensitivity | Specificity | F1-Score | Precision | Cohen's Kappa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bag of N-Grams | LR | C = 0.01; Regularization: L2; max features: All Vocabulary | 0.789 [0.783, 0.795] | 0.684 [0.673, 0.715] | 0.747 [0.717, 0.754] | 0.232 [0.218, 0.238] | 0.139 [0.129, 0.144] | 0.152 [0.136, 0.159] |
| Bag of N-Grams | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.5; Regularization: L2; Number of Iterations: 107; max features: 9500 | 0.783 [0.777, 0.788] | 0.69 [0.671, 0.705] | 0.725 [0.711, 0.751] | 0.22 [0.213, 0.232] | 0.131 [0.126, 0.14] | 0.138 [0.13, 0.152] |
| Bag of N-Grams | MNB | α = 0.5 | 0.764 [0.758, 0.768] | 0.675 [0.656, 0.697] | 0.73 [0.708, 0.756] | 0.219 [0.21, 0.231] | 0.13 [0.124, 0.14] | 0.137 [0.127, 0.152] |
| Bag of N-Grams | CNB | α = 0.5 | 0.764 [0.757, 0.768] | 0.684 [0.662, 0.7] | 0.714 [0.696, 0.743] | 0.212 [0.205, 0.222] | 0.125 [0.121, 0.141] | 0.129 [0.118, 0.141] |
| Bag of N-Grams | SVM | C = 0.001; Regularization: L2; max features: All Vocabulary | 0.789 [0.783, 0.794] | 0.696 [0.675, 0.709] | 0.732 [0.718, 0.754] | 0.226 [0.22, 0.236] | 0.135 [0.131, 0.143] | 0.145 [0.138, 0.158] |
| Bag of N-Grams | SVM SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.01; Regularization: L2; Number of Iterations: 107; Number iterations no change: 10; max features: All Vocabulary | 0.733 [0.725, 0.74] | 0.625 [0.612, 0.645] | 0.722 [0.696, 0.742] | 0.2 [0.191, 0.208] | 0.119 [0.113, 0.125] | 0.116 [0.105, 0.126] |
| TF-IDF | LR | C = 0.1; Regularization: L2 | 0.791 [0.785, 0.797] | 0.686 [0.668, 0.7] | 0.746 [0.727, 0.761] | 0.232 [0.223, 0.242] | 0.139 [0.133, 0.147] | 0.152 [0.142, 0.163] |
| TF-IDF | LR SGD | α = 10^-5; Adaptive Learning Rate; η0 = 0.1; Regularization: L1; Number of Iterations: 107; max features: All Vocabulary | 0.789 [0.784, 0.795] | 0.687 [0.663, 0.717] | 0.741 [0.713, 0.775] | 0.229 [0.218, 0.244] | 0.137 [0.129, 0.15] | 0.148 [0.135, 0.167] |
| TF-IDF | MNB | α = 0.1 | 0.772 [0.766, 0.777] | 0.685 [0.668, 0.701] | 0.718 [0.702, 0.736] | 0.214 [0.208, 0.223] | 0.127 [0.122, 0.133] | 0.132 [0.124, 0.141] |
| TF-IDF | CNB | α = 0.1 | 0.771 [0.765, 0.776] | 0.683 [0.686, 0.706] | 0.718 [0.692, 0.741] | 0.214 [0.206, 0.224] | 0.127 [0.121, 0.134] | 0.131 [0.121, 0.143] |
| TF-IDF | SVM | C = 0.01; Regularization: L2 | 0.791 [0.785, 0.796] | 0.692 [0.667, 0.702] | 0.739 [0.736, 0.763] | 0.23 [0.225, 0.242] | 0.138 [0.134, 0.147] | 0.149 [0.145, 0.164] |
| TF-IDF | SVM SGD | α = 0.5; Constant Learning Rate; η0 = 0.1; Regularization: L1; Number of Iterations: 107; Number iterations no change: 10; max features: All Vocabulary | 0.729 [0.723, 0.735] | 0.625 [0.611, 0.649] | 0.712 [0.687, 0.728] | 0.195 [0.187, 0.202] | 0.115 [0.11, 0.121] | 0.11 [0.101, 0.117] |

LR - Logistic Regression; SGD - Stochastic Gradient Descent; MNB - Multinomial Naive Bayes; CNB - Complement Naive Bayes; SVM - Support Vector Machine; AUC-ROC - Area Under Curve-Receiver Operating Characteristic; TF-IDF - Term Frequency-Inverse Document Frequency
