



International Journal of Medical Informatics

journal homepage: www.elsevier.com/locate/ijmedinf

Identifying incidental findings from radiology reports of trauma patients: An evaluation of automated feature representation methods

Gaurav Trivedi a,c,⁎, Charmgil Hong c, Esmaeel R. Dadashzadeh b,d, Robert M. Handzel d, Harry Hochheiser a,b, Shyam Visweswaran a,b

a Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States
b Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States
c School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, United States
d Department of Surgery, University of Pittsburgh, Pittsburgh, PA, United States

ARTICLE INFO

Keywords: Automated feature representations; Radiology reports; Incidental findings; Word embeddings; Convolutional neural networks

ABSTRACT

Background: Radiologic imaging of trauma patients often uncovers findings that are unrelated to the trauma. These are termed incidental findings, and identifying them in radiology examination reports is necessary for appropriate follow-up. We developed and evaluated an automated pipeline to identify incidental findings at the sentence and section levels in radiology reports of trauma patients.

Methods: We created an annotated dataset of 4,181 reports and investigated automated feature representations including traditional word and clinical concept (such as SNOMED CT) representations, as well as word and concept embeddings. We evaluated these representations by using them with traditional classifiers such as logistic regression and with deep learning methods such as convolutional neural networks (CNNs).

Results: The best performance was observed using word embeddings with CNNs, with F1 scores of 0.66 and 0.52 at the section and sentence levels, respectively. The F1 score was statistically significantly higher for sections compared to sentences (Wilcoxon; Z < 0.001, p < 0.05). Compared to using words alone, the addition of SNOMED CT concepts did not improve performance. At the sentence level, the F1 score improved significantly from 0.46 to 0.52 when using pre-trained embeddings (Wilcoxon; Z < 0.001, p < 0.05).

Conclusion: The best performance was achieved by using embeddings with CNNs at both the sentence and section levels. This provides evidence that such a pipeline can accurately identify incidental findings in radiology reports in an automated manner.

1. Background and motivation

Trauma is a leading cause of morbidity and mortality, accounting for an estimated 79,000 deaths each year among people younger than 45 years [1]. Assessment of injuries in trauma patients relies on extensive radiologic imaging that includes whole-body computed tomography (CT) and magnetic resonance imaging (MRI) scans. While invaluable in demonstrating the extent of injuries, whole-body imaging often uncovers findings, such as occult masses, lesions, and anatomic anomalies, that are unrelated to the trauma. These unrelated findings are termed incidental findings [2]. They range from an inconsequential renal cyst to a potentially life-threatening lung nodule (see Fig. 1). About 40% of all incidental findings have sinister features that warrant follow-up and treatment [3]. Members of the trauma team are responsible for reading radiology examination reports, identifying incidental findings, assessing their clinical significance, and communicating the information to the patient and other physicians. Automated methods to identify incidental findings in radiology reports can be invaluable at busy trauma centers.

Natural language processing (NLP) techniques enable automatic identification and extraction of information from radiology reports. Applications of NLP to radiology reports include retrieval of reports that document a specific condition or set of conditions [4], information extraction such as follow-up recommendations [5,6], and summarization [7]. In a recent systematic review, Pons et al. [8] comprehensively catalog NLP applications in radiology. A variety of NLP pipelines have been developed to process clinical text reports and extract features. The steps in these pipelines often include section and sentence segmentation followed by tokenization and normalization. Subsequent steps include enrichment with syntactic (linguistic) and semantic (often based on specialized biomedical lexicons) annotations. Traditionally, both rule-based and machine learning methods have been applied to the extracted features for classification or information extraction tasks. Rule-based systems define a set of conditions on features to classify reports. For example, Dutta et al. [5] employed keyword-based rules to identify reports with relevant recommendations, and Elkin et al. developed term-based rules to identify pneumonia in radiology reports [4]. More recently, machine learning approaches have been adopted because, although rules are easy to understand, they are difficult to maintain and often perform worse than machine learning systems. While past machine learning approaches have employed traditional classification methods such as logistic regression and support vector machines, deep learning methods are increasingly used because they can identify relevant terms in free text without the substantial preprocessing needed by traditional classification methods [9]. Cai et al. [10] provide a comprehensive survey of common NLP pipelines and methods that have been applied to radiology reports.

https://doi.org/10.1016/j.ijmedinf.2019.05.021
Received 17 September 2018; Received in revised form 7 March 2019; Accepted 21 May 2019

⁎ Corresponding author.
E-mail addresses: [email protected] (G. Trivedi), [email protected] (C. Hong), [email protected] (E.R. Dadashzadeh), [email protected] (R.M. Handzel), [email protected] (H. Hochheiser), [email protected] (S. Visweswaran).

International Journal of Medical Informatics 129 (2019) 81–87
1386-5056/ © 2019 Elsevier B.V. All rights reserved.

Past work on automated identification of incidental findings in radiology reports is scant. In a corpus of 573 radiology reports related to thromboembolic diseases, Pham et al. [11] applied support vector machine and maximum entropy classifiers to identify incidental findings. They obtained F1 scores of 0.57 at the report level and 0.80 when retaining only the results and conclusion sections of the report. Using a corpus of 661 radiology reports, Johnson [12] applied a combination of machine learning methods and hand-crafted rules to identify incidental findings, and obtained an F1 score of 0.69 at the report level. In these studies, identification was done only at the report level and would need additional manual steps to identify the sentences that represent incidental findings.

In this paper, we focus on identifying incidental findings at the sentence and section levels in radiology reports. We explore automated feature representation methods including traditional words and clinical concepts (from a standard clinical vocabulary, the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT)), as well as word and concept embeddings. Dense vector embeddings used in conjunction with deep neural networks have been shown to be particularly effective in NLP tasks [13–15]. A word embedding represents each word in the corpus by a vector of real numbers [16]. This vector may encode information about the word's meaning derived from the contexts in which it appears [17]. These embeddings may be trained in an unsupervised manner using a large corpus. Clinical concept embedding differs from word embedding only in that clinical concepts replace words [18]. In the following sections, we describe our data and experimental setup to compare classifier performance using these feature representation methods.
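To make the embedding idea concrete, the following sketch shows how similarity among vectors can track semantic similarity. The four-dimensional vectors below are made up for illustration; real word2vec vectors are learned from a corpus and typically have 50–100 dimensions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; real ones come from word2vec training.
emb = {
    "nodule": [0.9, 0.1, 0.0, 0.2],
    "mass":   [0.8, 0.2, 0.1, 0.3],
    "femur":  [0.1, 0.9, 0.8, 0.0],
}

# Words used in similar contexts end up with similar vectors,
# so "nodule" lies closer to "mass" than to "femur".
assert cosine(emb["nodule"], emb["mass"]) > cosine(emb["nodule"], emb["femur"])
```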

2. Methods

2.1. Data and annotation

We obtained 170,052 radiology reports for trauma patients at a major academic medical center. The reports were de-identified and stripped of explicit identifiers regarding imaging modalities using software from DE-ID Data Corp [20]. Using approximate regular expression rules to identify the imaging modality, we estimate that these include about 47K CT scans, 86K X-ray reports, and 10K ultrasound and MRI reports each. The remainder included studies from interventional radiology, fluoroscopy, nuclear medicine, etc.

Fig. 1. A de-identified radiology report of CT imaging in a patient with trauma. It revealed a nodule in the left lung as an incidental finding (underlined).

To create an annotated dataset, two trauma physicians (E.R.D. and R.M.H.) annotated 4,181 radiology reports for incidental findings using a custom annotation tool. The remaining 165,871 reports constituted the unannotated dataset for training embeddings. The annotators selected phrases or full sentences as incidental findings, and annotations were allowed to cross sentence boundaries as needed. Annotators focused on two types of incidental findings that are recommended for follow-up: lesions suspected to be malignant and arterial aneurysms meeting specified size and location criteria. Table 1 provides the detailed annotation guidelines that were used by the physicians.

We were motivated by an interest in guiding physicians directly to specific mentions of incidental findings. Thus, we explored identification of incidental findings at finer text-resolution levels by processing the reports to extract sentences and sections. We extracted individual sentences using spaCy (a Python NLP library suitable for large-scale information extraction tasks; https://spacy.io [21]). A sentence was labeled positive if a phrase in it or the entire sentence was selected by the annotators. Sections were extracted by applying regular expressions to identify section headings. A section was marked positive if it contained one or more sentences with incidental findings.
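The section-extraction and labeling steps can be sketched as follows. The heading pattern, report text, and annotated span are hypothetical, since the actual regular expressions are tailored to the institution's report formats; sentence segmentation with spaCy is omitted here.

```python
import re

# Hypothetical heading pattern: an all-caps word run followed by a colon
# at the start of a line. Real report formats vary by institution.
HEADING = re.compile(r"^[A-Z][A-Z /]+:", re.MULTILINE)

report = (
    "FINDINGS: There is a 2 cm nodule in the left lung. No acute fracture.\n"
    "IMPRESSION: Incidental pulmonary nodule; recommend follow-up CT."
)

def split_sections(text):
    """Split a report into sections at lines that look like headings."""
    starts = [m.start() for m in HEADING.finditer(text)] + [len(text)]
    return [text[a:b].strip() for a, b in zip(starts, starts[1:])]

# A section is marked positive if it contains an annotated span.
annotated_spans = ["2 cm nodule in the left lung"]  # toy annotation
sections = split_sections(report)
labels = [any(span in sec for span in annotated_spans) for sec in sections]
# labels == [True, False]
```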

2.2. Feature representations

We explored feature representations that consisted of words (Words-only) and words augmented with clinical concepts (Words+Concepts). We also investigated word and concept embeddings, which represent each word or concept as a vector such that similarity among vectors correlates with semantic similarity.

For the Words-only representation, we converted all text to lowercase and removed headings, newlines, non-alphanumeric characters, and common English stopwords using the Natural Language Toolkit (NLTK, a toolkit for symbolic and statistical NLP; https://www.nltk.org/ [22]). In the Words+Concepts representation, we augmented the Words-only representation with SNOMED CT vocabulary concepts. We extracted clinical concepts in reports using NOBLE Coder, which automatically identifies concepts in free text based on a standard vocabulary (http://noble-tools.dbmi.pitt.edu/ [23]). Table 2 shows example sentences where SNOMED CT concepts may be useful in identifying incidental findings. We hypothesized that the addition of clinical concepts would reduce variability in textual features; for example, "cardiomegaly" and "enlarged heart" would be mapped to the same concept.
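The Words-only preprocessing can be sketched with the standard library alone; the study used NLTK's English stopword list, so the abbreviated list below is for illustration only.

```python
import re

# Abbreviated stopword list for illustration; the paper used NLTK's list.
STOPWORDS = {"the", "is", "of", "a", "in", "there"}

def words_only(text):
    """Lowercase, keep only alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = words_only("There is a 2 cm, partially calcified nodule in the thyroid.")
# tokens == ['2', 'cm', 'partially', 'calcified', 'nodule', 'thyroid']
```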

We used a bag-of-words model with term frequency-inverse document frequency (TF-IDF) [24] weighting to obtain vector representations at the section and sentence levels for use with traditional machine learning classifiers. We compared this high-dimensional, sparse TF-IDF representation with denser word embeddings for training convolutional neural networks (CNNs). Word embedding methods are a class of unsupervised methods that derive dense vector representations of words from a large text corpus. We used a word2vec model that is trained by predicting context words from a target word [17], and compared three different schemes for generating these embeddings [25]:

(a) Random: Word vectors were initialized randomly.
(b) Folds-only: The embeddings were created from only the training dataset. Since the experiments used a 5-fold cross-validation scheme, a distinct embedding was created for each fold.
(c) Pre-trained: The embedding was obtained from the unannotated dataset.

We used Gensim (a Python framework for efficient vector space modeling; https://pypi.org/project/gensim/ [26]) to train the word embeddings. Similarly, we explored the use of concept embeddings with Words+Concepts features. We downloaded the SNOMED CT concept embeddings published by Beam et al. [18], which were trained on a large corpus covering 108,477 medical concepts.
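For the traditional classifiers, the TF-IDF weighting can be sketched from first principles. The actual pipeline used scikit-learn; this stdlib version uses the textbook idf = log(N/df), which differs slightly from scikit-learn's smoothed formula, and the toy documents are made up.

```python
import math
from collections import Counter

docs = [
    ["incidental", "nodule", "lung"],
    ["nodule", "thyroid"],
    ["fracture", "femur"],
]

N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Map a tokenized document to {term: tf * idf}, idf = log(N / df)."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

vec = tfidf(docs[0])
# "nodule" appears in 2 of 3 documents, so it is down-weighted
# relative to "incidental", which appears in only 1.
assert vec["incidental"] > vec["nodule"]
```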

2.3. Experimental methods

We trained two sets of classifiers to predict sentences and sections describing incidental findings. The traditional classifiers included Naïve Bayes, random forest, logistic regression, and support vector machine (SVM), and were built using scikit-learn (a machine learning library in Python; https://scikit-learn.org/ [27]). We compared them with CNNs using the architecture described by Kim [25], as implemented in Keras (a deep learning library in Python; https://keras.io/ [28]).

We used a 5-fold cross-validation scheme to train and evaluate classifiers. We computed the F1 score, precision, recall, and area under the ROC curve (AUROC) for each fold. Table 3 shows the user-specified parameter settings we explored in our experiments. We picked the parameter settings that maximized the F1 score and report the results using these settings in Section 3. We conducted experiments to compare sentence and section classifiers, to compare the Words-only and Words+Concepts feature representations, and to compare pre-trained embeddings with random embeddings. Wilcoxon signed-rank and Kruskal-Wallis tests were used to statistically compare the F1 scores across the cross-validation folds.
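The fold-level metrics can be computed directly from confusion counts; a minimal sketch with toy labels (not the study's data) follows.

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = incidental finding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Toy fold: 3 true positives, of which 2 are found; 1 false alarm.
p, r, f = prf1([1, 1, 1, 0], [1, 1, 0, 1])
# p == r == f == 2/3
```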

3. Results

3.1. Annotations

The annotated dataset consisted of 4,181 reports, of which 439 (10.5%) contained at least one incidental finding. Table 4 shows the distribution of incidental findings at the sentence and section levels in the annotated dataset.

Table 1. Annotation guidelines, adapted from Sperry et al. [19]. Any lesion of malignant potential and any arterial aneurysm greater than a specified size was annotated.

Lesions:
  Brain: any solid lesion
  Thyroid: any lesion
  Bone: any osteolytic or osteoblastic lesion, not age-related
  Breast: any solid lesion
  Lung: any lesion
  Liver: any heterogeneous lesion
  Kidney: any heterogeneous lesion
  Adrenal: any lesion
  Pancreas: any lesion
  Ovary: any heterogeneous lesion
  Bladder: any lesion
  Prostate: any lesion
  Intraperitoneal/retroperitoneal: any free lesion

Arterial aneurysms:
  Thoracic aorta: ≥5 cm
  Abdominal aorta: ≥4 cm
  External iliac artery: ≥3 cm
  Common femoral artery: ≥2 cm
  Popliteal artery: ≥1 cm

Table 2. Two example sentences, with identified SNOMED CT concepts shown in bold and concepts that may signal incidental findings underlined. The first sentence is an example where the radiologist explicitly identifies an incidental finding. The second illustrates a finding that is incidental only in the context of a trauma patient; in a diagnostic radiological examination it would have been a regular finding.

  "Incidental note is made of a low-attenuation mass at the medial upper pole of the right kidney which may represent a renal cyst."

  "There is a 2 cm, partially calcified nodule in the right lobe of the thyroid gland."

An initial pilot set of 128 radiology reports was annotated by the two physicians independently, and the inter-annotator agreement (IAA), measured using Cohen's kappa statistic [29], was 0.73. After review and deliberation, the annotation guidelines were revised, and a second pilot set of 144 radiology reports was annotated. This resulted in a revised IAA of 0.83. Each of the remaining 4,053 reports was annotated by a single physician using the revised annotation scheme.
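Cohen's kappa corrects observed agreement for the agreement expected by chance; a minimal sketch with toy binary labels (not the study's annotations) follows.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

# Toy labels (1 = report contains an incidental finding).
rater1 = [1, 1, 0, 0, 0, 1, 0, 0]
rater2 = [1, 0, 0, 0, 0, 1, 0, 1]
k = cohens_kappa(rater1, rater2)
# Observed agreement is 0.75, chance agreement 0.53125, so k ≈ 0.467.
```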

3.2. Classifier performance

The performance of classifiers at the sentence and section levels, in terms of F1 score, precision, recall, and AUROC, is shown in Table 5.

3.2.1. Performance at sentence and section levels

A sentence contained 8.04 words on average (SD = 6.6), while a section contained 38.14 words on average (SD = 61.9). Traditional classifiers with the TF-IDF representation had an average vocabulary size of 7.6K and 7.8K words in each training fold for sentences and sections, respectively. The F1 scores over 5 folds were statistically significantly higher for sections than for sentences (Wilcoxon; Z < 0.001, p < 0.05). The best F1 scores were 0.66 and 0.52 for sections and sentences, respectively, and were obtained using CNNs with word embeddings. We also observed balanced precision and recall scores for CNNs.

3.2.2. Comparison of Words-only and Words+Concepts representations

We counted an average of 2.9 concepts (SD = 2.9) per sentence and found 27,906 sentences without any concepts. For sections, we counted an average of 13.12 concepts (SD = 23.8) per section and found 4,216 empty sections.

There was no statistically significant difference between the F1 scores of the Words-only and Words+Concepts representations (Wilcoxon; Z = 31.0, p > 0.5; see Fig. 2). The best F1 scores for sections were around 0.65 for both representations.

When using the published embedding for SNOMED CT concepts [18], we found that its coverage of our dataset was low: it did not contain embeddings for over a third of the concepts. We did not train our own concept embeddings in this work.

Table 3. Implementation details for feature representations and classifiers.

Representation: TF-IDF (scikit-learn [27], version 0.20)
  Naïve Bayes (scikit-learn): Bernoulli Naïve Bayes, default settings
  Random forest (scikit-learn): number of trees = 100; split metric = Gini; minimum samples at a leaf node = 20
  Logistic regression (scikit-learn): L2 regularization coefficient determined by 3-fold internal cross-validation
  SVM (scikit-learn): L2 regularization coefficient determined by 3-fold internal cross-validation

Representation: Word2vec (Gensim [26], version 3.6.0)
  Dimension = {50, 75, 100}; minimum word count = 3; window size = {5, 10, 15}; epochs = 150

Classifier: CNN (Keras [28], version 2.2.4)
  Best settings determined by searching over: hidden dimension = 75; filter sizes = {3, 5, 7}; number of filters = 25; batch size = 32

Table 4. Distribution of positives, words, and concepts (mean ± standard deviation) at the sentence and section levels in the annotated dataset. Positives denote the raw count of sentences or sections that contained one or more incidental findings (with percentages).

             Total     Positives       Words         Concepts
Sentences    110,354   1,276 (1.15%)   8.0 ± 6.6     2.9 ± 2.9
Sections     23,302    661 (2.83%)     38.1 ± 61.9   13.1 ± 23.8

Table 5. F1 scores, precision, recall, and AUROC values at the section and sentence levels from 5-fold cross-validation (mean ± standard deviation). The best F1 scores are highlighted. Words+Concepts includes both words and extracted SNOMED CT concepts as features. Results for traditional methods are reported with a bag-of-words model using TF-IDF. CNNs were trained using dense word embeddings. CNN Random: embeddings are randomly initialized. CNN Folds-only: embeddings are trained on the fly using the training set in each cross-validation fold from the annotated dataset. CNN Pre-trained: embeddings are trained using the unannotated dataset. NC denotes not computed: pre-trained embeddings were not available for all SNOMED CT concepts.


3.2.3. Comparison of word embeddings

We compared CNNs with word embeddings that were trained using the Random, Folds-only, and Pre-trained initializations. The Pre-trained word embeddings consisted of 24,034 distinct words (20M total), while the Folds-only embeddings ranged from 6,832 to 6,900 distinct words (488K to 491K total) across the 5 folds.

There was no statistically significant difference in performance across the three embeddings overall (Kruskal-Wallis; H = 0.42, p = 0.82). However, at the sentence level the Pre-trained embedding showed a statistically significant improvement over the Random embedding (Wilcoxon; Z < 0.001, p < 0.05; see Fig. 3), with the F1 score improving from 0.46 to 0.52.

4. Discussion

High-performance automated methods to identify text spans with incidental findings in radiology reports could be particularly useful in freeing members of busy trauma teams to focus on urgent clinical activities. We developed and evaluated several feature representations and classifiers to identify incidental findings at the section and sentence levels in radiology reports of trauma patients. We annotated a corpus of over 4,000 reports for this task. We compared Words-only and Words+Concepts feature sets, and evaluated them using both the traditional TF-IDF representation and word embeddings. In addition to using a much larger training set than prior work (4,181 vs. 661 reports [12]), our exploration of multiple approaches provides significant insight into the problem while suggesting interesting areas for future work.

Granularity and potential generalizability: We evaluated performance at both the section and sentence levels. Section-level performance was comparable to previously described report-level performance (F1 = 0.66 vs. 0.69 [12]). The feature representations used in our study eliminate the need for manual feature curation, presenting the possibility of easier transfer to other datasets. Our examination of classification at the level of individual sentences and sections was motivated by an interest in guiding physicians directly to specific mentions of incidental findings. Although sentence-level performance was lower than section-level performance (F1 = 0.52 vs. 0.66), these results suggest that retrained classifiers, perhaps informed by detailed error analysis, might perform better. An initial error analysis of sentence-level results found difficulties with sentence boundary detection. Replacing the basic spaCy sentence segmentation tool with a customized version tuned to the idiosyncrasies of clinical narratives might improve performance.

Words vs. concepts: The comparison of the Words-only representation with Words+Concepts explored the utility of adding features from standard vocabularies such as SNOMED CT. The limited set of concepts (lesions and aneurysms) in the annotation guidelines (Table 1) presented the possibility that a few concepts might be highly predictive of incidental findings. However, the combined Words+Concepts features did not perform better than the basic Words-only features. The combination of the highly skewed dataset and the wide variation in the distribution of incidental findings (Table 4) may have contributed to these results (25% of sentences and 18% of sections had no concepts identified). Future work with alternative concept extraction tools and with alternative vocabularies such as RadLex [30] may improve performance when using concepts.

Embeddings: Use of word embeddings with deep learning improved performance at the sentence level by a good margin (the F1 score improved from 0.46 to 0.52). However, we did not see such improvements at the section level. This may be due to the limitations of the CNN architecture for modeling long-distance dependencies [31]. Temporal models such as recurrent neural networks [32,33] with attention mechanisms [34] have the potential to provide better results for longer texts. Pre-trained embeddings created from 165K reports (24K distinct words) are a relatively small training sample for deep learning. Embeddings trained on a much larger corpus might yield better performance. Future work may explore alternative methods of generating embeddings such as GloVe [35].

Fig. 2. Comparison of the Words-only and Words+Concepts representations. Each cross-validation fold is shown in the plot, along with the mean and standard deviation denoted by horizontal bars.

The contextualized nature of incidental findings will likely present challenges to any automated extraction approach. For example, the classification of an observation of a tumor as incidental might depend on whether or not the patient or their physician was aware of the tumor. Similarly, simple kidney cysts might not be considered incidental if they are not serious enough to treat. Given potential disagreements between domain experts on these classifications, the performance of automated methods will necessarily be limited.

There are several limitations related to the dataset used in this study. It is substantially skewed, with about 2% positives, and has wide variation in the number of words and concepts across individual sentences and sections. A larger and/or less-skewed dataset might provide better training data. Up-sampling of positive cases and inclusion of data from additional institutions might increase the robustness of the training data. As our de-identification process stripped information regarding the type of imaging used (e.g., X-ray vs. CT), it is possible that performance might differ across imaging modalities. Examining these potential differences might be an interesting area for future work.

5. Conclusion

We developed several automated feature representations and evaluated their effect on classifier performance for the task of identifying sentences and sections in radiology reports that contain incidental findings. The best F1 scores were 0.52 and 0.66 at the sentence and section levels respectively, both achieved using word embeddings with CNNs. The inclusion of concepts from SNOMED CT did not lead to improved performance. Enhancements to the feature representations that we explored in this paper can form the basis of a future tool that will automatically identify and extract incidental findings. Potential clinical uses of such a tool include automated and accurate communication of information at patient handoffs [36].
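For reference, the F1 metric reported above is the harmonic mean of precision and recall on the positive class; a minimal check with scikit-learn (toy labels, not study data):

```python
from sklearn.metrics import f1_score

# Toy gold labels and predictions for a skewed binary task (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

# Precision = 2/3 (two of three predicted positives are correct),
# recall = 2/3 (two of three true positives are found), so F1 = 2/3.
score = f1_score(y_true, y_pred)
```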

Summary points

What was already known on the topic?

• Incidental findings in radiology examination reports can be identified using machine learning methods.

• Prior work demonstrated proofs of concept using small datasets and hand-crafted features.

What this study added to our knowledge?

• It is feasible to extract incidental findings at finer levels of resolution (sentence and section levels) with performance similar to that achieved at the report level in prior work.

• Including clinical concepts from SNOMED CT did not result in improved performance.

• Pre-trained word embeddings with convolutional neural networks (CNNs) hold promise for accurate and efficient identification of incidental findings.

Authors’ contributions

G.T. and R.M.H. developed the initial concept and obtained preliminary data. R.M.H. and E.R.D. annotated the radiology reports for incidental findings. G.T. and C.H. performed the experiments, had full access to the study data, and take responsibility for the integrity of the data and the accuracy of the analysis. G.T., E.R.D., H.H., and S.V. drafted the manuscript. All authors participated in the study design, analysis and interpretation, and provided critical revisions of the manuscript for important intellectual content.

Fig. 3. Comparison of CNNs with Random, Fold-only and Pre-trained embeddings. Each cross-fold is shown in the plot, along with mean and standard deviation denoted by horizontal bars.

G. Trivedi, et al. International Journal of Medical Informatics 129 (2019) 81–87

Funding

The research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R01LM012095 and a Provost Fellowship in Intelligent Systems at the University of Pittsburgh (awarded to G.T.). The content of the paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the University of Pittsburgh.

Conflicts of interest

The authors do not have any competing interests.

References

[1] C. DiMaggio, P. Ayoung-Chee, M. Shinseki, C. Wilson, G. Marshall, D.C. Lee, S. Wall, S. Maulana, H. Leon Pachter, S. Frangos, Traumatic injury in the United States: in-patient epidemiology 2000–2013, Injury 47 (2016) 1393–1403.

[2] B. Lumbreras, L. Donat, I. Hernández-Aguado, Incidental findings in imaging diagnostic tests: a systematic review, Br. J. Radiol. 83 (2010) 276–289. PMID: 20335439.

[3] E.K. Kroczek, G. Wieners, I. Steffen, T. Lindner, F. Streitparth, B. Hamm, M.H. Maurer, Non-traumatic incidental findings in patients undergoing whole-body computed tomography at initial emergency admission, Emerg. Med. J. 34 (2017) 643–646.

[4] P.L. Elkin, D. Froehling, D. Wahner-Roedler, B. Trusko, G. Welsh, H. Ma, A.X. Asatryan, J.I. Tokars, S.T. Rosenbloom, S.H. Brown, NLP-based identification of pneumonia cases from free-text radiological reports, AMIA Annu. Symp. Proc. (2008) 172–176.

[5] S. Dutta, W.J. Long, D.F.M. Brown, A.T. Reisner, Automated detection using natural language processing of radiologists' recommendations for additional imaging of incidental findings, Ann. Emerg. Med. 62 (2013) 162–169.

[6] L. Oliveira, R. Tellis, Y. Qian, K. Trovato, G. Mankovich, Follow-up recommendation detection on radiology reports with incidental pulmonary nodules, Stud. Health Technol. Inform. 216 (2015) 1028.

[7] D.J. Goff, T.W. Loehfelm, Automated radiology report summarization using an open-source natural language processing pipeline, J. Digit. Imaging 31 (2018) 185–192.

[8] E. Pons, L.M.M. Braun, M.G.M. Hunink, J.A. Kors, Natural language processing in radiology: a systematic review, Radiology 279 (2016) 329–343.

[9] M.C. Chen, R.L. Ball, L. Yang, N. Moradzadeh, B.E. Chapman, D.B. Larson, C.P. Langlotz, T.J. Amrhein, M.P. Lungren, Deep learning to classify radiology free-text reports, Radiology 286 (2017) 845–852.

[10] T. Cai, A.A. Giannopoulos, S. Yu, T. Kelil, B. Ripley, K.K. Kumamaru, F.J. Rybicki, D. Mitsouras, Natural language processing technologies in radiology research and clinical applications, Radiographics 36 (2016) 176–191.

[11] A.D. Pham, A. Neveol, T. Lavergne, D. Yasunaga, O. Clement, G. Meyer, R. Morello, A. Burgun, Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings, BMC Bioinform. 15 (2014) 266.

[12] E.B. Johnson, Methods in text mining for diagnostic radiology, Ph.D. thesis, Case Western Reserve University, 2016.

[13] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, J. Mach. Learn. Res. 12 (2011) 2493–2537.

[14] Y. Kim, Convolutional neural networks for sentence classification, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1746–1751.

[15] Y. Bengio, A. Courville, P. Vincent, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 1798–1828.

[16] Y. Goldberg, A primer on neural network models for natural language processing, CoRR abs/1510.00726 (2015).

[17] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, CoRR abs/1301.3781 (2013).

[18] A.L. Beam, B. Kompa, I. Fried, N.P. Palmer, X. Shi, T. Cai, I.S. Kohane, Clinical concept embeddings learned from massive sources of medical data, CoRR abs/1804.01486 (2018).

[19] J.L. Sperry, M.S. Massaro, R.D. Collage, D.H. Nicholas, R.M. Forsythe, G.A. Watson, G.T. Marshall, L.H. Alarcon, T.R. Billiar, A.B. Peitzman, Incidental radiographic findings after injury: dedicated attention results in improved capture, documentation, and management, Surgery 148 (2010) 618–624.

[20] D. Gupta, M. Saul, J. Gilbertson, Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research, Am. J. Clin. Pathol. 121 (2004) 176–186.

[21] M. Honnibal, M. Johnson, An improved non-monotonic transition system for dependency parsing, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 1373–1378.

[22] E. Loper, S. Bird, NLTK: the Natural Language Toolkit, Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics – Volume 1, ETMTNLP '02, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 63–70.

[23] E. Tseytlin, K. Mitchell, E. Legowski, J. Corrigan, G. Chavan, R.S. Jacobson, NOBLE – flexible concept recognition for large-scale biomedical natural language processing, BMC Bioinform. 17 (2016) 32.

[24] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 28 (1972) 11–21.

[25] Y. Kim, Convolutional neural networks for sentence classification, CoRR abs/1408.5882 (2014).

[26] R. Řehůřek, P. Sojka, Software framework for topic modelling with large corpora, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50.

[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.

[28] F. Chollet, et al., Keras, (2015) https://keras.io.

[29] J. Cohen, A coefficient of agreement for nominal scales, Educ. Psychol. Meas. 20 (1960) 37–46.

[30] J.L. Mejino Jr., D.L. Rubin, J.F. Brinkley, FMA-RadLex: an application ontology of radiological anatomy derived from the Foundational Model of Anatomy reference ontology, AMIA Annual Symposium Proceedings, American Medical Informatics Association, 2008, p. 465.

[31] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, arXiv preprint arXiv:1404.2188.

[32] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.

[33] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1724–1734.

[34] S. Gao, M.T. Young, J.X. Qiu, H.-J. Yoon, J.B. Christian, P.A. Fearn, G.D. Tourassi, A. Ramanthan, Hierarchical attention networks for information extraction from cancer pathology reports, J. Am. Med. Inform. Assoc. 25 (2018) 321–330.

[35] J. Pennington, R. Socher, C. Manning, GloVe: global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[36] G. Trivedi, Towards interactive natural language processing in clinical care, IEEE International Conference on Healthcare Informatics (ICHI) (2018) 448–449.
