

Rethink Transfer Learning in Medical Image Classification

Le Peng,¹ Hengyue Liang,² Gaoxiang Luo,¹ Taihui Li,¹ Ju Sun¹

¹ Department of Computer Science and Engineering, University of Minnesota, USA
² Department of Electrical and Computer Engineering, University of Minnesota, USA

{peng0347, liang656, luo00042, lixx5027, jusun}@umn.edu

Abstract

Transfer learning (TL) with deep convolutional neural networks (DCNNs) is crucial for modern medical image classification (MIC). However, the current practice of finetuning the entire pretrained model is puzzling, as most MIC tasks rely only on low- to mid-level features that are learned by up to mid layers of DCNNs. To resolve the puzzle, we perform careful empirical comparisons of several existing deep and shallow models, and propose a novel truncated TL method that consistently leads to comparable or superior performance and compact models on two MIC tasks. Our results highlight the importance of transferring the right level of pretrained visual features commensurate with the intrinsic complexity of the task.

1 Introduction

Transfer learning (TL) has become the norm for medical image classification (MIC) and segmentation using deep learning. In these tasks, deep convolutional neural networks (DCNNs) pretrained on large-scale source tasks (e.g., image classification on ImageNet (Deng et al. 2009) and 3D medical segmentation and classification (Chen, Ma, and Zheng 2019)) are often finetuned as backbone models for the target tasks; see Shin et al. (2016); van Tulder and de Bruijne (2016); Anthimopoulos et al. (2016); Huynh, Li, and Giger (2016); Antropova, Huynh, and Giger (2017); Ghafoorian et al. (2017); Jiang et al. (2018) for successes.

The key to TL is feature reuse across the source and target tasks, which leads to practical benefits such as fast convergence in training and good performance even if the training data in the target task are limited (Yosinski et al. 2014). Pretrained DCNNs extract increasingly more abstract visual features from the bottom to top layers: from low-level textures and edges, to mid-level blobs and parts, and then to high-level shapes and patterns (Zeiler and Fergus 2014). While shapes and patterns are crucial for recognizing and segmenting generic objects (see Fig. 1 (left)), medical pathologies often take the form of abnormal textures or blobs, which correspond to low- to mid-level features (see Fig. 1 (right)). So, intuitively, for most medical imaging tasks, we probably only need to finetune a reasonable number of bottom layers and do not need the top layers.


However, standard practice for MIC is at odds with this: for most MIC tasks, all layers of pretrained DCNNs are retained and finetuned. It is thus natural to ask how to bridge the gap between our intuition and the practice, and whether the full-transfer strategy is the best possible for MIC.

The pioneering work Raghu et al. (2019) partially addresses this on medical image classification (MIC), and empirically shows: in terms of classification performance, i) shallow models can perform on par with deep models, and ii) TL is no better than training from scratch. However, there are serious limitations in their experimentation: 1) Poor evaluation metric. In their tasks, the validation sets are highly imbalanced, i.e., #negatives ≫ #positives,¹ and the standard AUC (i.e., area-under-ROC, AUROC) metric that they use is well known to be a poor performance indicator in such scenarios; see Section 2 and Lever, Krzywinski, and Altman (2016); Saito and Rehmsmeier (2015); and Murphy (2021, Sec 5.1); 2) Abundant training data. They focus on two data-rich tasks—each class has at least thousands of training data points, but most MIC tasks have only limited data, when TL could have a unique edge (Shin et al. 2016; Ravishankar et al. 2016; Khalifa et al. 2019; Karimi, Warfield, and Gholipour 2020); and 3) Rigid TL methods. They only evaluate the full-transfer (all layers tuned from pretrained weights) and hybrid-transfer (i.e., Transfusion (Raghu et al. 2019): bottom layers finetuned from pretrained weights while top layers are coarsened and finetuned from random initialization) strategies, but miss great alternatives, as we show below.

In this paper, we challenge the conclusions of Raghu et al. (2019), and depict a picture of TL for MIC that is consistent with our intuition. Specifically, measuring performance using both AUROC and AUPRC (area-under-precision-recall-curve, which reflects performance well even under class imbalance; see Section 2), we find the following:

• For any model, shallow or deep, TL most often outperforms training from scratch, especially when the training set is small, reaffirming the performance benefit of TL;

• The best TL strategy is finetuning truncated pretrained models (i.e., top layers removed) up to layers commensurate with the level of features needed. This truncated TL method often leads to substantial accuracy gains and compact models.

¹In R-DR identification, positives make up 7.8% and 14.6% of the two validation sets, respectively (Gulshan et al. 2016). In CheXpert, positive samples make up about 16% only (Irvin et al. 2019).

Figure 1: (left) The feature hierarchy learned by typical DCNNs on computer vision datasets; (right) examples of medical pathologies in a chest x-ray. While high-level shapes and patterns are needed to identify the dog on the left, only low-level textures and/or mid-level blobs are necessary for detecting the medical pathologies on the right. (The left visualized features are adapted from Zeiler and Fergus (2014); the right chest x-ray is adapted from Nguyen et al. (2021).)

2 Highlights of Key Technical Points

Evaluation metrics: AUPRC vs. AUROC

For binary classification with a dominant negative class—common for MIC, the precision-recall curve (PRC) is well known to be more informative and indicative of the true performance than the ROC (Lever, Krzywinski, and Altman 2016; Saito and Rehmsmeier 2015). Even for mediocre classifiers that can only score the positives above the majority of the negatives, the TPRs quickly climb to high values for low FPRs, dictated by the large TNs at early cut-off points, leading to uniformly high and close AUROCs. So AUROC cannot tell good classifiers from mediocre classifiers. In contrast, PRC captures the granularity of performance by including precision, which is sensitive to the ranking of positives vs. all negatives, rendering AUPRC a favorable metric in the presence of class imbalance.

To quickly see the gap, consider a dataset consisting of 10 positives and 990 negatives for a rare disease. Assume Classifier A (CA) scores the positives uniformly over the top 12, and Classifier B (CB) scores the positives uniformly over the top 30. Intuitively, CA performs much better than CB, as they detect 1 TP case at the cost of 0.2 and 2 FP cases, respectively—this is captured by precision. Figure 2 presents the ROC and PRC curves for CA and CB: both CA and CB perform almost perfectly as told from the ROCs, but the PRCs reveal the substantial gap clearly—while CA is good, CB is unacceptable in medical diagnosis (CA: AUROC=0.999, AUPRC=0.990; CB: AUROC=0.991, AUPRC=0.354).
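To make this gap concrete in code, the following is a minimal sketch (not from the paper) that reconstructs two classifiers in the spirit of CA and CB and scores them with scikit-learn. The exact numbers depend on how the positives are placed within the top ranks and on the AUPRC estimator, so they only approximate the values quoted above; the qualitative picture is what matters.

```python
# Illustrative only: rank-based reconstruction of the CA/CB thought experiment.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ranked_scores(n_pos=10, n_neg=990, top_k=12):
    """Labels/scores where the positives are spread uniformly over the
    top_k highest-scoring samples; everything else is negative."""
    n = n_pos + n_neg
    labels = np.zeros(n, dtype=int)
    pos_ranks = np.linspace(0, top_k - 1, n_pos).round().astype(int)
    labels[pos_ranks] = 1
    scores = np.linspace(1.0, 0.0, n)  # rank 0 gets the highest score
    return labels, scores

for name, top_k in [("CA", 12), ("CB", 30)]:
    y, s = ranked_scores(top_k=top_k)
    print(f"{name}: AUROC={roc_auc_score(y, s):.3f}, "
          f"AUPRC={average_precision_score(y, s):.3f}")
# Both AUROCs come out near 0.99+, while CB's AUPRC falls far below CA's.
```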

In our experiments, the medical datasets are imbalanced, if not significantly, and hence we report both AUROC and AUPRC, while the prior work Raghu et al. (2019) only reports AUROC. As we show in Section 3, the use of AUPRC helps us to reaffirm the performance benefit of TL, against the conclusion made in Raghu et al. (2019).

Network architectures and training strategies

We study 4 types of neural networks: (1) Shallow networks: i.e., the CBR (convolution-batchnorm-relu) family proposed in Raghu et al. (2019), basic convolutional networks with 4 (CBR-LargeW, CBR-Small, CBR-Tiny) or 5 (CBR-LargeT) convolution layers composed of 5 × 5 or 7 × 7 filters; details in Appendix A of Raghu et al. (2019); (2) Deep networks: ResNet50 (He et al. 2016) and DenseNet121 (Huang et al. 2017), both popular in TL for MIC (Bressem et al. 2020; Irvin et al. 2019; Rajpurkar et al. 2017); (3) Truncated networks: ResNet50 (He et al. 2016) and DenseNet121 (Huang et al. 2017) truncated at different levels, including i) Res-T1, Res-T2, Res-T3, Res-T4, and Res-T5, which are ResNet50 truncated at the first 25 (50%),² 37 (74%), 43 (86%), 46 (92%), and 49 (98%) convolutional layers, respectively; and ii) Dens-T1, Dens-T2, and Dens-T3, which are DenseNet121 truncated at the first 1 (12%), 2 (32%), and 3 (88%) dense blocks, respectively; (4) Slim networks: large models such as DenseNet121 (Huang et al. 2017) with the channels in the top blocks halved; Dens-S1, Dens-S2, and Dens-S3 denote such coarsening from the first 1, 2, and 3 dense blocks, respectively. For all these models, the final fully-connected layers are adjusted or appended whenever necessary.
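As an illustration of how such truncation can be set up in practice, here is a minimal PyTorch/torchvision sketch (our assumption of an implementation, not the authors' released code): it keeps only the bottom dense blocks of an ImageNet-pretrained DenseNet121 and attaches a fresh pooling layer and fully-connected head. Whether a trailing transition layer is kept at the cut is a design choice.

```python
import torch
import torch.nn as nn
from torchvision import models

def truncated_densenet121(num_dense_blocks: int = 1, num_classes: int = 1) -> nn.Module:
    """Keep the DenseNet121 feature extractor up to (and including) the
    num_dense_blocks-th dense block, then add a new pooled linear head."""
    # torchvision >= 0.13 weights API; loads ImageNet-pretrained parameters.
    backbone = models.densenet121(weights="IMAGENET1K_V1")
    kept, n_blocks = [], 0
    for name, module in backbone.features.named_children():
        kept.append(module)
        if name.startswith("denseblock"):
            n_blocks += 1
            if n_blocks == num_dense_blocks:
                break
    trunk = nn.Sequential(*kept)
    with torch.no_grad():  # infer the channel width at the cut with a dummy pass
        feat_dim = trunk(torch.zeros(1, 3, 224, 224)).shape[1]
    return nn.Sequential(
        trunk,
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(feat_dim, num_classes),  # new head, trained from scratch
    )

dens_t1 = truncated_densenet121(num_dense_blocks=1)  # a Dens-T1-style model
```

Truncating ResNet50 at a given convolutional layer follows the same pattern: slice the pretrained feature extractor and re-infer the input width of the new head.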

We compare 4 training strategies: (1) Random initialization (RI): the network is trained with all weights randomly initialized; (2) Transfer learning (TL): pretrained network weights are finetuned; (3) Transfusion: slim models with bottom layers finetuned from pretrained weights while top layers are tuned from random initialization (i.e., slim models with partial transfer learning); (4) Truncated transfer learning: truncated models with all layers (excluding the final fully-connected layer) finetuned from pretrained weights.
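The strategies differ mainly in which weights start from the ImageNet checkpoint. A hedged sketch of the RI vs. TL setup (again our illustration under torchvision assumptions, with the slim/Transfusion construction omitted) is:

```python
import torch.nn as nn
from torchvision import models

# TL: all backbone weights start from the ImageNet checkpoint and are finetuned.
model_tl = models.densenet121(weights="IMAGENET1K_V1")
model_tl.classifier = nn.Linear(model_tl.classifier.in_features, 1)

# RI: the same architecture, but every weight is randomly initialized.
model_ri = models.densenet121(weights=None)
model_ri.classifier = nn.Linear(model_ri.classifier.in_features, 1)

# Truncated TL reuses truncated_densenet121(...) from the sketch above, with
# all retained (pretrained) layers finetuned together with the new head.
```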

²All the percentages here indicate the percentages of retained layers out of all layers in the respective deep models.



Figure 2: (left) ROC and PRC curves of two classifiers on the same binary classification problem. While the ROC curves are uniformly high and very close, their PRC curves are far apart. (right) Confusion table and definition of relevant terms for ROC and PRC curves.

All the pretrained models are trained on ImageNet, a standard image classification dataset in computer vision (Deng et al. 2009).

Datasets and setups

Image Classification. The prior work (Raghu et al. 2019) performs its experiments on CheXpert (Irvin et al. 2019) and Retina (Gulshan et al. 2016), but Retina is private. We evaluate our models and training strategies on CheXpert (Irvin et al. 2019) and a private COVID-19 chest x-ray dataset.

CheXpert consists of 224,316 chest x-rays of 65,240 patients, with binary labels for each of the 13 pathologies for each x-ray³: No Finding (NF), Enlarged Cardiomediastinum (EC), Cardiomegaly (Ca.), Lung Opacity (LO), Lung Lesion (LL), Edema (Ed.), Consolidation (Co.), Pneumonia (Pn.), Atelectasis (At.), Pneumothorax (Pt.), Pleural Effusion (PE), Pleural Other (PO), and Fracture (Fr.). We randomly divide the dataset into training (160,858), validation (17,875), and test (44,684) sets. The setting is different from that of Raghu et al. (2019) and we do not directly compare with their results, as: 1) they test on the official validation set, which only consists of 200 x-rays and also misses several pathologies; we prefer the much larger test set constructed above for stable and comprehensive evaluation; 2) our purpose is not to obtain or surpass the state-of-the-art results on the dataset, but to evaluate the models and training strategies listed above in a consistent manner; and 3) they do not provide sufficient training details or code for easy reproducibility.
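As a concrete illustration of the label handling (the U-Zeros policy noted in the footnote referenced above), the following hedged pandas sketch maps uncertain (-1) and missing entries in the public CheXpert CSV to 0 and implicitly drops Support Devices by not selecting it; the file path is a placeholder.

```python
# Sketch under our assumptions about the public CheXpert CSV layout;
# not the authors' preprocessing code.
import pandas as pd

PATHOLOGIES = ["No Finding", "Enlarged Cardiomediastinum", "Cardiomegaly",
               "Lung Opacity", "Lung Lesion", "Edema", "Consolidation",
               "Pneumonia", "Atelectasis", "Pneumothorax", "Pleural Effusion",
               "Pleural Other", "Fracture"]

df = pd.read_csv("CheXpert-v1.0/train.csv")          # placeholder path
labels = (df[PATHOLOGIES]
          .fillna(0)        # unmentioned -> negative
          .replace(-1, 0)   # uncertain -> negative (U-Zeros)
          .astype(int))
```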

Our COVID-19 dataset comprises 3,998 COVID-positive and 41,515 COVID-negative frontal chest x-rays. We focus on prospective evaluation: training and validation are performed on x-rays dated before June 30th, 2020, and testing is performed on x-rays dated after. By this splitting strategy, our training set has 1,777 positives and 29,030 negatives (imbalance ratio 1:16.3), the validation set has 444 positives and 7,257 negatives, and the test set has 1,777 positives and 5,228 negatives (imbalance ratio 1:2.94)—the drifting imbalance ratio is a salient feature of pandemic data such as COVID-19.

³We map all "Uncertainty" labels into "0" (negative, i.e., the U-Zeros model in Irvin et al. (2019)). We omit the "Support Devices" class as it is different from typical medical pathologies.

All the experiments follow the same training protocol: the input images are cropped and resized to 224×224, with slight random rotation for data augmentation. For training, Adam (Kingma and Ba 2014) is used with an initial learning rate of 0.00001 and a mini-batch size of 64. A learning rate scheduler is also applied to encourage flexibility: when the AUPRC on the validation set stops improving, the learning rate is decreased by a factor of 0.5. Models are trained for 200 epochs, and only the model with the best validation performance during training is evaluated. To account for randomness, each experiment is independently repeated three times, and the mean and standard deviation of all performance scores are reported.
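For concreteness, the optimizer and learning-rate schedule described above might be set up as follows. This is a sketch under our assumptions: `model`, the data loaders, and the helpers `train_one_epoch` / `evaluate_auprc` are hypothetical placeholders, not code from the paper.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # initial LR 0.00001
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5)                      # halve LR when val AUPRC plateaus

best_auprc = 0.0
for epoch in range(200):
    train_one_epoch(model, train_loader, optimizer)          # hypothetical helper
    val_auprc = evaluate_auprc(model, val_loader)             # hypothetical helper
    scheduler.step(val_auprc)                                 # monitor validation AUPRC
    if val_auprc > best_auprc:                                # keep the best checkpoint
        best_auprc = val_auprc
        torch.save(model.state_dict(), "best_model.pt")       # placeholder path
```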

3 Transfer Learning (TL) or Not?

In this section, we compare TL and RI on deep and shallow models in data-rich and data-poor regimes, respectively. Both the CheXpert and COVID-19 datasets contain over 1,000 samples per class, which is relatively high for medical tasks, so we consider experiments on the original datasets to be in the data-rich regime. To simulate data-poor regimes, we focus on the COVID-19 dataset and subsample 5% and 10% of the original training set.

Data-rich regime

The scores on CheXpert are presented in Fig. 4 (standard deviation: ≤ 0.017). While TL and RI perform comparably on most pathologies and models, in a number of cases on deep models TL outperforms RI by significant gaps; e.g., for ResNet50 on Pt., TL is 10% above RI as measured by AUPRC. Overall, on deep models, TL outperforms RI marginally on most pathologies and considerably on a number of pathologies, measured by both AUROC and AUPRC. On shallow models, TL and RI are very close by both metrics.


The scores on COVID-19 can be read off from Fig. 3 (i.e., "w. 100% data"; standard deviation ≤ 0.038). We observe a similar trend to that of CheXpert. TL leads to substantial performance gains on DenseNet121, and marginal gains or occasional losses on shallow models.

Overall, on both datasets and in all settings, the best-performing combination is most often TL, not RI, coupled with a certain model. AUPRC is a crucial tie-breaker when AUROCs are close.

Data-poor regime

As we alluded to in Section 1, TL is expected to benefit data-limited learning. To verify this, we simulate two small COVID-19 datasets: 1) 5% Data, which contains 1,539 cases (88 positives and 1,451 negatives); 2) 10% Data, which consists of 3,080 cases (177 positives and 2,903 negatives). Our training protocol and test data are exactly the same as in Section 3. The results are included in Fig. 3. We have two observations: 1) on all models except CBR-LargeT and CBR-LargeW, TL outperforms RI as measured by both AUROC and AUPRC; 2) CBR-Small+TL is a clear winner for 5% data, whereas for 10% data, AUROC and AUPRC point to CBR-LargeW and DenseNet121, respectively. This disparity highlights the need for reporting and comparing both metrics.

4 Truncated TL

We have confirmed the advantage of TL over RI, especially on deep models. It is natural to ask if we could have better ways of performing TL. Since we probably do not need high-level features, it is puzzling why in TL on deep models we need to keep the top layers at all. This motivates the truncated networks proposed in Section 2. In this section, we compare the performance of TL on these truncated models and their parent deep models, as well as selected CBR shallow models. We also compare truncated TL models with Transfusion models (slim models with partial TL) proposed in Raghu et al. (2019). The training and test protocols follow exactly those of Section 3.

Fig. 7 summarizes the results on our COVID-19 dataset. It is evident that Dens-T1 and Dens-T2, which are heavily truncated (up to 1/3 of the original depth) versions of DenseNet121, are the top two performant models when combined with TL. In contrast, Dens-T3 and the full DenseNet121 (Dens-F), with less aggressive truncation, can be substantially worse, and sometimes even worse than the shallow models. From COVID-19 medical studies, it is known that the salient radiomic features for COVID-19 are opacities and consolidation in the lung area, which only concern low-level textures and perhaps also mid-level blobs (Wong et al. 2020; Shi et al. 2020). This is a strong confirmation that only a reasonable number of bottom layers are needed for efficient TL. It is also worth mentioning that Dens-S1 and Dens-S3, which yield slightly higher AUROC scores than Dens-F, show worse AUPRC performance. This is probably due to the intrinsic imbalance of the COVID-19 dataset, making AUROC a weak indicator as illustrated in Section 2.

Fig. 5 presents the results on CheXpert. For the sake of simplicity, we only pick the best model from each network family and compare it with the baseline Dens-F model using TL. The values are reported as relative increases or decreases over the baseline performance (i.e., DenseNet121 with TL is the x-axis). The results show that Dens-T3 is the best model, although in most cases it is comparable to Dens-F. Note that compared to Dens-T2 or Dens-T1, which excelled on the COVID-19 dataset, Dens-T3 is far deeper (88% vs. 1/3 of the original depth). This disparity can again be explained by the feature hierarchy. In CheXpert, pathologies such as atelectasis and pneumothorax need relatively high-level features as they start to concern shapes, in contrast to the low- and mid-level features used in COVID-19. Another observation is that on EC, LL, PO, and Fr., the AUPRCs are very low (below 20%) although the corresponding AUROCs are all above 60%, or even mostly 70%. These are rare diseases in CheXpert with high imbalance ratios between positives and negatives (EC: 1:19.8, LL: 1:26.4, PO: 1:75.9, Fr.: 1:24.8). Even for the best models here, the AUROCs may be considered decent, but their actual performance, when precision is taken into account via AUPRC, is very poor.⁴ This reinforces our claim that AUPRC needs to be reported when evaluating classifiers on data with class imbalance.

Fig. 6 compares Dens-F with the truncated TL and Transfusion models (slim family) on the COVID-19 dataset. As read off the figure, truncated models almost always yield better performance than Transfusion models when reusing the same level of pretrained weights (e.g., up to block 1). More specifically, Dens-T1 surpasses the baseline performance of Dens-F with a 79% parameter reduction. For both curves, the performance drops after loading the pretrained weights up to block 3, which confirms our intuition that the top layers of DCNNs are not necessarily learning useful features. Rather, they may introduce highly predictive yet brittle features that consequently lead to poor generalization. Notably, our truncated models in most cases perform better than TL on full models, while enjoying the same advantage of faster convergence as Transfusion models.

5 Discussion

In this paper, we revisit transfer learning (TL) for medical image classification (MIC) on chest x-rays, taking into account characteristics of typical medical datasets such as class imbalance and limited data. By evaluating different TL strategies on a number of shallow and deep convolutional neural network models, we find that 1) TL does benefit the classification performance, especially in data-poor scenarios; and 2) transferring only truncated deep pretrained models, up to layers commensurate with the level of features needed for the classification task, leads to superior performance compared to conventional TL on deep and shallow models. During our experimentation, we have also highlighted the role of AUPRC in distinguishing good classifiers from mediocre ones under class imbalance. Our results support that low- and mid-level visual features are probably sufficient for typical MIC, and high-level features are needed only occasionally.

⁴However, note that it does not make sense to cross-compare AUPRCs across different imbalance ratios; see discussions in Section 5.1 of Murphy (2021) and Williams (2021).

Figure 3: TL vs. RI on COVID-19. With the full dataset, TL wins over RI on DenseNet121, and they are close in performance on shallow models. With 10% and 5% subsampled data only, TL outperforms RI on all but CBR-LargeT and CBR-LargeW.

Potential future directions include: 1) experimenting with other image modalities, such as CT and MRI images—if a similar conclusion holds, the truncated TL strategy can lead to a profound saving of computing resources for model training and inference on 3D medical data; 2) validating the conclusion on segmentation, another major family of imaging tasks, and on other diseases; 3) investigating TL on models directly trained on medical datasets, e.g., (Chen, Ma, and Zheng 2019) and the pretrained models released in the Nvidia Clara Imaging package⁵, rather than on computer vision datasets.

⁵https://developer.nvidia.com/clara



Figure 4: TL vs. RI on deep and shallow models for CheXpert. For almost all cases, RI performs on par with TL, but in a number of cases, TL outperforms RI by visible gaps, especially when measured by AUPRC.



Figure 5: Models compared with the baseline full DenseNet121 (y=0.00) on the CheXpert dataset. The dotted lines are the respective standard deviations of the baseline model across different diseases. Reported values represent the relative increase or decrease compared with the baseline.


Figure 6: Truncated DenseNet121 models (ours) and slim DenseNet121 models (proposed by Raghu et al. (2019)) on the COVID-19 dataset. The curves above show the trend of AUPRC. The bar plot below shows the number of parameters of the models. The leftmost column provides the baseline using the full model.

Figure 7: TL AUROC and AUPRC on COVID-19. Here we include all shallow models, and Dens-F is DenseNet121. Dens-T1 and Dens-T2 are among the best models.


References

Anthimopoulos, M.; Christodoulidis, S.; Ebner, L.; Christe, A.; and Mougiakakou, S. 2016. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging, 35(5): 1207–1216.

Antropova, N.; Huynh, B. Q.; and Giger, M. L. 2017. A deep feature fusion methodology for breast cancer diagnosis demonstrated on three imaging modality datasets. Medical Physics, 44(10): 5162–5171.

Bressem, K. K.; Adams, L. C.; Erxleben, C.; Hamm, B.; Niehues, S. M.; and Vahldiek, J. L. 2020. Comparing different deep learning architectures for classification of chest radiographs. Scientific Reports, 10(1): 1–16.

Chen, S.; Ma, K.; and Zheng, Y. 2019. Med3D: Transfer learning for 3D medical image analysis. arXiv preprint arXiv:1904.00625.

Deng, J.; Dong, W.; Socher, R.; Li, L.; Kai Li; and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255.

Ghafoorian, M.; Mehrtash, A.; Kapur, T.; Karssemeijer, N.; Marchiori, E.; Pesteie, M.; Guttmann, C. R.; de Leeuw, F.-E.; Tempany, C. M.; Van Ginneken, B.; et al. 2017. Transfer learning for domain adaptation in MRI: Application in brain lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 516–524. Springer.

Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M. C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J.; Kim, R.; Raman, R.; Nelson, P. C.; Mega, J. L.; and Webster, D. R. 2016. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22): 2402.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.

Huynh, B. Q.; Li, H.; and Giger, M. L. 2016. Digital mammographic tumor classification using transfer learning from deep convolutional neural networks. Journal of Medical Imaging, 3(3): 034501.

Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. 2019. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 590–597.

Jiang, Z.; Zhang, H.; Wang, Y.; and Ko, S.-B. 2018. Retinal blood vessel segmentation using fully convolutional network with transfer learning. Computerized Medical Imaging and Graphics, 68: 1–15.

Karimi, D.; Warfield, S. K.; and Gholipour, A. 2020. Critical Assessment of Transfer Learning for Medical Image Segmentation with Fully Convolutional Neural Networks. arXiv preprint arXiv:2006.00356.

Khalifa, N. E. M.; Loey, M.; Taha, M. H. N.; and Mohamed, H. N. E. T. 2019. Deep transfer learning models for medical diabetic retinopathy detection. Acta Informatica Medica, 27(5): 327.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lever, J.; Krzywinski, M.; and Altman, N. 2016. Classification evaluation. Nature Methods, 13(8): 603–604.

Murphy, K. P. 2021. Probabilistic Machine Learning: An Introduction. MIT Press.

Nguyen, H. Q.; Pham, H. H.; Nguyen, N. T.; Nguyen, D. B.; Dao, M.; Vu, V.; Lam, K.; and Le, L. T. 2021. VinBigData Chest X-ray Abnormalities Detection. https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection.

Raghu, M.; Zhang, C.; Kleinberg, J.; and Bengio, S. 2019. Transfusion: Understanding Transfer Learning for Medical Imaging. In Advances in Neural Information Processing Systems, volume 32, 3347–3357. Curran Associates, Inc.

Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.; Shpanskaya, K.; et al. 2017. CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225.

Ravishankar, H.; Sudhakar, P.; Venkataramani, R.; Thiruvenkadam, S.; Annangi, P.; Babu, N.; and Vaidya, V. 2016. Understanding the mechanisms of deep transfer learning for medical images. In Deep Learning and Data Labeling for Medical Applications, 188–196. Springer.

Saito, T.; and Rehmsmeier, M. 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3): e0118432.

Shi, H.; Han, X.; Jiang, N.; Cao, Y.; Alwalid, O.; Gu, J.; Fan, Y.; and Zheng, C. 2020. Radiological findings from 81 patients with COVID-19 pneumonia in Wuhan, China: a descriptive study. The Lancet Infectious Diseases, 20(4): 425–434.

Shin, H.-C.; Roth, H. R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; and Summers, R. M. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5): 1285–1298.

van Tulder, G.; and de Bruijne, M. 2016. Combining generative and discriminative representation learning for lung CT analysis with convolutional restricted Boltzmann machines. IEEE Transactions on Medical Imaging, 35(5): 1262–1272.

Williams, C. K. 2021. The Effect of Class Imbalance on Precision-Recall Curves. Neural Computation, 33(4): 853–857.


Wong, H. Y. F.; Lam, H. Y. S.; Fong, A. H.-T.; Leung, S. T.; Chin, T. W.-Y.; Lo, C. S. Y.; Lui, M. M.-S.; Lee, J. C. Y.; Chiu, K. W.-H.; Chung, T. W.-H.; et al. 2020. Frequency and distribution of chest radiographic findings in patients positive for COVID-19. Radiology, 296(2): E72–E78.

Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, 3320–3328.

Zeiler, M. D.; and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818–833. Springer.