Detection of fibrosing interstitial lung disease-suspected chest radiographs using a deep learning-based computer-aided detection system: a retrospective, observational study

STRENGTHS AND LIMITATIONS OF THIS STUDY

  • This was a retrospective, observational study.

  • This study used a chest radiograph dataset that was randomly extracted from the pooled radiograph images consecutively taken in multiple medical facilities with various degrees of referral, reflecting real-world clinical settings.

  • This study compared the detection capability of interpreters, including non-experts and experts, for identifying fibrosing interstitial lung disease (ILD)-suspected images without and with the assistance of the computer-aided detection system.

  • The main limitation is that the ground-truth label was determined based on signs of fibrosing ILD on chest radiographs and not on high-resolution CT images.

Introduction

Interstitial lung disease (ILD) is a group of diseases that cause various degrees of fibrosis and inflammation in the lungs.1 Patients with fibrosing ILD, including idiopathic pulmonary fibrosis (IPF), experience progressive disease,2 which can be slowed by antifibrotic drugs.3–5 Nintedanib, an antifibrotic drug, is also reported to decrease the incidence of acute exacerbation.4 5 However, these antifibrotic treatments cannot stop the progression of fibrosis completely; hence, early detection and intervention are crucial to stall the deterioration of lung function and prevent acute exacerbation.

Chest imaging is centrally important for the early identification of fibrosing ILD since abnormalities will sometimes be visible on chest radiographs for years before symptoms develop.6 However, it is challenging for non-expert physicians, including primary care physicians, to identify fibrosing ILD on chest radiographs, particularly in the early stages of disease.7 To address this, we previously developed a deep learning-based computer-aided detection (CAD) algorithm to detect fibrosing ILD on chest radiographs.8 We showed that the CAD algorithm could detect fibrosing ILD with high accuracy (area under the receiver-operating characteristic curve (ROC-AUC), 0.979), which was not inferior to detection by pulmonologists and radiologists. However, whether such CAD systems can significantly improve the detection capability of human doctors in clinical settings remains to be elucidated.

This study investigated the effectiveness of BMAX (Cosmotec, Tokyo, Japan), a deep learning-based CAD system for detecting fibrosing ILD on chest radiographs based on our previous development study,8 to assist in the identification of fibrosing ILD-suspected images by non-expert and expert physicians. We analysed the detecting capabilities of physicians using the ROC-AUC method without and with the assistance of BMAX and calculated conversion rates from incorrect to correct judgement of physicians using BMAX.

Materials and methods

Study design

This was a retrospective, observational study that used chest radiograph images taken consecutively over a period of 6 months or less in three medical institutions in Japan: Kurashiki Central Hospital, Okayama (a tertiary hospital); Hamana Hospital, Shizuoka (a community [non-tertiary] hospital); and Yokohama Gumyoji Respiratory Medicine Internal Clinic, Kanagawa (a primary care clinic).

BMAX is a deep learning-based CAD system developed for detecting chronic fibrosing ILD on chest radiographs that was based on our previous study.8 This CAD system calculates and outputs a numerical score (range: 0–1 to three decimal places (1.000 represents the strongest confidence)) that reflects the confidence of whether fibrosing ILD is present on an input radiograph image. It also presents an on-screen alert flag when the score exceeds the threshold value (0.299) (figure 1). BMAX V.2.0.4, which was used in the current study, is commercially available in Japan.

Figure 1

Representative screenshot without (A) and with (B) the output of BMAX, a computer-aided detection software.

Pooled radiograph dataset

Chest radiographs of patients aged 20 years or older with two or more visits that were taken during a consecutive period were accumulated at each participating institution. Images were taken in January 2007 at Kurashiki Central Hospital, from April to August 2017 at Hamana Hospital, and from October to December 2021 at Yokohama Gumyoji Respiratory Medicine Internal Clinic. Eligible radiographs had been taken in the standing position at a distance of 100–200 cm from the X-ray source to the film, had dimensions of at least 1750×1750 pixels, and were not excessively postprocessed. We excluded radiographs considered inappropriate for valid interpretation (eg, obscure images). When multiple images were available for an individual, only one radiograph was randomly selected. After inappropriate images were excluded, 1251 chest radiograph images remained in the pooled dataset (figure 2). Each chest radiograph image was assigned a reading image identification code.

Figure 2

Flowchart for dataset preparation. n, number of images.

Ground-truth labels

The ground-truth labels for each chest radiograph were determined by 3 expert ILD physicians (qualified pulmonologists with 15 or more years of experience). These three label makers independently interpreted pooled radiographs in a random order and categorised each image as either fibrosing ILD suspected (fibrosing ILD positive) or not (fibrosing ILD negative). For positive images, the label makers independently stratified each image by the proportion of fibrosing area (rate of fibrosis-infiltration area to the whole lung field; 0%, >0% to <25%, ≥25% to <50% or ≥50%) and recorded the location of fibrosis infiltration (predominant in the upper zone, limited in the upper zone, or predominant/limited in the middle or lower zone) for each side of the lung. Abnormal findings in negative images were also recorded.

Ground-truth labels and fibrosis location were determined by the majority vote of the three label makers; as an exception, when different location judgements were made by all three label makers, the location was determined as ‘no fibrosis’ (no study images met this exception). The proportion of fibrosing area was also determined by the majority vote of the three label makers; as an exception, when decisions were different among the three label makers, the maximum area was recorded as the final decision.

In the fibrosis area stratification analysis, the proportion score for each side of the lung was scored as 0, 1, 2, or 3 for the proportion of fibrosing area to the whole lung field of each side (0%, >0% to <25%, ≥25% to <50% and ≥50%, respectively). The overall area proportion score (range: 0–6) was calculated as the sum of the proportion score of each side of the lung.
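This scoring can be illustrated with a minimal Python sketch (the function names are ours for illustration, not part of the study's analysis code); each side's fibrosing-area fraction maps to a 0–3 score, and the overall score is the sum over both lungs:

```python
def side_score(fibrosing_fraction: float) -> int:
    """Map one lung's fibrosing-area fraction (0.0-1.0) to its proportion score:
    0% -> 0, >0% to <25% -> 1, >=25% to <50% -> 2, >=50% -> 3."""
    if fibrosing_fraction <= 0.0:
        return 0
    if fibrosing_fraction < 0.25:
        return 1
    if fibrosing_fraction < 0.50:
        return 2
    return 3


def overall_area_score(right_fraction: float, left_fraction: float) -> int:
    """Overall area proportion score (range 0-6): sum of the per-side scores."""
    return side_score(right_fraction) + side_score(left_fraction)
```

For example, a right lung with a 30% fibrosing area (score 2) and a left lung with 10% (score 1) would yield an overall score of 3.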

Testing radiograph dataset

The number of positive and negative images for the testing dataset was set to 24 and 96 (total 120), respectively, and they were randomly extracted from the pooled data (first, positive images were extracted, followed by negative images) (figure 2). During this process, when the number of positive and negative images extracted from 1 of the 3 institutions exceeded 48 (40%), the remaining negative images were extracted from the other 2 institutions.

Performance of the stand-alone BMAX

The ROC curve was constructed from the sensitivity and false positive (FP) rate at every fibrosing ILD confidence score calculated by BMAX, and the AUC was computed from this curve. The confidence scores were compared between true positive (TP) and FP images, and between false negative (FN) and true negative (TN) images.
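As a minimal sketch in Python (the study's analysis language), the AUC of such a curve can be computed with the rank-based formulation, which is mathematically equivalent to integrating the empirical ROC curve; the function name and this particular formulation are our own, not taken from the study:

```python
def roc_auc(scores, labels):
    """ROC-AUC via the rank (Mann-Whitney) formulation: the probability that a
    randomly chosen positive image receives a higher confidence score than a
    randomly chosen negative image, with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 indicates that every positive image outscored every negative image; 0.5 indicates chance-level discrimination.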

Comparing detection capabilities without and with CAD

In total, 5 expert physicians (qualified pulmonologists (n=3) and radiologists (n=2) with 5 or more years of clinical experience) and 20 non-expert physicians participated in the interpretation of the testing radiograph dataset. The testing interpreters did not include any of the interpreters of the pooled dataset (label makers). The testing interpreters independently interpreted the radiographs of the testing dataset in a random order. Each image was first interpreted and classified as positive or negative for fibrosing ILD without the assistance of BMAX. Next, the same image was interpreted and classified using BMAX.

The ROC curve of each interpreter was determined using sensitivity and the FP rate of each interpreter (method details are shown in online supplemental figure 1).9 We calculated the mean of the ROC-AUC of each interpreter and compared the ROC-AUC without and with the assistance of BMAX. The mean sensitivity, specificity and accuracy (the ratio of TP and TN images in the whole testing radiograph dataset (n=120)) without and with BMAX were also compared among both non-experts and experts.
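The per-interpreter metrics defined above can be sketched as follows (a minimal Python illustration with hypothetical names; accuracy is the ratio of TP and TN images to all images, as in the text):

```python
def performance(calls, truth):
    """Compute (sensitivity, specificity, accuracy) for one interpreter.

    calls: per-image judgements (True = classified fibrosing ILD positive)
    truth: ground-truth labels (True = fibrosing ILD positive)
    """
    tp = sum(c and t for c, t in zip(calls, truth))
    tn = sum((not c) and (not t) for c, t in zip(calls, truth))
    fp = sum(c and (not t) for c, t in zip(calls, truth))
    fn = sum((not c) and t for c, t in zip(calls, truth))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(truth)
    return sensitivity, specificity, accuracy
```

These per-interpreter values would then be averaged across interpreters and compared without versus with BMAX.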


The sensitivity and specificity of stand-alone BMAX, experts (without and with BMAX) and non-experts (without and with BMAX) were further stratified using the following axes: (1) the fibrosing area score (<3 or ≥3) for positive images and (2) the presence or absence of abnormalities for negative images.

The conversion rates after using BMAX (from FN to TP, from FP to TN, from TP to FN and from TN to FP) were also calculated. For each interpreter, the conversion rate from FN to TP was the proportion of all ILD-positive images for which the interpreter changed the judgement from negative to positive, and the conversion rate from FP to TN was the proportion of all ILD-negative images for which the interpreter changed the judgement from positive to negative.
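The two conversion rates defined in the text can be sketched in Python as follows (an illustrative implementation with hypothetical names, not the study's code):

```python
def conversion_rates(before, after, truth):
    """Conversion rates for one interpreter after using the CAD system.

    before: judgements without CAD (True = called positive)
    after:  judgements with CAD assistance
    truth:  ground-truth labels (True = fibrosing ILD positive)

    Returns (FN-to-TP rate over all positive images,
             FP-to-TN rate over all negative images).
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    # Positive images where the call flipped from negative to positive.
    fn_to_tp = sum(1 for b, a, t in zip(before, after, truth)
                   if t and (not b) and a) / n_pos
    # Negative images where the call flipped from positive to negative.
    fp_to_tn = sum(1 for b, a, t in zip(before, after, truth)
                   if (not t) and b and (not a)) / n_neg
    return fn_to_tp, fp_to_tn
```

The unfavourable conversions (TP to FN, TN to FP) would be computed symmetrically over the same denominators.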

Statistical methods

The Wilcoxon two-sided signed-rank test was used to compare mean ROC-AUC, sensitivity, specificity and accuracy in the interpretation without and with the assistance of BMAX. We employed the Mann-Whitney U test to compare the sensitivity and specificity of the experts and non-experts, and the confidence scores between TP and FP, and FN and TN. We used the chi-square test to compare the sensitivity and specificity of BMAX.

A p value of <0.05 was considered statistically significant. Statistical analyses were performed using SAS V.9.4M3 and Python V.3.10.11 (Python Software Foundation; Fredericksburg, Virginia, USA).

Patient and public involvement

Patients or the public were not involved in the design, conduct, reporting, or dissemination plans of our research.

Results

Participants and image characteristics

The pooled dataset included 1251 chest radiograph images taken in Kurashiki Central Hospital (n=982), Hamana Hospital (n=125) and Yokohama Gumyoji Respiratory Medicine Internal Clinic (n=144). Among them, 61 images were determined as fibrosing ILD positive by label makers (n=44 from Kurashiki Central Hospital, n=13 from Hamana Hospital and n=4 from Yokohama Gumyoji Respiratory Medicine Internal Clinic) (figure 2). The testing radiograph dataset included 120 chest radiographs taken in Kurashiki Central Hospital (fibrosing ILD-positive/ILD-negative images, 16/31), Hamana Hospital (5/30) and Yokohama Gumyoji Respiratory Medicine Internal Clinic (3/35).

Radiographic characteristics included in the testing radiograph dataset are shown in table 1. Among the positive images, the proportion of images with a fibrosing area score of 1, 2, 3, 4, 5 and 6 was 29%, 25%, 8%, 17%, 8% and 13%, respectively, indicating that over half the images in the testing radiograph dataset showed a limited fibrosis area (score<3). Most positive images presented with limited/dominant fibrosis in the middle and/or lower lung field (right lung, 75%; left lung, 88%). Most negative images showed no abnormality (82%).

Table 1

Image characteristics of chest radiographs in the testing dataset (n=120)

Performance of the stand-alone BMAX

The AUC of the ROC curve, which is composed of the sensitivity and FP rate for every fibrosing ILD confidence score calculated by BMAX, was 0.913 (figure 3). When the fibrosing ILD confidence score cut-off was set to 0.299 (ie, when BMAX presented an alert flag on the screen), the sensitivity and specificity of BMAX were 0.792 and 0.844, respectively.

Figure 3

Receiver operating characteristic curve of the stand-alone BMAX.

The cut-off that maximised the Youden Index based on the current result was 0.353, which was close to the preset cut-off point in BMAX (0.299). This cut-off (0.353) produced a sensitivity and specificity of 0.792 and 0.875, respectively.
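The cut-off selection described above can be sketched as follows (an illustrative Python implementation, not the study's actual code; each observed confidence score is tried as the positivity threshold):

```python
def best_cutoff(scores, labels):
    """Return the threshold maximising the Youden index J = sensitivity +
    specificity - 1, trying every observed score as the positivity cut-off."""
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        calls = [s >= t for s in scores]  # score at or above threshold -> positive
        tp = sum(c and y for c, y in zip(calls, labels))
        fn = sum((not c) and y for c, y in zip(calls, labels))
        tn = sum((not c) and (not y) for c, y in zip(calls, labels))
        fp = sum(c and (not y) for c, y in zip(calls, labels))
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

A J of 1.0 corresponds to a threshold that separates positive and negative images perfectly; the study's optimal cut-off (0.353) was close to the preset alert threshold (0.299).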

The confidence scores were compared between TP, FP, FN and TN images (online supplemental figure 2). The BMAX scores were significantly lower for FP than for TP images (p=0.001) and significantly higher for FN than for TN images (p=0.016). We also divided the score range from 0 to 1 into bins of width 0.1 and tallied the number of positive and negative images within each bin (online supplemental table 1). The majority of positive images had scores above 0.9, while the majority of negative images had scores below 0.1. These results indicate that a high BMAX score reflects high confidence in a positive result and, conversely, a low BMAX score reflects high confidence in a negative result.

Also, when the confidence score was stratified by the fibrosing area score for each ILD-positive image, there was a general trend of higher fibrosing area scores being associated with higher BMAX scores (online supplemental figure 3).

Comparing detection capabilities without and with BMAX

The mean ROC-AUC of all 25 interpreters was 0.814 without BMAX, which was significantly improved to 0.839 with BMAX (p=0.004). The mean ROC-AUC of the 20 non-expert interpreters was 0.795 without BMAX, which was significantly improved to 0.825 with BMAX (p=0.005). However, the mean ROC-AUC of the five expert interpreters was 0.892 without BMAX and 0.897 with BMAX; the difference was not significant (p=0.465).

There was a significant improvement in sensitivity when BMAX was used by non-expert interpreters (0.744 without BMAX vs 0.802 with BMAX; p=0.003) (table 2). There was a numeric improvement in sensitivity with BMAX assist in expert interpreters, which was not significant (0.900 without BMAX vs 0.917 with BMAX; p=0.285).

Table 2

Sensitivity, specificity and accuracy without and with the assistance of BMAX

Specificity was not significantly changed with the BMAX assist among either non-expert interpreters (0.846 without BMAX vs 0.847 with BMAX; p=0.690) or expert interpreters (0.883 without BMAX vs 0.877 with BMAX; p=0.285) (table 2). Accuracy was not significantly improved among either non-expert interpreters (0.826 without BMAX vs 0.838 with BMAX; p=0.123) or expert interpreters (0.887 without BMAX vs 0.885 with BMAX; p=0.785) (table 2).

In non-expert interpreters, sensitivity for images with a smaller fibrosing area (fibrosing area score<3) was significantly improved by the BMAX assist (0.562 without BMAX vs 0.669 with BMAX; p=0.002) (online supplemental table 2), but not in expert interpreters (0.815 without BMAX vs 0.862 with BMAX; p=0.180). For images with a fibrosing area score≥3, improvement was not observed in either non-expert (0.959 without BMAX vs 0.959 with BMAX; p=1.000) or expert interpreters (1.000 without BMAX vs 0.982 with BMAX; p=0.317).

In the analysis stratified by the fibrosing area score for positive images and the presence or absence of abnormalities for negative images, non-experts without BMAX were more prone to misdiagnosing positive images with a fibrosing area score<3 and negative images with abnormalities. Experts without BMAX demonstrated a similar tendency to misinterpret cases, particularly with respect to the fibrosing area score, although this was less pronounced than for the non-experts. BMAX also tended to struggle with positive images with smaller area scores, but not to the same degree as the non-experts (online supplemental tables 3 and 4).

The median conversion rate from TP to FN was 0% among both non-expert and expert interpreters, which was smaller than that from FN to TP (non-expert interpreters, 8.33%; expert interpreters, 4.17%) (online supplemental table 5). Representative images with a high conversion rate from FN to TP among non-expert interpreters are shown in online supplemental figure 4. All three images had limited fibrosing areas.

Discussion

The aim of the current study was to determine improvements in the detecting capability for fibrosing ILD-suspected chest radiographs among non-specialists and specialists when using a deep learning-based CAD system in real-world clinical settings. We accumulated chest radiographs of patients who had two or more visits that were consecutively taken in medical facilities with various degrees of referral (a primary care clinic, a community hospital and a tertiary referral hospital) during a defined period. Chest radiographs of patients who were only seen at the medical facility once were excluded from this study because inclusion of these images might have led to excessive ILD-negative data. Ground-truth labels for each chest radiograph were determined based on interpretations by expert ILD physicians. Although a more accurate diagnosis could be made based on the results of CT interpretation, selecting patients who underwent CT examination would result in a large bias because CT examination is reserved for patients who have severe symptoms or who have presented with abnormal shadows on a chest radiograph. Furthermore, in primary care clinics and non-referral hospitals, it is physicians rather than radiologists who interpret chest radiographs. Thus, we used the interpretation of chest radiographs by expert ILD physicians to determine ground-truth labels.

The use of BMAX significantly improved the ROC-AUC and sensitivity of non-expert interpreters for detecting fibrosing ILD-suspected images; sensitivity was particularly improved for images with a smaller fibrosing area. These results indicate that using BMAX could facilitate the detection of minor fibrosis signs by non-expert interpreters that were difficult to detect without CAD. We were concerned that using BMAX may increase the number of FPs detected; however, specificity did not decrease with BMAX for either non-experts or experts. The use of BMAX did not statistically improve the sensitivity, specificity or accuracy of detecting fibrosing ILD-suspected images for expert interpreters. This result might be influenced by the fact that the ground-truth labels were based on the interpretation of expert ILD physicians. In our previous study, it was indicated that our deep learning-based algorithm, which was the basis for BMAX, could detect subtle fibrosing signs that were difficult for even expert physicians or radiologists to detect on chest radiographs and that were identified only from CT images.8 In addition, the sensitivity and specificity of our algorithm were superior to those of expert physicians and radiologists.8 We assume that in the present study, fibrosing ILD-negative radiographs in the testing dataset included images with latent fibrosis that expert physicians could not identify on chest radiographs. Thus, if ground-truth labels had been determined using chest CT, BMAX may have improved the sensitivity of expert interpreters.

Diagnostic delays in fibrosing ILD, and especially in IPF, are currently of great concern. The median delay (defined as the time from dyspnoea onset to tertiary care centre access) in patients with IPF has been reported as 2.2 years, with longer delays associated with an increased risk of death independent of disease severity, even in the pre-antifibrotic drug era.12 Antifibrotic drugs, which can reduce the annual decline of forced vital capacity in patients with IPF and progressive fibrosing ILD, are now available3–5; however, in the majority of patients, these agents cannot improve lung function or completely stop its deterioration. The antifibrotic drug nintedanib prolongs the time to first acute exacerbation in patients with IPF and progressive fibrosing ILD.4 5 Therefore, early identification of patients with fibrosing ILD and introduction of treatment at an appropriate timepoint are thought to be crucial for improving prognosis. Hoyer et al also reported on the consequences of diagnostic delay for patients with IPF, finding that the delay was mainly attributable to the patients themselves, general practitioners and community hospitals.13 Factors contributing to diagnostic delays may include the non-specific nature of symptoms and insufficient knowledge about fibrosing ILD among physicians.14 Improved detection tools are expected to aid the early diagnosis of fibrosing ILD.14 In the present study, we showed that BMAX improved the sensitivity for detecting fibrosing ILD-suspected images among non-expert physicians in a real-world clinical setting. The use of BMAX is expected to reduce the incidence of overlooked fibrosing ILD on chest radiograph examination among non-expert physicians and to lead to improved outcomes for patients with fibrosing ILD. The improvement in sensitivity with versus without BMAX among non-expert physicians was approximately 6%. If one assumes that most patients with fibrosing ILD first present at primary care clinics or regional hospitals, this improvement is thought to be of great significance.

This study has some limitations that must be considered when interpreting the findings. First, the robustness of the diagnosis of fibrosing ILD was somewhat weak, as mentioned above, given that the ground-truth label was determined based on signs of fibrosing ILD on chest radiographs rather than CT images. Radiographs with fibrosing signs that were not identified even by expert ILD physicians (label makers), but which were identified by BMAX, might be considered FPs when the testing interpreters followed the judgement of BMAX. Moreover, radiographs with abnormalities other than fibrosis that mimic fibrosing ILD (eg, bronchiectasis, pulmonary emphysema) might have been determined as positive for fibrosing ILD by the label makers; if BMAX correctly determined such radiographs as negative for fibrosing ILD, the testing results might be FN when the testing interpreters followed the judgement of BMAX. Second, in images classified as fibrosing ILD positive, the fibrosing lesion was mostly present in the middle to lower zones of the lung; the testing radiograph dataset included only one image with upper lung-dominant fibrosis and no images with upper lung-limited fibrosis. Whether BMAX is adequate for identifying fibrosing ILD with upper lung-limited or upper lung-dominant fibrosis remains to be elucidated. Third, we tried to accumulate data equally from medical institutions with various degrees of referral because investigating data collected only from tertiary hospitals would have resulted in a large selection bias; however, much of the fibrosing ILD-positive data was ultimately derived from the tertiary hospital (Kurashiki Central Hospital). Accumulating a sufficient number of images of a rare disease such as fibrosing ILD in primary care clinics is difficult; a future prospective study might solve this issue.

Conclusions

The present study demonstrated that BMAX, a deep learning-based CAD system for detecting fibrosing ILD-suspected chest radiographs, improved the detecting capability of non-expert physicians, particularly in radiographs with a smaller fibrosing area.

Data availability statement

No data are available. The data in this study are proprietary and provided under collaboration agreements, and thus cannot be made public.

Ethics statements

Patient consent for publication

Ethics approval

This study involves human participants. The study was approved by the Medical Corporation Rikeikai Yamauchi Clinic Ethics Review Board (approval number: 2022-04-00153). The study was conducted in accordance with the ethical principles that have their origin in the Declaration of Helsinki and in compliance with the Ethical Guidelines for Medical Research Involving Human Subjects. The requirement to obtain informed consent from patients was waived because this study did not include any individual data other than chest radiograph images that had personal information removed. However, from the standpoint of product development, ethical considerations were taken into account and informed consent was obtained or opt-out opportunities were provided when possible.

Acknowledgments

The authors thank Sarah Bubeck, PhD, of Edanz (www.edanz.com) for providing medical writing support, which was funded by M3 and Nippon Boehringer Ingelheim in accordance with Good Publication Practice (GPP 2022) guidelines (https://www.ismpp.org/gpp-2022). The authors thank Akina Hirano, of M3, for constructive discussion of this study.

This post was originally published on https://bmjopen.bmj.com