AI-based X-ray fracture analysis of the distal radius: accuracy between representative classification, detection and segmentation deep learning models for clinical practice

STRENGTHS AND LIMITATIONS OF THIS STUDY

  • Three major artificial intelligence (AI) methods of fracture assessment, in clinically applicable representative implementations, are directly compared by homogenising their outputs in postprocessing.

  • Due to its clinical focus, this study eschewed fundamental technical comparisons of the methods’ respective underlying mechanics.

  • With the inclusion of both custom and commercial solutions, the study offers insights to healthcare providers seeking assistive AI in imaging diagnostics.

Introduction

Artificial intelligence (AI) in radiologic diagnostics is well established,1–3 with predictive power in cases like fracture detection exceeding human performance.4 5 In the foreseeable future, broad utilisation of available and affordable models on simple yet decisive problems will likely contribute to the general diagnostic accuracy of imaging studies and thus improve the standard of care.6 7

While the number of medical companies providing automated AI-based pathology detection is increasing,7–10 affordable hardware and publicly available software tools enable anyone to train their own high-performing and sophisticated models on a reasonably sized dataset. Consequently, until diagnostic use of AI is standardised and regulated, the modern diagnostic radiologist is confronted with a wide variety of possible forms of AI assistance.

The task of fracture detection in computer vision can be expressed as correctly classifying a radiographic image into fracture and non-fracture classes by detecting fracture features in the image. Apart from direct image classification,11 AI in the form of neural networks has also been successfully applied for object detection12 and semantic segmentation13 14 by identifying fracture elements within images instead of classifying them as a whole.

Their ability to point out the perceived fracture location makes object detection and segmentation results easily relatable to the radiologist and arguably more impactful for clinical work, whereas image classification inherently lacks spatial information. This information can be partially reconstructed from the activation gradients of trained classification models via a gradient-weighted class activation map (grad-CAM)15 of the input image, which theoretically indicates the regions most relevant to the final classification, provided the map is valid and does not falsely rely on confounding features.16 17

In order to assess these approaches, we compared the performances of three representative models on a dataset of local emergency department wrist radiographs. Distal radius fractures are the most common adult fracture, and their correct treatment is important for functional outcome. At our institution, if history and physical examination suggest a fracture, biplane X-ray is performed and evaluated by the orthopaedic surgeon and radiologist on call, both deciding on either CT or follow-up X-ray in doubtful cases. Either party could benefit from the consistency and speed of an AI tool’s additional supportive assessment.

We chose a commercially available, sophisticated AI solution trained on multicentre data for direct fracture detection (treating the fracture as an object) to establish a standard for the quality of our dataset as well as to obtain an objective reference for network performance. Custom networks were trained for the other two approaches: classifying the whole examination and segmenting exact fracture regions.

Methods

Patient and public involvement

Patients and the public were not involved in any way in the design, conduct or reporting of this study.

Dataset

We imported 2856 randomly selected initial wrist examinations, each consisting of standard ap and lateral projections, performed on adult patients in our institution’s emergency department between 2008 and 2018, into the NORA research platform18 (Nora, the medical imaging platform) for further processing (figure 1).

Figure 1

Flow chart of imaging data. PACS, Picture Archiving and Communication System.

The examinations were classified into fracture and non-fracture classes of the distal radius according to maximum fracture visibility in either projection.

As ground truth for the segmentation task, the fracture area was marked by overlaying a binary fracture foreground mask onto the image, assigning the unmarked pixels to the non-fracture background class. The dataset was randomly split into balanced test, training and validation sets, preserving fracture prevalence.

Quality control of imported examinations, image classification and fracture area segmentation were performed by a board-certified subspecialist in musculoskeletal imaging with over 10 years of experience, taking into account the relevant clinical documentation, including follow-up imaging.

No relevant data were missing.

Networks and training

Image classification

To optimise image homogeneity, we equalised the images with respect to size, resolution and brightness spectrum by normalising the pixel-wise radiolucency values to a range between −1 and 1 using built-in Tensorflow (V.2.6) and OpenCV (V.4.5.5) preprocessing functions. Images were resized and resampled to 1.6 times the final network input size to prevent information loss during subsequent batch-wise image augmentation, consisting of random cropping, rotation up to 20° and horizontal flipping. Resulting images had a uniform size of 384×512 pixels.
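As an illustration, the preprocessing and augmentation pipeline described above could be sketched as follows with TensorFlow and OpenCV; the function names and the interpolation choice are ours, not the original implementation, and 8-bit greyscale input is assumed.

```python
import cv2
import tensorflow as tf

TARGET_H, TARGET_W = 384, 512   # final network input size in pixels
OVERSIZE = 1.6                  # oversizing factor before batch-wise augmentation

def load_and_normalise(path):
    """Read a radiograph, resample it to 1.6x the input size and
    normalise pixel-wise radiolucency values to [-1, 1]."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype("float32")
    img = cv2.resize(img, (int(TARGET_W * OVERSIZE), int(TARGET_H * OVERSIZE)),
                     interpolation=cv2.INTER_AREA)
    return img / 127.5 - 1.0    # map 8-bit grey values to [-1, 1]

# Batch-wise augmentation: random crop back to input size, rotation up to
# 20 degrees (20/360 of a full turn) and horizontal flipping.
# Expects batched (batch, height, width, channels) tensors.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomCrop(TARGET_H, TARGET_W),
    tf.keras.layers.RandomRotation(20.0 / 360.0),
    tf.keras.layers.RandomFlip("horizontal"),
])
```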

For our classification model, we modified a headless imagenet-pretrained Xception-network based on the keras implementation19 by adding a global average pooling and a drop-out layer (10% drop-out during training) as well as two dense layers (128 and 2 units, respectively). Empirically,20 model performance could be improved by replacing the two superfluous input channels (the third dimension of imagenet-pretrained networks) with filtered copies of the original two-dimensional image: one after brightness inversion and one after an adaptive mean thresholding edge-enhancing filter. We conjecture that this corresponding yet contrasting input information generally facilitates fracture delineation, similar to the common practice of inverting the brightness of radiographs when looking for fracture or pleural lines. Network models, artificial image augmentation and network training were implemented using the Tensorflow library21 with Adagrad as a solver, monitoring model convergence on the training set in the integrated Tensorboard tool through loss, accuracy, precision, recall and area under the receiver operating characteristic (AUROC). The learning rate decayed polynomially from 0.1 to 0.005. Training was performed for 1000 epochs of 100 steps on balanced batches of 30 images.
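A minimal sketch of this architecture and training setup is given below. The edge-filter parameters (blockSize=11, C=2) and the ReLU activation of the 128-unit layer are illustrative assumptions, as they are not specified above.

```python
import cv2
import numpy as np
import tensorflow as tf

def to_three_channels(img_u8):
    """Fill the three ImageNet input channels with the original greyscale
    image, a brightness-inverted copy and an adaptive mean thresholding
    edge map (blockSize=11, C=2 are illustrative choices)."""
    inverted = 255 - img_u8
    edges = cv2.adaptiveThreshold(img_u8, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 11, 2)
    return np.stack([img_u8, inverted, edges], axis=-1)

def build_classifier(input_shape=(384, 512, 3)):
    """Headless imagenet-pretrained Xception plus the head described above."""
    base = tf.keras.applications.Xception(include_top=False,
                                          weights="imagenet",
                                          input_shape=input_shape)
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dropout(0.10)(x)            # 10% drop-out during training
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    out = tf.keras.layers.Dense(2, activation="softmax")(x)
    return tf.keras.Model(base.input, out)

# Learning rate decays polynomially from 0.1 to 0.005 over
# 1000 epochs x 100 steps; Adagrad serves as solver.
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=0.1, decay_steps=1000 * 100, end_learning_rate=0.005)

model = build_classifier()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=schedule),
              loss="categorical_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.Precision(name="precision"),
                       tf.keras.metrics.Recall(name="recall"),
                       tf.keras.metrics.AUC(name="auroc")])
```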

Grad-CAM activation maps of the classification results were generated and visualised as heatmaps using the tf-explain library.22
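With tf-explain, heatmap generation reduces to a few lines; here, `class_index=1` assumes the fracture class occupies the second output unit, and `images` stands for a batch of preprocessed test radiographs.

```python
from tf_explain.core.grad_cam import GradCAM

# images: batch of preprocessed radiographs; model: the trained classifier.
explainer = GradCAM()
grid = explainer.explain((images, None), model, class_index=1)  # fracture class
explainer.save(grid, output_dir=".", output_name="grad_cam.png")
```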

Image segmentation

For fracture segmentation, we implemented a stack of two patch-based networks using the patchwork toolbox,23 24 wherein matrices of fixed size represent image areas of decreasing size in increasing resolution, paralleling the architecture of a U-Net.25 The matrix size of the patching, and thus the input to the networks, was fixed at 128×128. Patch size was 128×128 pixels at full image resolution for detecting fine details and limited to 300×300 pixels at coarse resolution to capture more contextual information from the X-ray image. Data augmentation was performed at patch level with random rotation up to 20°, horizontal flipping and resizing up to 10%. The patches were drawn randomly while balancing the network input to increase the proportion of fracture-containing patches (see the sketch below). Models were trained on 10 million patches. The Adam solver was used with binary cross-entropy as loss function and default framework settings without further hyperparameter optimisation, so as to maximise model comparability. Batch size was set to 150 patches due to hardware restrictions.
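The patchwork toolbox handles the patching internally; purely to illustrate the balanced drawing of training patches, a simplified stand-alone sketch (not the toolbox’s actual API) could look like this:

```python
import numpy as np

def sample_balanced_patch(image, mask, patch=128, p_fracture=0.5,
                          rng=np.random.default_rng(0)):
    """Draw one training patch; with probability p_fracture, centre it on a
    random fracture pixel so fracture-containing patches are oversampled."""
    h, w = image.shape
    if rng.random() < p_fracture and mask.any():
        ys, xs = np.nonzero(mask)                 # fracture foreground pixels
        i = rng.integers(len(ys))
        y = int(np.clip(ys[i] - patch // 2, 0, h - patch))
        x = int(np.clip(xs[i] - patch // 2, 0, w - patch))
    else:
        y = int(rng.integers(0, h - patch + 1))   # uniformly random location
        x = int(rng.integers(0, w - patch + 1))
    return image[y:y + patch, x:x + patch], mask[y:y + patch, x:x + patch]
```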

Prediction was performed using randomly drawn patches of each image until a sufficient coverage of the image (>99%) was achieved.24 Since the resulting segmentation yielded pixel-by-pixel probabilities of containing a fracture, the maximum pixel probability was defined as the final image-level fracture probability.
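A schematic version of this prediction loop is sketched below; `predict_patch` stands in for the trained patch-based model, and the per-pixel maximum over overlapping patches is our illustrative choice of aggregation.

```python
import numpy as np

def predict_image_probability(image, predict_patch, patch=128, coverage=0.99,
                              rng=np.random.default_rng(0)):
    """Randomly draw patches until >99% of pixels are covered, then take the
    maximum pixel-wise fracture probability as the image-level score."""
    h, w = image.shape
    prob = np.zeros((h, w), dtype=np.float32)     # running per-pixel maximum
    covered = np.zeros((h, w), dtype=bool)
    while covered.mean() < coverage:
        y = rng.integers(0, h - patch + 1)
        x = rng.integers(0, w - patch + 1)
        p = predict_patch(image[y:y + patch, x:x + patch])  # (patch, patch) probs
        prob[y:y + patch, x:x + patch] = np.maximum(
            prob[y:y + patch, x:x + patch], p)
        covered[y:y + patch, x:x + patch] = True
    return float(prob.max()), prob   # image-level probability, pixel map
```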

Object detection

For the object detection approach, we applied BoneView (V.2.0.2),26 a proprietary detection algorithm by GLEAMER (Paris, France), to the images of our test dataset. The algorithm, based on the detectron2 library,27 classifies examinations into fracture, doubtful fracture and no fracture classes according to their probability score and thresholds optimised for high sensitivity and high specificity, respectively. In addition, fracture areas are marked with bounding boxes.

Given its US Food and Drug Administration-approved added value,26 28 we deem the solution by GLEAMER a suitable representative of commercial-grade detection algorithms.

Statistical analysis

We calculated accuracy, sensitivity, specificity, precision, recall and the F1-metric for each model, and AUROC for the custom models. CIs were estimated via bootstrapping (10 000 resamples). The packages exact2x2 (V.1.6.6)29 and irr (V.0.84.1)30 for R (V.4.0.3)31, as well as custom functions in NORA, were used for statistical analysis.
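For illustration, a percentile bootstrap of the AUROC with 10 000 resamples could be implemented as follows; scikit-learn is used here purely for the AUROC computation, whereas the actual analysis relied on R and NORA.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUROC with n_boot resamples."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if y_true[idx].min() == y_true[idx].max():       # one-class resample: skip
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)
```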

Results

The dataset consisted of 2856 examinations in total (table 1), containing 712 (24.9%) fractures. The test set held 400 examinations, 100 of which contained fractures, while the training and validation sets consisted of 2057 and 399 examinations with 513 and 99 fractures, respectively.

Table 1

Demographic characteristics of dataset

After reviewing initial results, we applied an arbitrary threshold of 0.50 to the maximum fracture class probability across the ap and lateral projections. This way, our classification Xception-model reached an accuracy of 0.96, whereas the segmentation patchwork-model reached an accuracy of 0.92 (table 2). Accuracies were maximised at a threshold of 0.92 for the Xception-model and 0.70 for the patchwork-model (table 3), reaching 0.97 for classification and 0.94 for segmentation.
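This decision rule can be expressed compactly; the exhaustive sweep over observed probabilities shown here is one illustrative way to obtain accuracy-maximising thresholds like those reported in table 3.

```python
import numpy as np

def image_level_probability(p_ap, p_lateral):
    """Combine projections: maximum fracture class probability of ap and lateral."""
    return max(p_ap, p_lateral)

def best_accuracy_threshold(y_true, y_prob):
    """Sweep all observed probabilities as thresholds; return the accuracy-maximising one."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    thresholds = np.unique(y_prob)
    accs = [((y_prob >= t).astype(int) == y_true).mean() for t in thresholds]
    i = int(np.argmax(accs))
    return float(thresholds[i]), float(accs[i])
```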

Table 2

Model results at 0.5 threshold

Table 3

Model results at accuracy-optimised thresholds

BoneView achieved a global accuracy of 0.95 on our test set, classifying 31 examinations as doubtful, none of which contained actual fractures. For the sake of comparison, these were conservatively regarded as fracture-negative in further analysis. Two examinations could not be classified for technical reasons.

Given the proprietary nature of the object detection algorithm and the consequent unavailability of its internal threshold values, the three-way performance comparison relied on accuracies. Other metrics showed the same tendency, with the classification model scoring highest, the segmentation model lowest and the detection model in between. At arbitrary and optimised thresholds, the custom models showed higher specificity than sensitivity, while BoneView achieved a balanced specificity and sensitivity of 0.95 each (tables 2 and 3).

The corresponding values for the AUROC of both custom models were 0.97 (95% CI 0.94 to 0.99) for classification (Xception) and 0.96 (95% CI 0.93 to 0.98) for prediction through segmentation (patchwork) (figure 2). The reported AUROC of BoneView ranges from 0.94 to 0.97.26 32

Figure 2

ROC curves for fracture prediction by the custom classification (red) and segmentation (green) models, true (TP) over false positives (FP). AUROC given with 95% CIs in parentheses calculated by 10 000 bootstrapping steps. AUROC, area under the receiver operating characteristic; ROC, receiver operating characteristic.

All three models agreed on 81 fractured and 282 non-fractured wrists, with classification and detection achieving single-digit false negative counts. Three false negatives and two false positives were shared among the models (online supplemental file 1). The differences among the three models were not significant in pairwise comparisons according to Boschloo’s test at p<0.001, nor was their respective deviation from ground truth. Pairwise Cohen’s kappa ranged from 0.80 to 0.86 (online supplemental file 1); Fleiss’ kappa for all models was 0.83. Pairwise disagreement was highest between segmentation and detection (31 examinations) and lowest between classification and detection (26 examinations).

Supplemental material

For fracture localisation, quantitative comparability was limited due to the different visualisation styles. Visual case-by-case comparison of fracture visualisation areas, however, showed good agreement in general (figure 3).

Figure 3

Visualisation of model predictions: (A) original image, (B) classification with grad-CAM heatmap, (C) predicted segmentation of fracture region, (D) BoneView-output with white bounding boxes for fracture regions. grad-CAM, gradient-weighted class activation map.

Discussion

In this study, representative models for each major AI-based image evaluation approach were trained and tested on 2856 wrist radiographs. All three achieved similarly high performance with good accuracy and interrater agreement. With simple binarisation methods in postprocessing, the custom models showed only negligible differences from the commercial model for this region.

BoneView’s internal thresholds for doubtful and certain fractures optimise false negative rate and specificity, respectively.26 Both effects were well balanced for our study set, as the doubtful class contained no actual fractures.

Accuracy-optimising the thresholds for the custom networks yielded small improvements in accuracy, specificity, precision and F1 score, while increasing false negative rates and decreasing sensitivity. Thus, the accuracy of both models improved by capitalising on group size imbalance (online supplemental file 1). The optimal thresholds of 0.92 and 0.70 improved accuracy over the arbitrary 0.50 by only 0.01 and 0.02, respectively (tables 2 and 3), so both models robustly differentiated between fracture and non-fracture. In this case, factors like the cost of positive or negative misdiagnosis at individual and population levels gain importance for model optimisation. For robustly trained models, sensitivity and specificity need to be consciously balanced against each other within the frame of their intended task, as was evidently the case with BoneView.

While commercial AI models are generally neither optimised nor validated on the local imaging data to which they are applied, their (ideally multicentre) training accounts for local differences. Our results suggest no palpable performance advantage of focused local training on about 2000 wrist images over diverse training on over 300 000 X-rays of various body regions, including 3800 wrist images. Thus, BoneView’s consistently high performance on local data without site-specific retraining is matched by specialised custom models for isolated body regions (figure 2). Apart from sophisticated preprocessing and an efficient network architecture, an effect of limited local feature variability is likely,1 contributing to faster model convergence and improved local compatibility.

For the investigated methods, accuracy increased from the segmentation model through the detection model to the classification model, while the detail of fracture visualisation decreased from pixel level through bounding box to image region (figure 3).

Within the frame of this study, our classification model, with the least explainability, would thus be best suited for automated fracture prediction; the segmentation model, with its intuitive visualisation, for AI-assisted human reading; and the detection model, with its low false negative rate, for minimising missed diagnoses at the cost of overdetection, pointing out regions of doubt even if manifest fracture features are absent.

In short, all three approaches showed comparably high performance for fracture prediction, suggesting interchangeability and freedom of choice in AI support. Still, the inherent differences in output visualisation were consistent with the trends in the statistical comparison and predispose these methods for accuracy-optimised automated reading, explainability-optimised AI-assisted reading or clinical outcome-optimised fracture indication, respectively.

The similar performances of the investigated methods suggest a common underlying mechanism of entity recognition, which we did not explore in this application-oriented study. Architecture-specific optimisation of protocols, and postprocessing approaches of higher complexity than our projection-based maximum probability, might produce relevant differences and need to be investigated in dedicated technical studies.

While BoneView is intended to detect fractures regardless of body region, the custom models were trained specifically on distal radius fractures. Occasional prediction of fractures outside the radius by the segmentation model suggests heavier reliance on general fracture features compared with the classification model. The same behaviour could be expected for multiple fractures per image, although investigating this exceeded the scope of this study.

Limitations

Our study was limited by its single-centre retrospective design. The dataset consists of X-ray images of a single skeletal region with limited patient information, acquired mostly on one fixed scanner installation, so confounding patient and hardware factors could not be explored. Future prospective studies including different body regions, imaging hardware and modalities will be necessary to assess the generalisability of our conclusions.

While postprocessing as described above allowed a unified performance comparison of different AI solutions on an isolated clinical task, factors pertaining to the everyday clinical workflow were not systematically defined and analysed. Dedicated reader studies would elucidate the validity of our observations in more dynamic settings.

Conclusions

As this study demonstrates, openly available network architectures and sophisticated software libraries enable medical centres to train fracture prediction models on a reasonably sized local dataset to commercial-grade performance levels. Our three representative models achieved similar diagnostic capabilities regarding simple fracture prediction and are thus equally well suited for clinical application. Their inherent methodological differences and associated visualisation methods predispose them for different flavours of AI tool.

When choosing AI fracture evaluation tools, radiologists should therefore consider the tool’s intended role in the clinical workflow, weighing the ease of use, robust training and possible external certification of a commercial solution against the flexibility, extensibility and adjustability of custom solutions.

Data availability statement

Data are available on reasonable request. The image datasets can be shared subject to the constraints of the DSGVO (the German implementation of the GDPR).

Ethics statements

Patient consent for publication

Ethics approval

Institutional Review Board (Albert-Ludwigs-University Freiburg) approval was obtained (file 570/19). Participants’ informed consent was waived.

Acknowledgments

The authors would like to thank Jeanne Ventre of Gleamer for providing additional details regarding BoneView’s training. The authors acknowledge support by the Open Access Publication Fund of the University of Freiburg.
