Diagnostic prediction models for bacterial meningitis in children with a suspected central nervous system infection: a systematic review and prospective validation study

Introduction

Bacterial meningitis (BM) in children is lethal and debilitating, with mortality rates between 4% and 21% and neurological sequelae occurring in up to one-third of survivors.1–3 Early start of treatment is crucial for the prognosis as delay in antibiotic treatment is associated with adverse outcomes.4 However, limiting unnecessary use of antibiotics is important to minimise antibiotic resistance, adverse reactions, hospital admissions and healthcare costs.5

Recognition of BM can be difficult. The typical triad of fever, neck stiffness and altered mental status is present in only 41% of adult patients and is even less common in children and infants.6 7 Diagnostic prediction models have been developed to help identify which children should be treated for BM and in which children a watchful waiting approach can be applied.8 The majority of these models combine clinical and laboratory findings and predict the probability of acute BM, compared with viral meningitis or no meningitis. However, substantial differences between these models exist, especially with respect to patient populations and diagnostic criteria. Validation of prediction models in a broader population of patients suspected of a central nervous system (CNS) infection is necessary because this is the population in which these models will be of clinical use, but such validation is often lacking. External validation of 16 diagnostic prediction models for BM in a cohort of 363 adult patients with a suspected CNS infection showed that none of the existing models performed well enough to recommend routine use in individual patient management. However, these models were mostly developed for children and might therefore perform better in a paediatric population.

Our aim was to perform a systematic review of prediction models for BM and validate these models using a multicentre cohort of paediatric patients with a suspected CNS infection in whom a lumbar puncture was performed.

Methods

Systematic review

We systematically reviewed the literature in Medline to identify models that predict the probability of acute BM. The Standards for Reporting Diagnostic accuracy studies 2015 guidelines and Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis: checklist for systematic reviews and meta-analyses (TRIPOD-SRMA) were applied.9 10 We used a previously validated search filter for prediction models.11 We combined this filter with terms for meningitis and prediction models and searched for full-text articles published in scientific peer-reviewed journals from January 1980 to 1 September 2022 in English, German, French, Spanish or Dutch. Prediction models were included if they contained at least three variables obtained from history, physical examination or simple laboratory tests and included children or adults. Publications describing the development, refinement or validation of a prediction model were included. Article screening and data extraction were performed by one researcher (NSG) and doubts were discussed and resolved by a second and third researcher (MCB and MWB). Quality of the reporting of the included studies (that were not previously included) was assessed according to the TRIPOD criteria, containing 6 domains with in total 22 items.12 Each item was scored as reported, reported incompletely or not reported. Characteristics and model performance of studies previously described (published before August 2018) were reported in the online supplemental material, and studies included after August 2018 were reported in the main manuscript.8

Validation cohort

Data from the Paediatric and Adult Causes of Encephalitis and Meningitis (PACEM) study were used for validation of the included prediction models. This was a multicentre prospective study in three hospitals (the Amsterdam University Medical Centers, Onze Lieve Vrouwe Gasthuis in Amsterdam and the Flevoziekenhuis in Almere) in which patients were included if they (1) were aged 0–18 years, (2) presented to the emergency department or were admitted to the paediatric ward between January 2012 and July 2015 with a suspected CNS infection and (3) underwent cerebrospinal fluid (CSF) examination.13 The cohort has been described in detail previously.13

Change in mental status was defined as a Paediatric Glasgow Coma Scale (GCS) <14; coma was defined as a GCS <8.14 Episodes were categorised into six categories regarding final diagnosis: BM, other CNS infections, inflammatory CNS diseases, systemic infections, other neurological diseases and other systemic disease.15 All episodes were independently assessed by two clinicians (NSG and SLS) and discrepancies were discussed and resolved by a third and fourth clinician (MCB and MWB). BM was defined as (1) a positive CSF culture, or (2) a negative CSF culture but positive blood culture and elevated CSF leucocyte count, or (3) a negative CSF and blood culture, elevated CSF leucocyte count, elevated infection parameters in blood (CRP >80 mg/L or leucocytes >14 × 10⁹/L), clinical parameters suggesting a bacterial infection and the final diagnosis of the treating physician. Age-specific cut-off values for abnormal CSF leucocyte count, protein and glucose were used. In children below 3 months >9 leucocytes/mm³ was considered elevated; in children of 3 months or older >6 leucocytes/mm³ was used as cut-off.16–18 CSF protein >1000 mg/L and CSF glucose levels <60% of blood glucose levels were considered abnormal. CSF leucocyte count was corrected for CSF erythrocytes by subtracting one leucocyte for every 700 erythrocytes/mm³.
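The erythrocyte correction and the age-specific leucocyte cut-offs described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and two details are assumptions not stated in the text: the correction is applied proportionally (fractions of a leucocyte are allowed) and the corrected count is floored at zero.

```python
def corrected_csf_leucocytes(leucocytes_per_mm3, erythrocytes_per_mm3):
    """Subtract one leucocyte for every 700 erythrocytes/mm3.

    Assumption: proportional correction, floored at zero.
    """
    correction = erythrocytes_per_mm3 / 700
    return max(0.0, leucocytes_per_mm3 - correction)


def csf_leucocytes_elevated(corrected_count, age_months):
    """Apply the age-specific cut-offs: >9/mm3 below 3 months, >6/mm3 from 3 months."""
    cutoff = 9 if age_months < 3 else 6
    return corrected_count > cutoff


# Hypothetical 1-month-old with a traumatic tap:
# 15 leucocytes/mm3 and 3500 erythrocytes/mm3 (not a patient from the cohort).
count = corrected_csf_leucocytes(15, 3500)
print(count, csf_leucocytes_elevated(count, age_months=1))  # 10.0 True
```

The same corrected count of 10/mm³ would also be classified as elevated in an older child, whereas a corrected count of 7/mm³ would be elevated only from 3 months of age onwards.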

Statistics

The differences in baseline characteristics between BM and non-BM patients were identified with parametric and non-parametric tests. χ² tests and Fisher’s exact tests were used to compare categorical outcomes.

The performance of the prediction models was assessed by evaluating discrimination and calibration.19–21 The different prediction models were considered as index test; diagnosis of BM was considered reference standard. Discrimination was assessed by calculating the area under the receiver operating characteristic curves (AUC) with 95% confidence intervals (CIs). Calibration was evaluated with the calibration curve, assessing the calibration slope and calculating the calibration in the large. Discriminative ability was categorised as follows: excellent discrimination in case of an AUC of ≥0.90; good discrimination for 0.80≤ AUC <0.90; fair discrimination for 0.70≤ AUC <0.80; and poor discrimination in case of an AUC <0.70.22 When cut-off values for defining high risk of BM had been provided in the original article, sensitivity, specificity and predictive values for the high-risk categories were calculated.
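As an illustration of the discrimination measure and the categories defined above, the following Python sketch computes the AUC as the probability that a randomly chosen BM case receives a higher predicted probability than a randomly chosen non-BM case (the Mann-Whitney formulation). It is not the pROC implementation used in the study, the function names are hypothetical and the example probabilities are invented.

```python
def auc(scores_positive, scores_negative):
    """AUC as the fraction of case/non-case pairs ranked correctly.

    Ties count as one half; this equals the Mann-Whitney U statistic
    divided by the number of pairs.
    """
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


def discrimination_category(auc_value):
    """Map an AUC to the categories used in this study."""
    if auc_value >= 0.90:
        return "excellent"
    if auc_value >= 0.80:
        return "good"
    if auc_value >= 0.70:
        return "fair"
    return "poor"


# Hypothetical predicted probabilities, not data from the PACEM cohort.
bm = [0.9, 0.7, 0.6, 0.3]
non_bm = [0.1, 0.2, 0.4, 0.3, 0.05]
value = auc(bm, non_bm)
print(round(value, 3), discrimination_category(value))  # 0.925 excellent
```

The pairwise loop is quadratic in cohort size, which is acceptable for a few hundred episodes; rank-based formulas give the same result more efficiently.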

For some of the multivariable logistic regression prediction models, beta coefficients from the original publication were not reported. In this case, we used the observed proportions in the risk categories as reported in the original cohort data, as expected proportions in those risk groups in the validation data.
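A minimal Python sketch of this approach is given below. The risk categories and proportions are hypothetical placeholders, not values from any of the included models; the sketch only shows how the proportion of BM reported per risk category in an original cohort can serve as the expected probability for every validation patient falling in that category.

```python
# Hypothetical proportions of BM per risk category in an original cohort.
ORIGINAL_PROPORTIONS = {"low": 0.01, "intermediate": 0.10, "high": 0.60}


def expected_probabilities(categories):
    """Assign each patient the proportion observed in the original cohort."""
    return [ORIGINAL_PROPORTIONS[c] for c in categories]


def observed_vs_expected(categories, outcomes):
    """Return {category: (n, observed proportion, expected proportion)}.

    Categories without patients in the validation data are omitted.
    """
    summary = {}
    for category, expected in ORIGINAL_PROPORTIONS.items():
        ys = [y for c, y in zip(categories, outcomes) if c == category]
        if ys:
            summary[category] = (len(ys), sum(ys) / len(ys), expected)
    return summary


# Hypothetical validation patients (1 = BM, 0 = no BM).
categories = ["low"] * 4 + ["high"] * 2
outcomes = [0, 0, 0, 0, 1, 0]
print(observed_vs_expected(categories, outcomes))
```

Because every patient in a category receives the same expected probability, discrimination and calibration computed this way are limited to the resolution of the risk categories.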

In case of unclear or missing information about the specifics of the prediction model, authors were contacted for additional information.

In models that reported the complete multivariable logistic regression model, proportions of BM assigned to the different risk categories, defined by the original model, were calculated to display the clinical significance of this spread. A sensitivity analysis in neonates (age <28 days) and in children ≥28 days of age was performed, because of the different presentation of BM at the neonatal age.23

The median percentage of missing values per variable was 12% (IQR 4%–42%). Missing data were handled by multiple imputation using the R package MICE. We used 51 variables (Supplementary methods) from medical history, physical examination and laboratory results as predictors to impute missing values.24 Missing data were regarded as missing at random and different regression models were used depending on the type of variable: logistic regression for dichotomous variables, polytomous regression for unordered categorical variables with >2 levels and predictive mean matching for numerical variables. For skewed numerical variables and derived variables (e.g., age <7 days), passive imputation was applied with log or square-root transformation where appropriate. A total of 30 imputation sets were generated based on the observed percentage missingness. If only one predictor from the model was not available in the PACEM dataset, the prediction model was validated without that particular variable. When two or more predictors from the model were not available in our dataset, the model was not validated. If a continuous predictor was missing but the median or mean value (of the original cohort) of that predictor was reported in the original article, this mean/median value was set as observed value in the validation of the prediction model. For discrimination and calibration, we used R packages pROC22 25 and PredictABEL.26

We used Rubin’s rules and bootstrapping to estimate proportions and c-statistics based on the imputation sets.27 All statistical tests were two-tailed and p values of <0.05 were considered statistically significant.
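Pooling an estimate over imputation sets with Rubin's rules can be sketched as follows. This is a generic illustration rather than the study's analysis code: the function name and the three example estimates are hypothetical (the study used 30 imputation sets), and in practice measures such as c-statistics are often pooled on a transformed scale, which this sketch does not do.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)        # average of the m estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical estimates from three imputation sets, each with a
# standard error of 0.02 (variance 0.0004).
estimate, se = pool_rubin([0.82, 0.84, 0.83], [0.0004, 0.0004, 0.0004])
print(round(estimate, 3), round(se, 4))
```

The between-imputation term inflates the standard error to reflect the uncertainty introduced by the missing data, so the pooled standard error here is larger than the 0.02 of any single imputation set.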

Patient and public involvement

None.

Results

Systematic review

Our literature search yielded 7724 articles of which 40 publications on diagnostic prediction models for acute BM were included. In total, 29 publications described the derivation of a total of 32 prediction models and 8 articles described an update, validation or explanation of a model.28–65 In total, 20 articles on 23 prediction models were included for validation in our study (online supplemental figure S1). Thirteen publications validated one or more existing models in their dataset (tables 1 and 2).28 31 34 43 45 47 49 54 56–60 66 All models were based on clinical characteristics and/or laboratory test results from blood and CSF. Characteristics of the derivation cohorts and performance measures of original models published after August 2018 are presented in tables 1 and 2, and models published before August 2018 were described in detail previously, with the exception of one study published in 2011 that was not described before (online supplemental tables S1 and S2).8 45 A total of 23 models were developed in children29–31 33–36 38–43 45 46 48–51 53 61 62 64 65 of which 7 in neonates,34 42 53 64 65 with a median cohort size in the derivation studies of 398 (IQR 158–908) patients. Five models were developed in both adults and children40 50 63 67 and three in adults.32 33 52 The most frequent quality limitations were retrospective derivation cohorts, lack of reporting on handling of missing data and little information about differences in distribution of important variables between the derivation and validation cohort (online supplemental table S3).

Table 1

Characteristics of included prediction models (published after August 2018)

Table 2

Derivation and previous validation of identified prediction models (published after August 2018)

Description of cohort

Between 2012 and 2015, a total of 468 episodes were included, of which 450 episodes could be used in the analysis (table 3). Reasons for exclusion were lack of information in online and paper files (n=14), multiple admissions of one patient in a short timeframe (n=2) and age at admission of 18 years or older (n=2). Included patients were female in 194 out of 450 (43%) cases, and median age at admission was 1.5 months (IQR 0.4–12). A total of 75% of children were <1 year old, 40% of children were <28 days old and 92 of 176 (52%) of neonates were born prematurely. For the analyses, three cohorts were used: the entire cohort of all children (n=450), neonates only (<28 days of age, n=176) and children aged ≥28 days (n=274). The neonate cohort included 16 (9%) BM cases and the children aged ≥28 days included 14 (5%) BM cases.

Table 3

Baseline characteristics validation cohort*

Symptoms were present <24 hours in 227 of 402 (56%) patients. Most common symptoms were fever in 268 of 420 (64%), irritability in 193 of 429 (45%), meningeal irritation in 48 of 249 (19%) and a decreased level of consciousness in 80 of 450 (18%). Median CSF leucocyte count in all children was 4 (IQR 1–9). CSF examination showed elevated leucocytes (corrected for CSF erythrocyte count) in 70 of 258 (27%) patients below 3 months old and in 32 of 166 (19%) patients of 3 months old or older. CSF protein was elevated in 73 of 419 (17%) patients and the CSF to blood glucose ratio was decreased in 104 of 263 (39%) patients.

CNS infection was diagnosed in 74 of 450 (16%) patients, of whom 30 (41%) had BM, 39 (53%) viral meningitis and 5 (7%) infectious encephalitis. Other diagnostic categories included CNS inflammatory disease (3%), systemic infection (61%), other neurological disease (14%) and other systemic disease (6%). CSF culture was positive in 10 of 30 patients (30%) clinically diagnosed with BM and showed Streptococcus pneumoniae in 3, S. agalactiae in 3, Neisseria meningitidis in 2, Haemophilus influenzae in 1 and Escherichia coli in 1. Blood culture was positive in 11 of 30 BM patients (37%) and showed S. agalactiae in 4, H. influenzae in 1, S. pneumoniae in 2, Klebsiella pneumoniae in 1, N. meningitidis in 2 and E. coli in 1. One patient had a positive CSF culture for S. pneumoniae and a negative blood culture, one patient had a positive blood culture for K. pneumoniae and a negative CSF culture and in one patient CSF culture failed due to a traumatic lumbar puncture and blood culture was positive for S. agalactiae.

Validation of prediction models

We validated 23 prediction models in our cohort: discrimination and calibration could be calculated for 13 models and sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) could be calculated for 20 models. The models of Cheng, Chen, Dalai, Dubos, Freedman, Mintegi and Pelkonen were excluded from validation because these models included two or more predictors that were not available in our PACEM dataset.34 38 39 43 49 64 65 The model of Mentis and the model of Li were not validated because these models were not reported in sufficient detail to perform validation.42 63 In total, 36 (80%) of the total number of 45 predictors from the 25 models were available in our dataset (online supplemental table S4). For the model of Boum, we were unable to assign points for a positive Brudzinski or Kernig sign; however, since the variable neck stiffness was considered a good proxy, this model was validated in our cohort. For the model of Spanos, the logistic regression model was used to calculate the AUC and calibration, and the CSF predictors (Spanos criteria) were used to calculate sensitivity, specificity, NPV and PPV. For the model of Huang, the median of lactate dehydrogenase as observed in the original cohort was imputed to validate the model in our cohort; however, this led to a negligible difference in predictions when compared with leaving out this variable. For the model of Mintegi, three out of a total of seven points could be assigned for procalcitonin, but this value was not available in our dataset and the contribution of this variable to the total model was considered high. Therefore, this model was not validated by using observed values from the original cohort. For the model of Bonsu (2008) and de Cauwer, the predicted probabilities could not be calculated because the beta coefficients of the original model could not be retrieved. However, observed proportions per risk interval from the original cohort were available and were used as the expected proportions for validation.30 35 Finally, we could assign no more than two points for duration of symptoms in days for the model of Oostenbrink because these data were not available in our dataset.48 We adjusted the original cut-off for the model of Oostenbrink by reducing the maximum number of points by the percentage of points that were not available due to the missing values. Moreover, predictive values were calculated for the combined Oostenbrink model only.

All children

Discrimination was excellent in 2 models in all children and good in 6 of 13 models (table 4). The AUCs in these models ranged from 0.69 to 0.94 (median 0.83, IQR 0.79–0.87). The models of Bonsu, Nigrovic, the clinical model of Oostenbrink and the LRM model of Spanos showed an AUC below 0.80, indicating fair discrimination. In all children, the second model of Huang scored best in terms of discrimination with an AUC of 0.94 (95% CI 0.91 to 0.97). Moreover, sensitivity of the model of Huang was 83% (95% CI 80% to 87%), with 90% specificity (95% CI 87% to 93%), 37% PPV (95% CI 33% to 42%) and 99% NPV (95% CI 99% to 100%). The calibration slopes indicated poor fit of all the models and none of the calibration curves showed reasonable agreement between the predicted and observed probability. Only the models of Boyer and Nigrovic showed a p value of the calibration slope >0.05. Moreover, calibration-in-the-large showed overestimation or underestimation in all of the models (table 4, online supplemental figure S2). As an example, the proportions of BM in low-risk and high-risk groups are shown in online supplemental table S5.
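For readers less familiar with these calibration measures, the Python sketch below shows one common way to obtain them: the calibration slope is the coefficient b in a logistic regression of the observed outcome on the logit of the predicted probability, and calibration-in-the-large is the intercept a when the slope is fixed at 1. This is a generic illustration fitted by Newton-Raphson, not the PredictABEL implementation used in the study; the function names are hypothetical, the example data are invented, and predicted probabilities are assumed to lie strictly between 0 and 1.

```python
from math import exp, log


def logit(p):
    return log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + exp(-z))


def calibration_in_the_large(pred, y, iterations=50):
    """Intercept a in logit(P(y=1)) = a + logit(pred), with the slope fixed at 1."""
    lp = [logit(p) for p in pred]
    a = 0.0
    for _ in range(iterations):
        mu = [sigmoid(a + x) for x in lp]
        gradient = sum(yi - m for yi, m in zip(y, mu))
        hessian = sum(m * (1 - m) for m in mu)
        a += gradient / hessian
    return a


def calibration_slope(pred, y, iterations=50):
    """Intercept a and slope b in logit(P(y=1)) = a + b * logit(pred)."""
    lp = [logit(p) for p in pred]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        mu = [sigmoid(a + b * x) for x in lp]
        w = [m * (1 - m) for m in mu]
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * x for yi, m, x in zip(y, mu, lp))
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical data: predictions that are too extreme (1/17, 1/2, 16/17)
# for groups whose observed event rates are 20%, 50% and 80%.
pred = [1 / 17] * 10 + [0.5] * 10 + [16 / 17] * 10
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
a, b = calibration_slope(pred, y)
print(round(a, 3), round(b, 3))
print(round(calibration_in_the_large(pred, y), 3))
```

In this invented example the slope is about 0.5, the signature of predictions that are too extreme, while calibration-in-the-large is close to 0 because the average predicted risk matches the overall event rate. A well-calibrated model has a slope near 1 and an intercept near 0.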

Table 4

Discrimination and calibration for all children

Median sensitivity of the 20 models was 80% (IQR 68%–91%) overall (table 5). NPV was ≥99% in 3/20 (15%) models overall.30 35 46 None of the models showed a sensitivity and NPV of 100% in all children.

Table 5

Sensitivity, specificity and predictive values for all children

Median specificity was 67% (IQR 49%–91%) overall. Highest specificity was reached by the model of Deivanayagam (100%, 95% CI 99% to 100%) and the second model of Mirkhani (99%, 95% CI 98% to 100%) overall. Sensitivity of these models was only 24% (95% CI 20% to 28%) and 30% (95% CI 26% to 34%), respectively. Performance of models that were originally developed in children or neonates (n=13) differed from models developed in adults (n=3) or both children and adults (n=4), with median sensitivity in child models of 83% (IQR 80%–92%) compared with 58% (IQR 44%–74%) in (child and) adult models, and median specificity of 60% (IQR 44%–80%) in child models compared with 91% (IQR 75%–95%) in models developed in adults or both. The combination of sensitivity and specificity was best in the models of de Cauwer (resp. 97%, 50%) and Nigrovic (resp. 94%, 51%), and both showed an NPV of ≥99%.
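The relation between these measures can be made concrete with a small Python sketch. The 2×2 counts below are hypothetical: they were chosen for a cohort of 450 episodes with 30 BM cases so that the resulting values are close to those reported for the second model of Huang, but they are not taken from the study, in which estimates were pooled over imputation sets.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity and predictive values from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts: 30 BM cases and 420 non-BM episodes.
metrics = diagnostic_metrics(tp=25, fp=42, fn=5, tn=378)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With a BM prevalence of about 7%, even a specificity of 90% yields more false positives than true positives in this example, which is why the PPV stays below 40% while the NPV approaches 99%.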

Neonates

Discrimination was excellent in 2 models in neonates and good in 4 out of 13 models. The AUCs ranged from 0.58 to 0.91 (median 0.79, IQR 0.75–0.82, online supplemental table S6). Calibration in the large showed overestimation or underestimation in all of the models validated in the neonate cohort. Median sensitivity of the 20 models was 78% (IQR 59%–90%, online supplemental table S7). None of the models showed a sensitivity of 100%. Median specificity was 70% (IQR 45%–88%) in neonates. Only the model of Deivanayagam showed a specificity of 100% (95% CI 99% to 100%), with a sensitivity of 17% (95% CI 11% to 23%). Models developed in neonates (n=3, median sensitivity 88% (IQR 81%–88%), median specificity 86% (IQR 67%–88%)) overall performed better compared with models that were developed in children or adults or both (n=17, median sensitivity 73% (IQR 53%–91%), median specificity 52% (IQR 43%–88%)) in this cohort of neonates.

Children ≥28 days of age

Discrimination was excellent in six models and good in six models in children ≥28 days of age (online supplemental table S8). The AUCs ranged from 0.74 to 0.96 (median 0.89, IQR 0.82–0.92). Calibration in the large showed overestimation or underestimation in all of the models; however, the CSF model of Oostenbrink and the model of Boyer showed reasonable calibration with a slope of 1.1 (95% CI 0.6 to 1.5) and 1.0 (95% CI 0.5 to 1.4), respectively. Median sensitivity of the models was 82% (IQR 70%–93%). One model (De Cauwer) showed a 100% sensitivity, but with a specificity of 50% (IQR 44%–56%, online supplemental table S9). Moreover, nine models in this cohort showed an NPV of 99% or higher. Median specificity was 70% (IQR 48%–92%) in this cohort.

Models developed in children ≥28 days of age, adults or both (n=17, median sensitivity 85% (IQR 64–93), median specificity 65% (IQR 40–92)) overall showed higher sensitivity but lower specificity compared with models developed in neonates (n=3, median sensitivity 79% (IQR 75–84), median specificity 92% (IQR 74–92)) in this cohort of children aged ≥28 days of age.

Discussion

We validated 23 clinical and laboratory-based diagnostic prediction models for bacterial meningitis, identified in a systematic review, using a cohort of 450 children with a suspected CNS infection. Quality of the included studies varied widely regarding study design, statistical analyses and reporting on model-building procedures. Discrimination was excellent in two models and good in 6 out of 13 models (4 out of 13 in neonates). Calibration showed relevant overestimation or underestimation of BM by all models. However, a sensitivity of 100% is required for implementation in clinical care due to the devastating consequences of missing this disease. Therefore, the results of these prediction models should always be incorporated with all information from the patient’s history, physical examination and ancillary investigations, and although some of these models provide valuable diagnostic information, they should not be used as a stand-alone test.

Models developed in children performed worse in our paediatric cohort than in the adult cohort in which they were validated previously.8 Moreover, all models validated in this study performed worse than in their original publication. This is likely largely due to differences in case-mix between our validation cohort and the original derivation cohorts. Patient age differed significantly between the derivation cohorts, ranging from models derived in preterm neonates only to patients of all ages or adults, whereas our validation cohort consisted of children aged 0–18 years old. Even though discrimination overall was good to excellent, we show that calibration was poor, which can be expected and can have many causes. Often, patient characteristics change over time and disease incidence or prevalence rates can vary between different populations, healthcare settings and countries. When an algorithm is developed in a cohort with a high incidence of BM, it may systematically give overestimated risk estimates (poor calibration in the large) when used in a setting with a lower BM incidence, as is the case in this study. Moreover, BM symptoms can vary greatly between different age groups. For example, neonates often show very non-specific signs and symptoms in meningitis, resulting in a different effect of clinical predictors and therefore poorer calibration slopes when validating models based on clinical characteristics in this population. Other common causes of reduced discrimination and calibration are related to methodological problems regarding the algorithm itself, such as statistical overfitting. When using these models in new populations, it is important to take into account that the incidence of the outcome in the new population, as well as the different effect of predictors in this new population, will result in varying calibration, depending on the population in which the model is used. Before applying a model to a new population, recalibration of the model will sometimes be needed to adjust for these differences.
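One simple form of such recalibration is an intercept update that accounts for a difference in outcome prevalence while leaving the predictor effects unchanged. The Python sketch below illustrates this idea only; it was not performed in this study, the function name and prevalences are hypothetical, and it assumes that the only difference between the two populations is the prevalence of BM.

```python
from math import exp, log


def recalibrate_for_prevalence(p, prevalence_original, prevalence_new):
    """Shift a predicted probability on the logit scale by the log ratio of
    the outcome odds in the new versus the original population."""
    def odds(q):
        return q / (1 - q)

    shift = log(odds(prevalence_new) / odds(prevalence_original))
    z = log(odds(p)) + shift
    return 1 / (1 + exp(-z))


# Hypothetical model derived in a cohort with 30% BM, applied in a
# population with 7% BM (similar to the prevalence in our cohort).
print(round(recalibrate_for_prevalence(0.5, 0.30, 0.07), 3))
```

Under these assumptions a predicted risk of 50% in the derivation setting corresponds to roughly 15% in the lower-prevalence setting. An intercept update of this kind cannot repair a poor calibration slope, for which the predictor effects themselves would need to be re-estimated.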

We chose to validate the prediction models in a broad cohort of patients with a suspected CNS infection because this is the population in which these models will be used in daily practice. Therefore, validation in such a cohort provides a realistic view. Moreover, it shows that these models cannot be directly applied to all new populations without validating and adjusting the model to the specific characteristics of the population accordingly.

The question also remains whether machine-learning-based algorithms outperform clinical judgement. A systematic review comparing diagnostic prediction models with clinical judgement for various medical conditions found that prediction models reduced the proportion of missed diagnoses in only 2 out of 46 publications.68 This was offset by a larger number of false positives as well. Comparing the combination of clinical judgement assisted by prediction rules to clinical judgement alone would provide the most valuable information on the added value of prediction models on patient outcome, but studies on this topic are lacking thus far.

To date, a large number of prediction models for BM have been developed but none showed excellent discrimination and calibration when validated in a broader population of all patients suspected of a CNS infection.

However, comparing the discriminative performance of prediction models for BM to an ideal diagnostic test with excellent discrimination might not be fair, since a diagnostic test for paediatric BM with 100% sensitivity and 100% specificity does not exist in clinical practice. All current tests show limitations that should be taken into account when assessing the results of an individual patient. Diagnostic prediction models could aid diagnosis in addition to other diagnostic investigations, but should not be used on their own.

Future research should also focus on different ways of improving diagnosis in paediatric BM. Better biomarker-based point-of-care tests that can accurately exclude and include BM in children are needed, especially in complex cases in which definite diagnosis is still unclear after conventional CSF examination.

Our research has some limitations. First, some models included variables that were not available in our dataset. These models were validated without those variables, which could lead to a difference in performance. Second, a substantial proportion of data were missing and had to be imputed. Although 30 different imputation sets were used, this could have led to some distortion of the performance measurements. Third, the number of patients with BM in our validation cohort was limited. Our validation cohort included 450 patients, including 30 patients with BM (7%), possibly leading to less reliable and generalisable results. When including a small number of cases, the findings are more susceptible to random variation. Because CIs were broad, validation in larger cohorts could show better performance. Last, 18 of the total of 468 (4%) episodes could not be included due to lack of data on predictor or outcome variables. Lack of data in the paper files of these episodes could be a result of an acute setting with an ill patient resulting in less time for data documentation, possibly influencing the number of severely ill (BM) patients.

In conclusion, this review analysed 40 articles on diagnostic prediction models for BM in children and validated 23 prediction models in a multicentre prospective cohort of 450 children suspected of a CNS infection. The models showed good to excellent diagnostic accuracy with poor calibration in all models. Diagnostic prediction models could be of help in the diagnostic workup of paediatric BM but are not recommended for use on their own in routine individual patient care. Future research should focus on the added value of prediction models in clinical practice.

This post was originally published on https://bmjopen.bmj.com