External validation of three diabetes prediction scores in a Spanish cohort: does adding high risk for depression improve the validation of the FINDRISC score (FINDRISC-MOOD)?

Introduction

Diabetes risk scores such as FINDRISC,1 DESIR2 and ADA3 can be used to identify individuals who may require laboratory tests such as fasting plasma glucose, HbA1c and oral glucose tolerance test (OGTT). These risk scores can help stratify populations according to prognosis and enable the implementation of interventions to prevent cardiovascular complications and delay or halt the onset of diabetes mellitus (DM).

To ensure the accuracy of these risk scores, they are usually validated in the settings where they will be used (external validation), and the most appropriate cut-off point in terms of sensitivity and specificity is determined. In some cases, the FINDRISC score has been adapted to make it easier to use without compromising accuracy by removing some of the original variables4 5 or adding new ones.6

Other risk scores, such as the German Diabetes Risk Score,7 include psychosocial variables such as perceived chronic stress. However, apart from the model developed by Atlantis et al,8 no diabetes risk score has included a history or high risk of depressive disorders. A recent meta-analysis9 showed that people with major depressive disorder have a higher risk of developing type 2 diabetes mellitus (T2DM), and the Lifelines cohort study showed a relationship between high risk of depression and diabetes incidence after 9 years of follow-up in individuals with pre-diabetes at baseline.10 We therefore hypothesised that incorporating depression risk, defined as a Patient Health Questionnaire-9 (PHQ-9) score of 10 or higher, into the original FINDRISC score could improve its performance in identifying the risk of T2DM in an external validation.

The present study aims to validate three diabetes risk scores (FINDRISC, DESIR, ADA) in the Spanish population to predict the incidence of T2DM after a long follow-up period (median 7.3 years) and to test the value of adding high risk of depression to the FINDRISC score in terms of area under the receiver operating characteristic curve (AUROC).

Material and methods

Design

This study was conducted as part of a broader project funded by the Spanish Instituto de Salud Carlos III (PI 1500259). It included the Screening PRE-diabetes and type 2 DIAbetes (SPREDIA-2) study, which has been described in detail elsewhere.11 SPREDIA-2 is a population-based prospective cohort study, in which baseline visits were scheduled from July 2010 to March 2014.

Population

Baseline visit

The study population comprised a random sample of 2553 subjects living in the north of the city of Madrid (Spain) in an area served by 10 primary healthcare centres. Of these, 1592 (62.4%) agreed to participate, and 1426 had not been previously diagnosed with DM.

Recruitment was divided into three stages:

  1. Potential participants were sent a letter signed by their general practitioner explaining the aims of the study and inviting them to participate.

  2. Subjects were contacted by telephone to clarify any doubts and, if interested, were given an appointment to be assessed.

  3. The patient attended the assessment at the Carlos III Hospital outpatient clinic after an overnight fast.

A fasting blood sample was taken on arrival at the outpatient clinic to determine levels of glucose, creatinine, uric acid, HbA1c, serum insulin, lipids and lipoproteins. Immediately after blood sampling, all subjects not previously diagnosed with DM underwent an OGTT with 75 g of anhydrous glucose in a total fluid volume of 300 mL, and a second blood sample was taken 2 hours later. The following questionnaires were administered: diabetes risk scores (FINDRISC, DESIR and ADA); the PHQ-9 depression scale;12 the 14-item Questionnaire to assess adherence to the Mediterranean diet (PREDIMED);13 and the 12-item Short-Form Health Survey.14 A full clinical history was taken. Alcohol consumption was measured as the number of units of alcohol per week.

Follow-up

Participants were followed up for a median of 7.3 years between the baseline visit and 31 December 2019 using their general practitioners’ electronic health records (EHRs). The EHRs had previously been validated15 and used in epidemiological studies.16 The participants were also contacted by telephone during the last year of follow-up to ascertain whether they were alive and, if so, to record health status, including the incidence of T2DM or cardiovascular events. The interview was conducted by a researcher trained in obtaining medical data by telephone. The study flow chart is shown in figure 1.

Figure 1

Flow chart of people included in the study. DM, diabetes mellitus; OGTT, oral glucose tolerance test; PHQ-9, Patient Health Questionnaire-9.

Measurement tools and definitions of criteria

FINDRISC risk score

The Finnish Diabetes Risk Score is one of the most widely used diabetes risk scores.1 It includes eight anthropometric and lifestyle variables: age, body mass index (BMI), waist circumference, family history of diabetes, use of blood pressure medication, history of high blood glucose levels, daily physical activity and daily intake of vegetables, fruits and berries. FINDRISC estimates the likelihood of developing T2DM over the next 10 years. The score ranges from 0 to 26 points, and the usual cut-off is 15. A score of 0–14 points indicates a low-to-moderate risk of diabetes (1%–17% risk over 10 years), 15–20 points a high risk (33% risk over 10 years) and 21–26 points a very high risk (50% risk over 10 years).17
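As an illustration only, the risk bands above can be expressed as a small lookup. This is a minimal sketch that assumes the band boundaries of the original FINDRISC (0–14, 15–20 and 21–26 points, with 20 belonging to the high-risk band); the function names are hypothetical and the item-level scoring of FINDRISC is not reproduced here.

```python
def findrisc_band(score: int) -> str:
    """Map a FINDRISC total (0-26) to the coarse 10-year risk bands."""
    if not 0 <= score <= 26:
        raise ValueError("FINDRISC total must be between 0 and 26")
    if score <= 14:
        return "low-moderate"  # about 1%-17% risk over 10 years
    if score <= 20:
        return "high"          # about 33% risk over 10 years
    return "very high"         # about 50% risk over 10 years


def findrisc_screen_positive(score: int, cutoff: int = 14) -> bool:
    """Screen-positive when the score exceeds the cut-off (>14, ie, 15 or more)."""
    return score > cutoff
```

With a cut-off expressed as ">14", a score of 15 is the lowest screen-positive value, which corresponds to the usual cut-off of 15.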

DESIR risk score

DESIR was designed by Balkau et al18 in the French population. The component variables differ by sex. In women, the variables include waist circumference (cm), family history of diabetes and arterial hypertension, while for men, they include waist circumference (cm), current smoking status and arterial hypertension. The waist circumference categories differ by sex (for women, <70, 70–79, 80–89, ≥90; for men, <80, 80–89, 90–99, ≥100). The score ranges from 0 to 5 points (a higher score means a higher risk).
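The sex-specific structure of DESIR can be sketched as follows. The text gives the variables, the waist categories and the 0–5 range but not the point weights, so the weights here are an assumption chosen to be consistent with that range: 0–3 points for the four waist categories and 1 point each for the two binary items. The function and parameter names are hypothetical.

```python
def desir_score(sex: str, waist_cm: float, hypertension: bool,
                family_history_dm: bool = False,
                current_smoker: bool = False) -> int:
    """Hypothetical DESIR-style score (0-5); point weights are assumed."""
    # Sex-specific waist circumference thresholds (cm), as listed in the text
    if sex == "female":
        thresholds = (70, 80, 90)
        extra_item = family_history_dm   # women: family history of diabetes
    elif sex == "male":
        thresholds = (80, 90, 100)
        extra_item = current_smoker      # men: current smoking
    else:
        raise ValueError("sex must be 'female' or 'male'")
    # Assumed weighting: 0-3 points, one per waist threshold reached
    waist_points = sum(waist_cm >= t for t in thresholds)
    # Assumed weighting: 1 point each for the sex-specific item and hypertension
    return waist_points + int(extra_item) + int(hypertension)
```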

ADA risk score

The American Diabetes Association risk score3 was developed based on the US population older than 20 years without DM to identify individuals at high risk for DM or pre-diabetes. It includes the following variables: age, sex, race, weight, height, family history of DM, history of gestational DM, history of arterial hypertension and physical activity. The total score ranges from 0 to 11. A score of 5 or higher indicates a high risk of DM.

PHQ-9

This validated and reliable scale has been used in many research studies.19 It is used to assist primary care practitioners in diagnosing depression and monitoring treatment. The stem question is ‘Over the last 2 weeks, how often have you been bothered by any of the following problems?’, and each of the nine problems is rated with four response options: (0) Not at all; (1) Several days; (2) More than half the days and (3) Nearly every day. The total PHQ-9 score is the sum of the nine item scores (range 0–27), and higher scores indicate more severe depressive symptoms. Scores are categorised into five levels of severity: minimal=0–4; mild=5–9; moderate=10–14; moderately severe=15–19 and severe=20–27. A score of 10 is the optimal cut-off for major depressive disorder,20 and an individual participant data (IPD) meta-analysis confirmed that a cut-off of 10 yields the best balance of sensitivity (85%) and specificity (85%).21 The PHQ-9 takes approximately 2–5 min to complete and can be self-administered.22
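The scoring rules described above (sum of nine items coded 0–3, five severity bands and a positive screen at ≥10) can be sketched as follows. This is a minimal illustration; the function names are hypothetical.

```python
from typing import Sequence

# Severity bands from the text: (inclusive upper bound, label)
SEVERITY_BANDS = [
    (4, "minimal"),
    (9, "mild"),
    (14, "moderate"),
    (19, "moderately severe"),
    (27, "severe"),
]


def phq9_total(items: Sequence[int]) -> int:
    """Sum nine item scores, each coded 0 (not at all) to 3 (nearly every day)."""
    if len(items) != 9 or any(i not in (0, 1, 2, 3) for i in items):
        raise ValueError("PHQ-9 needs nine item scores, each coded 0-3")
    return sum(items)


def phq9_severity(total: int) -> str:
    """Map a PHQ-9 total (0-27) to its severity band."""
    if not 0 <= total <= 27:
        raise ValueError("PHQ-9 total must be between 0 and 27")
    for upper, label in SEVERITY_BANDS:
        if total <= upper:
            return label
    raise AssertionError("unreachable: bands cover 0-27")


def phq9_positive(total: int, cutoff: int = 10) -> bool:
    """Positive screen (high risk of depression) when the total is >= cutoff."""
    return total >= cutoff


total = phq9_total([1, 1, 1, 1, 1, 1, 1, 1, 2])
print(total, phq9_severity(total), phq9_positive(total))  # 10 moderate True
```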

FINDRISC-MOOD risk score

This is an adaptation of the original FINDRISC risk score in which five points are added if the participant screens positive for depression on the PHQ-9 (score ≥10); no points are added if the PHQ-9 is negative. The choice of five points was based on the beta coefficient for a positive PHQ-9 in a multivariate logistic regression model for the prediction of diabetes adjusted for the FINDRISC score (beta coefficient=0.632). This coefficient was multiplied by 9, which was the smallest common multiplication factor possible to obtain a sensitive score.
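A minimal sketch of the FINDRISC-MOOD rule as stated in the text: five points are added to FINDRISC when the PHQ-9 total is ≥10, so the combined score ranges from 0 to 31. The derivation of the five points from the regression coefficient is not reproduced; the constant and function names are hypothetical.

```python
PHQ9_CUTOFF = 10   # PHQ-9 total >= 10 counts as a positive screen
MOOD_POINTS = 5    # points added for a positive PHQ-9, as stated in the text


def findrisc_mood(findrisc: int, phq9_total: int) -> int:
    """FINDRISC plus 5 points if the PHQ-9 total is >= 10; otherwise unchanged."""
    if not 0 <= findrisc <= 26:
        raise ValueError("FINDRISC must be between 0 and 26")
    if not 0 <= phq9_total <= 27:
        raise ValueError("PHQ-9 total must be between 0 and 27")
    return findrisc + (MOOD_POINTS if phq9_total >= PHQ9_CUTOFF else 0)
```

For example, a participant with FINDRISC 12 and PHQ-9 11 scores 17 on FINDRISC-MOOD and would be screen-positive at a cut-off of >14, whereas the same FINDRISC score alone would not be.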

Anthropometric measurements were obtained using standard methods.23

Diagnostic criteria

Incidence of DM (gold standard): Incident cases of diabetes were identified by a single criterion as follows: treatment for diabetes (n=16), fasting plasma glucose ≥126 mg/dL (n=21), new diagnosis in the EHR (International Classification of Primary Care, Second Edition, code T90) (n=6) or self-reported diagnosis in the telephone interview (n=1). A further 60 individuals met more than one criterion (total, n=104).

Metabolic syndrome: It was defined according to ATPIII diagnostic criteria.24

Statistical methods

The statistical analyses were conducted using IBM SPSS Statistics for Windows, V.26.0 (IBM Corp, Armonk, New York, USA) and MedCalc for Windows, V.15.8 (MedCalc Software, Ostend, Belgium). The sociodemographic and clinical characteristics of the study population at baseline are presented as frequencies and percentages for categorical variables and as means and SDs for continuous variables. Between-group comparisons were performed using a χ2 or Fisher’s exact test for categorical variables and a t-test or Kruskal-Wallis test for continuous variables.

To calculate the sample size, the following assumptions were accepted: an α error of 0.05 (two-sided), a precision of 9%, an expected specificity of 80% and an expected incidence of DM of 6%. The total sample size required was 1217 participants.

The performance of the diabetes risk scores was assessed using the following indicators: sensitivity, specificity, positive and negative predictive values, the Youden Index (sensitivity+specificity−1) and positive and negative likelihood ratios. All indicators were calculated using incident T2DM as the gold standard. Statistical significance was set at p<0.05 for a two-tailed test.
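The indicators listed above all derive from the 2×2 table of screen result against incident T2DM. The sketch below shows the standard formulas; the counts in the usage example are hypothetical round numbers, not data from this study, and the names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    youden: float
    lr_pos: float
    lr_neg: float


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> DiagnosticMetrics:
    """Screening indicators against incident T2DM as the reference standard."""
    sens = tp / (tp + fn)   # screen-positive among incident cases
    spec = tn / (tn + fp)   # screen-negative among participants without T2DM
    return DiagnosticMetrics(
        sensitivity=sens,
        specificity=spec,
        ppv=tp / (tp + fp),
        npv=tn / (tn + fn),
        youden=sens + spec - 1,
        lr_pos=sens / (1 - spec) if spec < 1 else float("inf"),
        lr_neg=(1 - sens) / spec if spec > 0 else float("inf"),
    )


# Hypothetical counts: 100 incident cases and 1000 non-cases
m = diagnostic_metrics(tp=50, fp=200, fn=50, tn=800)
```

With these counts, sensitivity is 0.50, specificity 0.80, the Youden Index 0.30 and the positive likelihood ratio 2.5.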

The discriminative accuracy of the different risk scores was assessed and expressed as the AUROC and corresponding 95% CIs. AUROCs were compared between scores using MedCalc software.
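For readers who wish to reproduce this type of comparison outside MedCalc, the sketch below computes the AUROC as the Mann-Whitney statistic and compares two AUROCs measured on the same participants with the nonparametric method of DeLong et al. To our understanding this is the method MedCalc documents for comparing paired ROC curves, but we have not verified the exact settings used in this analysis. It assumes NumPy is available; the function names are hypothetical.

```python
import math

import numpy as np


def _placements(pos: np.ndarray, neg: np.ndarray):
    """Mann-Whitney kernel: 1 if case > non-case, 0.5 if tied, 0 otherwise."""
    diff = pos[:, None] - neg[None, :]
    psi = (diff > 0).astype(float) + 0.5 * (diff == 0)
    # AUROC, per-case placement values (V10) and per-non-case values (V01)
    return psi.mean(), psi.mean(axis=1), psi.mean(axis=0)


def auroc(y, score) -> float:
    y = np.asarray(y, dtype=bool)
    s = np.asarray(score, dtype=float)
    return float(_placements(s[y], s[~y])[0])


def delong_paired(y, score_a, score_b):
    """Difference, z statistic and two-sided p for two correlated AUROCs."""
    y = np.asarray(y, dtype=bool)
    a = np.asarray(score_a, dtype=float)
    b = np.asarray(score_b, dtype=float)
    auc_a, v10_a, v01_a = _placements(a[y], a[~y])
    auc_b, v10_b, v01_b = _placements(b[y], b[~y])
    m, n = int(y.sum()), int((~y).sum())
    # 2x2 covariance matrices of placement values among cases and non-cases
    s10 = np.cov(np.vstack([v10_a, v10_b]))
    s01 = np.cov(np.vstack([v01_a, v01_b]))
    cov = s10 / m + s01 / n
    var_diff = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]
    z = (auc_a - auc_b) / math.sqrt(var_diff)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return float(auc_a - auc_b), float(z), p
```

The variance of the difference is zero when both scores separate cases perfectly, so the function is only meaningful for imperfect scores such as those evaluated here.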

The cut-off point of each risk score for identifying incident T2DM was the value that maximised the Youden Index, that is, the point on the ROC curve farthest above the diagonal (chance) line.
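A minimal sketch of Youden-based cut-off selection for an integer score, assuming that a participant is screen-positive when the score exceeds the cut-off (as in ">14"). Each observed score value is tried as a candidate; ties keep the lowest cut-off. Note that maximising the Youden Index is not the same criterion as minimising the distance to the upper-left corner of the ROC plot, and the two can select different cut-offs. Names and data are hypothetical.

```python
from typing import Sequence, Tuple


def youden_cutoff(y: Sequence[int], score: Sequence[int]) -> Tuple[int, float]:
    """Return (c, J): cut-off c maximising J when 'score > c' is screen-positive."""
    cases = [s for s, d in zip(score, y) if d]
    controls = [s for s, d in zip(score, y) if not d]
    if not cases or not controls:
        raise ValueError("need both incident cases and non-cases")
    best_c, best_j = None, -1.0
    for c in sorted(set(score)):
        sens = sum(s > c for s in cases) / len(cases)
        spec = sum(s <= c for s in controls) / len(controls)
        j = sens + spec - 1
        if j > best_j:  # strict comparison keeps the lowest cut-off on ties
            best_c, best_j = c, j
    return best_c, best_j


# Toy data: 4 non-cases followed by 4 incident cases
print(youden_cutoff([0, 0, 0, 0, 1, 1, 1, 1], [5, 8, 10, 14, 16, 15, 18, 20]))
# (14, 1.0)
```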

Patient and public involvement

It was not appropriate or possible to involve patients or the public in the design, or conduct, or reporting, or dissemination plans of our research.

Results

Characteristics of the study population

Of the original 1426 participants, 1344 (94.2%) met the criteria for follow-up, and 1242 (92.4%) of these could be followed up via EHRs or telephone interviews. The reasons for exclusion are summarised in figure 1. The main characteristics of the participants stratified by sex are shown in table 1. At baseline, the mean age of the study population was 62 years. Almost one-third (31.6%) had a family history of DM, and the prevalence of cardiovascular disease was low (coronary artery disease, 3.1%; stroke, 2%; peripheral artery disease, 0.8%). About one-third of the population met the criteria for each of current smoking, arterial hypertension and metabolic syndrome, and approximately 50% reported consuming ≥1 unit of alcohol per week. Approximately half of the participants had dyslipidaemia, and one in five rated their health as fair or poor. The percentage of participants with a positive PHQ-9 (score ≥10) was 11.6%, and this percentage was significantly higher among women. In terms of current treatment, nearly one in four participants were taking statins, and one in five were taking renin-angiotensin system blockers.

Table 1

Baseline characteristics of the population studied stratified by sex

Incidence of T2DM

During 7.3 years (median) of follow-up, 104 participants (8.4%; 95% CI, 6.8% to 9.9%) developed T2DM. Table 2 shows the differences between participants with and without incident T2DM for the main characteristics examined. The risk factors for which values were significantly higher in the group with incident T2DM were hypertension, metabolic syndrome, BMI, waist circumference, systolic and diastolic blood pressures, fasting plasma glucose, OGTT result, HbA1c, impaired glucose tolerance, self-administered PHQ-9 score and diabetes risk scores (FINDRISC, DESIR and ADA).
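The reported 95% CI for cumulative incidence (104 of 1242 participants) can be reproduced with a normal-approximation (Wald) interval. The method is not stated in the text, so this is an assumption; the function name is hypothetical.

```python
import math


def proportion_wald_ci(events: int, n: int, z: float = 1.96):
    """Cumulative incidence with a normal-approximation (Wald) 95% CI."""
    p = events / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


p, lo, hi = proportion_wald_ci(104, 1242)
print(f"{p:.1%} (95% CI, {lo:.1%} to {hi:.1%})")  # 8.4% (95% CI, 6.8% to 9.9%)
```

For small numbers of events, a Wilson or exact binomial interval would be preferable to the Wald interval.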

Table 2

Baseline characteristics of the study population stratified by incidence of diabetes/no diabetes (median 7.3 years of follow-up)

Participants treated with renin-angiotensin system blockers were significantly more likely to be in the incident diabetes group.

Diabetes risk scores usually include questions about lifestyle, diet and medical history. Online supplemental table 1 shows the differences between the two groups. The group without incident T2DM was more likely than the group with incident T2DM to perform at least 30 min of physical activity, to eat vegetables, fruit or berries every day, to have never taken medication for high blood pressure, to have never been diagnosed with high blood sugar and to have never had gestational diabetes.


Performance of diabetes risk scores

The performance of the FINDRISC score is shown in table 3. The best cut-off point was >14, achieving a sensitivity of 47.12% (95% CI, 37.2% to 57.2%), specificity of 81.37% (95% CI, 79% to 83.6%) and positive likelihood ratio of 2.53 (95% CI, 2.0 to 3.21). The AUROC was 0.68 (95% CI, 0.65 to 0.71).

Table 3

Performance of the FINDRISC diabetes risk score in predicting incident diabetes mellitus after 7.3 years (median) of follow-up

The FINDRISC-MOOD diabetes risk score, that is, the original FINDRISC plus five points if PHQ-9 ≥10, had the same optimal cut-off point as FINDRISC (>14), although sensitivity and specificity were more balanced (56.7% and 76.7%, respectively) (table 4). The negative predictive value was 95.08%, slightly higher than that of the original FINDRISC score, and the AUROC increased to 0.70 (95% CI, 0.67 to 0.72).

Table 4

Performance of the FINDRISC-MOOD questionnaire in predicting incident diabetes mellitus after 7.3 years (median) of follow-up

Finally, the performance of DESIR and ADA is shown in online supplemental table 2. The best cut-off points were >12 and >5, respectively. The AUROCs were almost identical: 0.66 (95% CI, 0.63 to 0.68) for DESIR and 0.661 (95% CI, 0.63 to 0.69) for ADA.

The results of the pairwise comparisons of the AUROCs are provided in online supplemental table 3. None of the differences was statistically significant; the largest was between the FINDRISC-MOOD and DESIR scores (z=1.841, p=0.0657).

Mortality

There were 24 deaths during follow-up, that is, a crude mortality rate of 2.57 (95% CI, 1.64 to 3.82) per 1000 person-years. When stratified by FINDRISC score, those with a FINDRISC ≤14 had a crude mortality rate of 2.44 (95% CI, 1.45 to 3.86) per 1000 person-years and those with a score >14 had a crude mortality rate of 3.03 (95% CI, 1.11 to 6.61) per 1000 person-years. The rate ratio was 1.24 (95% CI, 0.41 to 3.27), p=0.628.
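The mortality rates above are reported with CIs that are asymmetric around the point estimate, which is consistent with exact (Garwood) Poisson limits; the method is not stated in the text, so this is an assumption. The standard-library sketch below finds the limits by bisection on the Poisson distribution function. Person-years are not reported in this section, so the usage example uses a hypothetical 10 000 person-years; all names are hypothetical.

```python
import math


def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def _bisect(f, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Root of an increasing function f on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_ci(events: int, alpha: float = 0.05):
    """Exact (Garwood) confidence limits for a Poisson count."""
    bracket = events + 20 * math.sqrt(events + 1) + 20
    if events == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= events | mu) = alpha/2 (increasing in mu)
        lower = _bisect(lambda mu: 1 - poisson_cdf(events - 1, mu) - alpha / 2,
                        0.0, float(events))
    # Upper limit: P(X <= events | mu) = alpha/2 (cdf decreases in mu)
    upper = _bisect(lambda mu: alpha / 2 - poisson_cdf(events, mu),
                    float(events), bracket)
    return lower, upper


def rate_per_1000(events: int, person_years: float, alpha: float = 0.05):
    """Crude rate and exact CI, per 1000 person-years."""
    lo, hi = exact_poisson_ci(events, alpha)
    scale = 1000 / person_years
    return events * scale, lo * scale, hi * scale


print(exact_poisson_ci(24))        # limits on the count, roughly 15.4 to 35.7
print(rate_per_1000(24, 10_000))   # hypothetical person-years
```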

The crude mortality rate among PHQ-9 negative participants was 2.42 (95% CI, 1.48 to 3.73) per 1000 person-years, and that among PHQ-9 positive participants was 3.70 (95% CI, 1.01 to 9.48) per 1000 person-years. The rate ratio between the two groups was 1.53 (95% CI, 0.38 to 4.57), p=0.434.

Similar results were observed with FINDRISC-MOOD. Those with a score ≤14 had a crude mortality rate of 2.32 (95% CI, 1.32 to 3.76) per 1000 person-years, and those with a score >14 had a crude mortality rate of 3.27 (95% CI, 1.41 to 6.45) per 1000 person-years. The rate ratio was 1.41 (95% CI, 0.52 to 3.50), p=0.427. Thus, the crude mortality rate among participants with a FINDRISC-MOOD score >14 was slightly higher than among those with a traditional FINDRISC score >14 (3.27 vs 3.03 per 1000 person-years), possibly reflecting the higher mortality we observed in PHQ-9 positive individuals (online supplemental figure 1), although none of these rate ratios was statistically significant.

Discussion

It is widely accepted that no prediction rule is perfect. When questionnaires are used to predict future diabetes, a sensitivity of at least 80% (ie, a false-negative rate of no more than 20%) is an adequate result. This approach is reasonable because diabetes risk scores are cost-effective and can be administered in clinical or community settings.

A FINDRISC cut-off of ≥9, the threshold used for the original score, showed this high sensitivity (80%). However, high sensitivity can come at the cost of low specificity, that is, a higher rate of false positives. Such misclassification can lead to unnecessary pharmacological treatment and psychological harm to patients, so laboratory measurements to identify false positives are essential for questionnaires with low specificity. The Youden Index identified >14 as an adequate cut-off for the original FINDRISC, maintaining an appropriate balance between sensitivity and specificity.

To compare the discriminatory ability of each diabetes risk score, we calculated the AUROC. Several authors have attempted to improve the AUROC of predictive risk scores by adding laboratory measures such as fasting plasma glucose,25 26 HbA1c alone27 and HbA1c plus fasting plasma glucose.28 This strategy improves discriminative ability but undermines the goal of rapid, non-invasive prediction using easily measured variables. Diabetes risk scores can instead be considered a pre-screening tool to identify individuals who would benefit from measurement of fasting plasma glucose or HbA1c. This approach represents a relatively cost-effective screening programme, as highlighted by the International Diabetes Federation.29

Our strategy to improve the AUROC of the original FINDRISC was to add five points when the PHQ-9 score was ≥10.21 However, the resulting AUROC was not significantly different from that of the original FINDRISC score. Nevertheless, the AUROC of FINDRISC-MOOD reached 0.70, often considered the minimum value for a test to provide sufficient discrimination. The p value for the comparison of AUROCs was below 0.10 between FINDRISC-MOOD and ADA (difference between areas=0.0374; p=0.0808) and close to 0.05 between FINDRISC-MOOD and DESIR (difference between areas=0.0413; p=0.0657).

Our study was not designed to detect small differences between AUROCs; a larger sample size would have been required to do so. Even if such a difference were demonstrated, the question would remain whether the PHQ-9 can be used in conjunction with FINDRISC in routine primary care. The PHQ-9 is self-administered and does not require much time. This is particularly relevant because depression should be screened for regularly, as recommended by the US Preventive Services Task Force,30 and the ADA recommends screening for depression in patients with diabetes.31 Depression is associated with up to a 65% increased risk of DM,32 33 which supports adding the PHQ-9 when assessing DM risk. Moreover, the PHQ-9 can be considered a first-line tool for diagnosing depression in primary care settings owing to its ease of administration, good acceptability and sensitivity in detecting depression.30

Multiple authors have created predictive models for diabetes using variables commonly included in most diabetes risk questionnaires. Atlantis et al developed one such model that also accounted for a high risk of anxiety and affective disorders by including the Kessler Psychological Distress Scale (K10). They compared two models, one with the K10 and one without, and found that the differences were of low magnitude, although they did not use hypothesis testing to compare the two AUROC values.8 The AUROC of their model was similar to that of FINDRISC-MOOD. Our study took a different approach: we tested the performance of three diabetes risk scores and found that adding five points to the original FINDRISC score when the PHQ-9 score was ≥10 did not significantly improve the AUROC.

Owing to differences in lifestyle and in the prevalence of chronic diseases such as obesity, pre-diabetes and diabetes between communities and ethnic groups, it is often necessary to calculate the optimal cut-off point of diabetes risk scores for each country, or even for each state or region. From a statistical point of view, the best cut-off point is the one that achieves the highest Youden Index, although, as mentioned above, this strategy can be modified depending on the purpose of screening. According to the highest Youden Index, the best cut-off points for the FINDRISC and FINDRISC-MOOD questionnaires are the same (>14).

Cut-offs vary over time and between study sites. In Europe, the original FINDRISC, developed by Lindström and Tuomilehto in Finland,1 showed an optimal cut-off of ≥9 in two consecutive cohorts after 10 years of follow-up, with the following performance (Youden Index, sensitivity, specificity and AUROC): 0.59, 78%, 81% and 0.85 in the 1987 cohort and 0.53, 77%, 76% and 0.87 in the 1992 cohort. In the Norwegian general population aged over 20 years, the best FINDRISC cut-off according to the highest Youden Index was ≥11 (sensitivity, 73%; specificity, 67%) after 10 years of follow-up.

The study by Alssema et al, with participants from the HOORN34 35 (n=1434), PREVEND36 37 (n=2713) and MORGEN38 39 (n=863) cohorts in the Netherlands, included six variables from FINDRISC (age, BMI, waist circumference, use of blood pressure medication, history of high blood glucose and family history of DM). The total score of this six-variable version ranged from 0 to 22, and two acceptable cut-offs were identified for each cohort. Values (Youden Index, sensitivity and specificity) were as follows: HOORN cohort, 0.28, 52% and 76% for a cut-off ≥10 and 0.26, 84% and 42% for a cut-off ≥7; PREVEND cohort, 0.28, 43% and 85% for a cut-off ≥10 and 0.42, 78% and 64% for a cut-off ≥7; MORGEN cohort, 0.30, 47% and 83% for a cut-off ≥10 and 0.27, 75% and 52% for a cut-off ≥7.40 The second study by Alssema et al41 included the variables used in the earlier (2008) study plus sex and smoking, with an optimal cut-off of ≥7 (Youden Index, 0.39; sensitivity, 76%; specificity, 63%).

In Spain, a population-based prospective study performed in the town of Pizarra (Málaga) followed 824 individuals for 6 years to evaluate the performance of FINDRISC. The best prediction of the risk of incident T2DM was found in subjects with a FINDRISC cut-off of 9 with an AUROC of 0.75. No information was provided on sensitivity, specificity or the Youden Index.42

Unlike most of these studies, the Pizarra study reported an AUROC closer to ours, probably owing to a common lifestyle pattern and to the progressive decrease in discriminative ability that is relatively common in contemporary studies.

Our study has several limitations. First, incident DM was not ascertained by the same method as at baseline (OGTT). However, because we used four sources of data (self-reported diagnosis, diagnosis by the general practitioner, fasting plasma glucose levels and use of hypoglycaemic medications), consistent with the approach of other authors,43 44 substantial information bias is unlikely. Second, FINDRISC estimates the risk of developing T2DM within 10 years in people aged 35–64 years, whereas our study included some participants aged 65 years or older and had a median follow-up of 7.3 years. This may have affected the accuracy of the results. Third, the diabetes risk scores were assessed in a representative population from the north of the city of Madrid. Lifestyle, diet and prevalence of obesity may differ in the rest of the city, given the lower gross domestic product in the southern area, which may limit extrapolation to those areas or to other Spanish cities. Finally, we did not have an exact date of diagnosis for the self-reported cases, which limited our ability to perform time-to-event analysis.

Our study also has important strengths. The study population reflects the average risk of DM and does not include a large number of people at high risk of developing T2DM, unlike other studies conducted in hospital samples,27 45 patients with chronic infectious diseases46 and regions with a high prevalence of sedentary lifestyle and obesity.27 Another strength is that FINDRISC-MOOD, our extension of the original FINDRISC, is an easy-to-use tool that does not require additional investment in health professionals, as the PHQ-9 is typically self-administered.47

Acknowledgments

We would like to thank all members of the SPREDIA-2 Group: Fernando Laguna (Hospital Carlos III), Pedro Fernández-García (Hospital Carlos III), Luis Montesano-Sánchez (Hospital Carlos III), Pedro Patrón (Hospital Carlos III), Leopoldo Pérez-Isla (Hospital Clínico de San Carlos), David Vicent (Hospital Carlos III), Ignacio Vicente (CS Monóvar), Sara Artola (CS Ma Jesús Hereza), Ma Isabel GranadosMenéndez (CS Monóvar), Domingo Beamud-Victoria (CS Felipe II), Isidoro Dujovne-Kohan (CS Los Castillos), Rosa María Chico-Moraleja (Hospital Central de la Defensa), Carmen Martín-Madrazo (CS Monóvar), Rosario Echegoyen de Nicolás (CS Benita de Ávila), Concepción Aguilera Linde (CS Ciudad Periodistas), Álvaro R Aguirre De Carcer Escolano (CS La Ventilla), Patricio Alonso Sacristán (CS Ciudad Periodistas), M Jesús Álvarez Otero (CS Dr Castroviejo), Paloma Arribas Pérez (CS Santa Hortensia), Maria Luisa Asensio Ruiz (CS Fuentelarreina), Pablo Astorga Díaz (CS Barrio Pilar), Begoña Berriatua Ena (CS Dr Castroviejo), Ana Isabel Bezos Varela (CS José Marva), María José Calatrava Triguero (CS Ciudad Jardín), Carlos Casanova García (CS Barrio Pilar), Ángeles Conde Llorente (CS Barrio Pilar), Concepción Díaz Laso (CS Fuentelarreina), Emilia Elviro García (CS Ciudad Periodistas), Orlando Enríquez Dueñas (CS Fuentelarreina), María Isabel Ferrer Zapata (CS El Greco), Froilán Antuña (CS Ciudad Periodistas), Maria Isabel García Lazaro (CS Ciudad Periodistas), Maria Teresa Gómez Rodríguez (CS Barrio Pilar), África Gómez Lucena (CS La Ventilla), Francisco Herrero Hernández (CS La Ventilla), Rosa Julián Viñals (CS Dr Castroviejo), Gerardo López Ruiz Ogarrio “in memoriam” (CS Barrio Pilar), Maria Del Carmen Lumbreras Manzano (CS José Marva), Sonsoles Paloma Luquero López (CS Ciudad Periodistas), Ana Martínez Cabrera Peláez (CS Barrio Pilar), Montserrat Nieto Candenas (CS La Ventilla), María Alejandra Rabanal Carrera (CS Barrio Pilar), Ángel Castellanos Rodríguez (CS Ciudad Periodistas), Ana López Castellanos (CS La Ventilla), Milagros Velázquez García (CS Barrio Pilar) and Margarita Ruiz Pacheco (CS Dr. Castroviejo).

This post was originally published on https://bmjopen.bmj.com