Introduction
Generation Scotland (GS) is a longitudinal health study established as a family-based and population-based resource for the study of the genetic, lifestyle and environmental determinants of common complex diseases. Non-communicable diseases, such as cancer, diabetes, stroke, heart, liver and lung disease, are the leading cause of morbidity and mortality in Scotland.1 2 The majority of common health disorders of public health concern are a result of a complex interaction between genes and environment. The GS cohort is rich in genetic and phenotypic information through data collection, sample assays and linkage to routine electronic health records. As a bio-resource for medical research, GS aims to support research to establish the determinants of physical and mental health and improve the prevention, diagnosis and treatment of common diseases.
GS was founded as a multi-institutional, cross-disciplinary collaboration between the Universities of Aberdeen, Dundee, Edinburgh and Glasgow and the National Health Service (NHS) Scotland, with key resources, expertise and input from the Medical Research Council Human Genetics Unit, the National eScience Centre and the Scottish School of Primary Care. A list of current staff working on developing and maintaining the GS resource and members of the scientific steering committee is provided in online supplemental appendix A.
Between 2006 and 2011, over 24 000 adult volunteer participants completed questionnaires, attended a clinic visit or participated by post, and consented to genetic studies, linkage to their medical records and to be recontacted for future research.3
In 2022, GS launched Next Generation Scotland (NGS), aiming to expand the existing cohort by recruiting 20 000 new participants, newly including 12–17 year olds, meeting an unmet need to study adolescent health. Data are collected via an online questionnaire and saliva sample collection by post.
An earlier cohort profile paper described the baseline recruitment.4 Here, we report data enrichment of the cohort, including new biological data, longitudinal data linkage and recontact studies. We highlight the extent and nature of the data now available to researchers, summarise the use and impact of GS since commencement in 2006 and outline the current and future plans for NGS.
Participant recruitment
The original GS:Scottish Family Health Study (SFHS) protocol and baseline data profile have been described previously.3 4 Briefly, potential participants aged 35–65 years (study probands) were selected from lists of collaborating general medical practices in Scotland. They were invited to participate in the study and asked to identify at least one adult (18+ years old) first-degree relative to invite to the study. This included volunteers from the Glasgow and Tayside areas from 2006 to 2011 and was extended to include Ayrshire, Arran and Grampian in 2010. Participants completed a Pre-Clinic Questionnaire (PCQ) before attending a research clinic in Glasgow, Dundee, Perth, Aberdeen or Kilmarnock. In total, 126 000 individuals were invited to participate, of whom 6665 responded and met the study criteria (response rate of 5.3%). An additional 17 419 family members were recruited via these probands. The original GS:SFHS cohort therefore consists of 24 084 participants.
Baseline data collection
All 24 084 participants completed a PCQ collecting a range of demographic, social characteristics, personal behaviours and self-reported health data. Information collected included smoking status, alcohol consumption and personal and family disease history. Information was also collected on the birthplaces, by local council area, of participants and their parents and grandparents born in Scotland. In 2009, revisions were made to several questions within the PCQ for machine readability; the period prior to this was termed phase 1 (n=9967) and the period thereafter was termed phase 2 (n=14 117).
Most participants (21 476) also attended a research clinic, where physical measurements included height, weight, heart rate, systolic and diastolic blood pressure, ECG and body composition analysis. Standardised and well-validated assessments of cognitive function, personality and mental health included the 28-item General Health Questionnaire, Eysenck Personality Questionnaire, Structured Clinical Interview for DSM Disorders, Mood Disorder Questionnaire, Schizotypal Personality Questionnaire, Mill Hill Vocabulary test and WAIS-III logical memory test. All baseline measures collected are reported in the previous cohort profile.4
Cohort characteristics at recruitment
Of the 24 084 participants within the original GS:SFHS cohort, approximately 14 209 (59%) are female, with an average age at recruitment of 49 years. In total, 87% (approximately 20 953) of participants were born in Scotland and 97% (approximately 23 361) born within the UK. The cohort includes a range of sociodemographic characteristics, although compared with the Scottish population participants have a higher education level and lower deprivation index (table 1).
Family groups of at least two first-degree relatives were identified and assigned a shared family identity number. Pedigrees were constructed using relationship information provided by study participants and validated with genetic kinship information following genotyping. The cohort contained 1361 singletons (with no relatives in the study) and 5501 families of at least 2 people, with a mean size of 4.1 family members.
Longitudinal data linkage
Linkage to extensive and longstanding NHS Scotland records, both retrospective and prospective, creates a longitudinal cohort from baseline. The linkage data available within GS have not previously been described. At the time of writing, participants have up to 16-year follow-up data since recruitment in 2006. Routine NHS data are obtained through collaboration with the Health Informatics Centre at the University of Dundee, with linkage performed using the Community Health Index (CHI) number. CHI numbers are used across NHS Scotland services and are unique to each general practice (GP)-registered individual living in Scotland. In total, 93% of GS participants consented to linkage and had a CHI number available. For individuals with CHI linkage, 89% also have genome-wide genotype data available (see Laboratory samples and molecular assays below).
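Conceptually, the linkage described above is a keyed join on the CHI number, restricted to participants who consented and have a CHI number available. The Python sketch below illustrates that idea only. The record layouts, field names (`study_id`, `chi`, `consented_to_linkage`, `dataset`) and all values are hypothetical, and the sketch is not a description of the GS or Health Informatics Centre linkage process.

```python
from collections import defaultdict


def link_by_chi(participants, routine_records):
    """Attach routine records to linkable participants, keyed on CHI number.

    Both inputs are lists of dicts with hypothetical field names.
    """
    # Index routine records by CHI so each participant lookup is a dict access.
    by_chi = defaultdict(list)
    for record in routine_records:
        by_chi[record["chi"]].append(record)

    linked = {}
    for person in participants:
        # Only participants who consented and have a CHI number are linkable.
        if person["consented_to_linkage"] and person["chi"] is not None:
            linked[person["study_id"]] = by_chi.get(person["chi"], [])
    return linked


# Toy, made-up data (identifiers are placeholders, not real CHI numbers).
participants = [
    {"study_id": "GS001", "chi": "chi-A", "consented_to_linkage": True},
    {"study_id": "GS002", "chi": "chi-B", "consented_to_linkage": False},
    {"study_id": "GS003", "chi": None, "consented_to_linkage": True},
]
routine_records = [
    {"chi": "chi-A", "dataset": "SMR01", "year": 2012},
    {"chi": "chi-A", "dataset": "PIS", "year": 2015},
    {"chi": "chi-B", "dataset": "SMR01", "year": 2010},
]
linked = link_by_chi(participants, routine_records)
```

In this toy example only the first participant is linkable, so records for the non-consenting participant are never attached even though a matching CHI exists in the routine data.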
Table 2 and figure 1 show the range of datasets linked to the GS cohort, their periods of coverage and the numbers of participants with linked data available for each. Additional details are provided in online supplemental appendix B. Beyond linkages to hospital episodes, primary care, cancer and death registries and community electronic prescribing, GS has linkage to a range of other datasets via participants’ CHI numbers, including routine laboratory tests, dental data (from the Management Information & Dental Accounting System) and the Scottish Drug Misuse Database, offering unique phenotype information distinct from other population-based cohort research resources. COVID-19 testing, diagnoses and vaccination records are also available for the period of 2020–2022.
Regular data refreshes are received, and new datasets are added to enhance and continue the follow-up of participants over time. Planned additional linkages include incorporating NHS Scotland routine NHS radiology images, including X-rays, CT and MRI scans (Scottish Medical Imaging), imaging reports and retinal scans, which will provide new research opportunities not available in other population-based cohorts. Text-based radiology report linkage has already been applied to a study of stroke phenotyping in GS participants.5
Cohort morbidities
Participant self-reported disease prevalence (at recruitment) is shown in table 3 alongside longitudinal data on morbidities obtained through data linkage to primary care (GP) data, Scottish Morbidity Records (SMR) and National Records of Scotland death records. ICD and Read Codes to define disease prevalence were derived from Gadd et al6 using CALIBER code lists, detailed in full in online supplemental appendix tables C–E. Data are available up to 2020 for GP records and cancer registries and up to 2022 for hospital admissions (SMR01) and mortality records. Diagnoses of 3006 hypertension, 2197 asthma, 2371 depression, 2558 osteoarthritis and 1701 heart disease cases are reported across all primary and secondary care and mortality linked data sources.
As an extended example, figure 2 shows the proportion of diabetes cases captured in secondary care and death records as described above and enhanced with additional data sources available within GS (the Scottish Care Information - Diabetes Collaboration (SCI-DC), the Prescribing Information System (PIS) and routine laboratory testing data). A total of 1861 diabetes (types 1 and 2) cases were recorded in at least one source, giving a cohort prevalence of 7.7%. The SCI-DC captures 74% of all recorded cases. Prescriptions of metformin hydrochloride or insulin within the PIS captured 73% of diabetes cases. Linked routine laboratory testing data contained results for any glycated haemoglobin (HbA1c) tests conducted as an indication of average blood sugar levels. Individuals with an HbA1c level above 6.5% (48 mmol/mol) were classified as diabetic (28% of cases captured).7 We note that the lower proportion of cases within the self-reported source reflects that these were collected at baseline, while other sources extend to 2020/2022. The use of a combination of data sources provides an opportunity to capture a range of cases and develop detailed phenotype definitions.
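The multi-source approach above can be sketched as a set union across sources, with each source's capture expressed as a proportion of all cases found in at least one source. The Python below is a minimal illustration under stated assumptions: the function names, source labels and participant identifiers are hypothetical, and only the HbA1c threshold (48 mmol/mol) and the reported totals (1861 cases among 24 084 participants) come from the text.

```python
HBA1C_THRESHOLD_MMOL_MOL = 48  # equivalent to 6.5%, as stated in the text


def hba1c_indicates_diabetes(hba1c_mmol_mol):
    """Classify one HbA1c result; values above the threshold count as diabetic."""
    return hba1c_mmol_mol > HBA1C_THRESHOLD_MMOL_MOL


def capture_by_source(cases_by_source):
    """Return (all_cases, proportions), where all_cases is the union of every
    source and proportions maps each source to its share of all_cases."""
    all_cases = set().union(*cases_by_source.values())
    if not all_cases:
        return all_cases, {source: 0.0 for source in cases_by_source}
    proportions = {
        source: len(ids) / len(all_cases)
        for source, ids in cases_by_source.items()
    }
    return all_cases, proportions


# Toy, made-up participant identifiers for three hypothetical sources.
toy_sources = {
    "register": {1, 2, 3},
    "prescribing": {2, 3, 4},
    "laboratory": {4},
}
all_cases, proportions = capture_by_source(toy_sources)

# Cohort prevalence from the totals reported in the text.
prevalence_percent = round(100 * 1861 / 24084, 1)
```

Because a case can appear in several sources, the per-source proportions are expected to sum to more than 100%, as they do for the figures reported above.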
Laboratory samples and molecular assays
Participants who attended a research clinic also provided biological samples (including blood and urine) for genotyping and other assays (n=23 979). Saliva was provided for DNA extraction by a subset of participants not attending a clinic (2608 sent a saliva sample by post) and was used for DNA extraction for an additional 984 participants from whom blood could not be obtained (total 3592). DNA was extracted from blood and saliva for 85% of participants (n=20 471). Basic biochemistry assays were performed on the baseline serum samples measuring creatinine, glucose, potassium, sodium, urea and cholesterol levels. Here, we provide an update on the genotyping methods conducted since baseline collection.
Genomics
Genome-wide genotyping data are available for 20 026 (83%) of the original GS:SFHS participants.4 Samples were genotyped using the Illumina HumanOmniExpressExome-8V.1-2_A and HumanOmniExpressExome-8V.1_A arrays and the Beadstudio-Gencall V.3 genotype calling algorithm. Quality control measures were implemented, filtering out samples with a call rate of <98% and single-nucleotide polymorphisms (SNPs) with a call rate of <98%, Hardy-Weinberg equilibrium (HWE) p value of <1×10⁻⁶ or minor allele frequency (MAF) of ≤1%, leaving 20 026 samples and 630 207 SNPs. Phasing of the genotyped SNPs was carried out using SHAPEIT V.2.8
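The quality control thresholds above can be expressed as simple predicates over per-sample and per-SNP summary statistics. The Python sketch below assumes those statistics have already been computed by a genotyping toolkit; the field names (`call_rate`, `hwe_p`, `maf`) and the toy values are hypothetical, and this is not the pipeline used to produce the GS data.

```python
SAMPLE_CALL_RATE_MIN = 0.98  # samples below this are filtered out
SNP_CALL_RATE_MIN = 0.98     # SNPs below this are filtered out
HWE_P_MIN = 1e-6             # SNPs with HWE p value below this are filtered out
MAF_MIN = 0.01               # SNPs with MAF at or below this are filtered out


def sample_passes(sample):
    """Keep a sample only if its call rate is at least 98%."""
    return sample["call_rate"] >= SAMPLE_CALL_RATE_MIN


def snp_passes(snp):
    """Keep a SNP only if it clears all three thresholds from the text."""
    return (
        snp["call_rate"] >= SNP_CALL_RATE_MIN
        and snp["hwe_p"] >= HWE_P_MIN
        and snp["maf"] > MAF_MIN
    )


# Toy, made-up summary statistics.
toy_snps = [
    {"id": "snp_ok", "call_rate": 0.995, "hwe_p": 0.30, "maf": 0.12},
    {"id": "snp_low_call", "call_rate": 0.97, "hwe_p": 0.30, "maf": 0.12},
    {"id": "snp_hwe", "call_rate": 0.995, "hwe_p": 1e-9, "maf": 0.12},
    {"id": "snp_rare", "call_rate": 0.995, "hwe_p": 0.30, "maf": 0.01},
]
kept_snp_ids = [snp["id"] for snp in toy_snps if snp_passes(snp)]
```

Note the boundary handling: a MAF of exactly 1% is removed (the text specifies ≤1%), whereas a call rate of exactly 98% is retained (the text specifies <98%).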
Genetic profiles have been imputed using three different reference panels: 1000 Genomes,9 Haplotype Reference Consortium10 and Trans-Omics for Precision Medicine (table 4).11 After imputation, further quality control procedures removed duplicate and monomorphic SNPs as well as those with an imputation quality score of <0.4.
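The post-imputation filters above can be sketched in the same style. This Python example rests on several assumptions that the text does not specify: variants are treated as duplicates when they share chromosome, position and alleles, the first occurrence of a duplicated variant is the one considered, and a variant is monomorphic when its alternate allele frequency is exactly 0 or 1. The field names (`chrom`, `pos`, `ref`, `alt`, `alt_freq`, `info`) are hypothetical.

```python
INFO_MIN = 0.4  # imputation quality score threshold from the text


def post_imputation_filter(variants):
    """Drop duplicate, monomorphic and low-quality imputed variants."""
    seen = set()
    kept = []
    for variant in variants:
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
        if key in seen:
            continue  # duplicate of a variant already encountered
        seen.add(key)
        if variant["alt_freq"] in (0.0, 1.0):
            continue  # monomorphic: no variation among samples
        if variant["info"] < INFO_MIN:
            continue  # imputation quality score below 0.4
        kept.append(variant)
    return kept


# Toy, made-up variants.
toy_variants = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "alt_freq": 0.20, "info": 0.95},
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "alt_freq": 0.20, "info": 0.95},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "T", "alt_freq": 0.0, "info": 0.99},
    {"chrom": "2", "pos": 300, "ref": "G", "alt": "A", "alt_freq": 0.05, "info": 0.35},
    {"chrom": "2", "pos": 400, "ref": "T", "alt": "C", "alt_freq": 0.40, "info": 0.40},
]
kept_positions = [(v["chrom"], v["pos"]) for v in post_imputation_filter(toy_variants)]
```

A variant with a quality score of exactly 0.4 is retained here, since the text removes only those with a score of <0.4.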
Methylomics
DNA methylation (DNAm) data have been generated using the Illumina HumanMethylationEPIC BeadChip array for 18 869 GS samples at >850 000 CpG sites, from blood collected at the baseline appointment (2006–2011). At the time of writing, this is the largest DNAm dataset from a single population-based cohort. These samples were processed in four batches between 2017 and 2021 and are referred to as set 1 (n=5087), set 2 (n=459), set 3 (n=4450) and set 4 (n=8873). A subsequent genome-wide DNAm measurement is also available for 880 individuals across set 2 (n=508) and set 3 (n=372), from additional blood collected between 2015 and 2019. The DNAm resource will be described in detail in a separate report. Briefly, quality control was carried out in R using the packages ShinyMethyl and WateRmelon. Probes with a bead count of less than three or a high detection p value (>0.05) in more than 5% of samples were removed. Outlier probes were also removed based on visual inspection of the log median intensity of the methylated versus unmethylated signal per array. Samples were removed where there were sex mismatches or where 1% or more of cytosine–guanine dinucleotides had a high detection p value (>0.05). A superset of 18 869 baseline samples has also been generated from the four individual sets, comprising 831 733 CpGs that passed quality control in all sets.
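The probe-level and sample-level detection filters above can be sketched as follows. The authors carried out quality control in R, so this Python version is illustrative only and is not their pipeline. It assumes that the "more than 5% of samples" rule applies separately to the bead count and detection p value criteria, and the data layout (a dict mapping each probe to a list of per-sample values) and all identifiers are hypothetical. Sex-mismatch checks and the visual outlier inspection are not covered.

```python
DETP_MAX = 0.05              # detection p values above this count as high
MIN_BEADS = 3                # bead counts below this count as low
PROBE_FAIL_FRACTION = 0.05   # probe removed if more than 5% of samples fail
SAMPLE_FAIL_FRACTION = 0.01  # sample removed if 1% or more of CpGs fail


def failing_probes(detp, beads):
    """Return probes with low bead count or high detection p value in >5% of samples."""
    bad = set()
    for probe, pvals in detp.items():
        n_samples = len(pvals)
        frac_high_p = sum(p > DETP_MAX for p in pvals) / n_samples
        frac_low_beads = sum(b < MIN_BEADS for b in beads[probe]) / n_samples
        if frac_high_p > PROBE_FAIL_FRACTION or frac_low_beads > PROBE_FAIL_FRACTION:
            bad.add(probe)
    return bad


def failing_samples(detp, sample_ids):
    """Return samples in which 1% or more of probes have a high detection p value."""
    n_probes = len(detp)
    bad = set()
    for index, sample_id in enumerate(sample_ids):
        frac_high_p = sum(pvals[index] > DETP_MAX for pvals in detp.values()) / n_probes
        if frac_high_p >= SAMPLE_FAIL_FRACTION:
            bad.add(sample_id)
    return bad


# Toy, made-up matrices: three probes by three samples.
sample_ids = ["s1", "s2", "s3"]
detp = {
    "cg_a": [0.001, 0.001, 0.20],
    "cg_b": [0.001, 0.001, 0.001],
    "cg_c": [0.001, 0.001, 0.001],
}
beads = {
    "cg_a": [10, 10, 10],
    "cg_b": [2, 12, 12],
    "cg_c": [10, 10, 10],
}
bad_probes = failing_probes(detp, beads)
bad_samples = failing_samples(detp, sample_ids)
```

With so few probes and samples, a single failing value is enough to cross either threshold; on an array with more than 850 000 sites the same rules are far more tolerant of isolated failures.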
Proteomics and metabolomics
Protein levels have been quantified in plasma samples from 1065 participants using the 5k SOMAscan V.4 array from SomaLogic. Tandem mass spectrometry has been performed on a subset of 860 participants’ blood samples for which peripheral blood mononuclear cells were available. Quantification of 54 urinary metabolite biomarkers in 2743 GS participants’ samples has been conducted by Nightingale Health using nuclear magnetic resonance.
Recontact studies
Participants provided broad consent permitting use of data and samples for ‘future medical research into health, illness and medical treatment’. This included consent to be recontacted for new studies, which has led to additional data collections since recruitment, summarised in table 5. Data from recontact studies can be linked to GS data and are retained by GS to be made available for other researchers through the GS access process.
The Stratifying Resilience and Depression Longitudinally (STRADL) substudy recruited from the existing GS cohort to subtype major depressive disorder (MDD), using detailed clinical, cognitive and brain imaging assessments. From 2015 to 2017, 9905 GS participants completed a remote depression-focused questionnaire (including psychological resilience, coping style and response to psychological distress) and a subset (n=1189) attended a face-to-face assessment comprising cognitive testing, multimodal brain MRI (n=1085) and further biosample collection.12
In 2016, the DOLORisk study enhanced GS to study neuropathic pain (NP). The study received responses to a survey regarding presence or absence of chronic pain and NP from 7238 of 20 221 members of the GS cohort invited to participate (35.8% response rate), with a follow-up repeat survey (n=5292 responses) after 18 months (table 5).13
GS is a member of the European Prevention of Alzheimer’s Dementia Consortium, an interdisciplinary research initiative with partners across European organisations aiming to improve the understanding of the early stages of Alzheimer’s disease.14 In 2016, 53 GS participants attended a ‘screening visit’ for the collection of fasting blood samples and a brain scan (MRI) with follow-up visits after 6 months, 1, 2, 3 and 4 years.
GS is partnering with Healthy AGeing in Scotland (HAGIS), a study of the health, economic and social circumstances of people over 50 years old in Scotland.15 HAGIS is part of the Health & Retirement Study family of longitudinal ageing studies, which currently consists of longitudinal ageing studies in 16 countries around the world. GS recontacted 14 891 individuals in 2021–2022, with 2826 (19.0%) taking part in the HAGIS: COVID-19 Impact & Recovery Study.
Additional data collections conducted by the GS team include the COVIDLife surveys launched in April 2020 in response to the COVID-19 pandemic. The aim was to determine the impact of the pandemic on health and well-being. In total, 18 518 adult members of the UK public, including 4968 GS participants, participated in the surveys. Three COVIDLife surveys16 and a Rural COVIDLife survey,17 specific to rural Scottish volunteers, were conducted (total n=3365, GS participants n=712). In addition, three TeenCOVIDLife surveys,18 for young people aged 12–18 years (n=7058), were run between April 2020 and June 2021. GS was part of the National Core Studies Longitudinal Health and Wellbeing programme established as part of the UK’s pandemic response, including the coronavirus post-acute long-term effects: constructing an evidence base (CONVALESCENCE) long COVID study.19 GS is also a participating cohort in COVIDMENT, a large-scale collaborative project between Northern European countries using data-rich population-based registry resources, biobanks and ongoing questionnaire data to further understanding of the mental health impact of the COVID-19 pandemic.20
New recruitment and data collection plans
In 2019, funding was obtained from the Wellcome Trust to expand the GS cohort using remote data collection and extended eligibility to younger individuals (12+ years). Because of the COVID-19 outbreak in 2020, field studies other than those directly relating to the pandemic were paused. Active recruitment of new volunteers to join GS started in May 2022. Original GS cohort members have been contacted with the option to move online to complete new questionnaires and invite friends and family members to join the next phase of the study (snowball recruitment). Other recruitment methods to date have included: email invitations to Scotland-based participants of the COVIDLife study, news coverage (TV segments, radio, newspaper and online news articles), a paid TV advertisement and social media advertising.
NGS aims to recruit 20 000 new participants and will use established methods for linkage to routine NHS data to create a larger, richer, longitudinal resource. Anyone living in Scotland aged 12 years and over is eligible to join; those aged 12–15 years require parental confirmation of their capacity to consent. Participants sign up on our online portal and complete study consents and a baseline questionnaire that collects lifestyle measures and medical history.
Saliva samples are being collected by post for genotyping of new participants. At the time of writing, over 10 000 new participants have been recruited, adding to the 2006–2011 cohort recruits.
Adolescence and early adulthood are critical periods in the development of mental and physical health.21 The extension to younger individuals, along with potentially other family members, will make the cohort a valuable resource for research into genetic and environmental determinants of health among adolescents and young adults. There are few comparable genetic cohorts using routine data linkage in young people. Early approved studies are planned to focus on mental health, sleep and loneliness in this age group.
New questionnaires will be regularly added to the online portal to enable ongoing engagement with participants and collect enhanced data such as cognitive testing. Researchers will be able to submit approved research questions for prospective data collections. Through a broad range of recruitment strategies, participant involvement and engagement and the use of remote data collection, we hope to improve geographic coverage and sociodemographic diversity across Scotland, aiming to engage groups typically under-represented in large-scale studies. Completion of the expansion phase, combined with the original GS participants, should create an overall cohort of over 40 000 individuals across Scotland with rich genetic and phenotypic data.
Participant and patient involvement
A key component of GS:SFHS was a public consultation programme, which sought the public’s views on genetics in healthcare and research and used these to develop principles of participation and data access.22 23 Regular newsletters are distributed to participants to provide updates on the latest cohort information and recent findings. Patient and public involvement and engagement is being developed within the new NGS cohort recruitment. A survey receiving 1000 responses invited participants to become GS ambassadors in their local area, take part in focus groups and test new questionnaires. These volunteers have already helped with survey testing and provided feedback on recruitment materials. Development of a Young Persons Advisory Group has helped direct teen recruitment activities and will shape future GS health research with teenagers themselves.
Findings to date
The GS cohort has facilitated research contributions to a wide range of health conditions and scientific areas including ageing, cancer, cardiovascular disease, mental health and the role of DNAm in understanding and predicting disease. Over 350 papers have been published using GS data (figure 3). Online supplemental appendix F lists the 50 most cited papers using GS data, and a full and growing publication list can be found on the GS website (https://www.ed.ac.uk/generation-scotland/what-found/publications). Examples of some key contributions are summarised below.
Welsh et al used residual blood samples from GS participants (n=19 501) to assay cardiac troponin T (cTnT) and cardiac troponin I (cTnI), proteins essential for heart contraction, and to investigate their association with cardiovascular outcomes.24 The research team identified deaths or hospitalisations of interest using Scotland’s morbidity records and deaths data from GS recruitment to September 2017. They found that cTnT and cTnI were both associated with heart failure and cardiovascular disease death. Individuals with high levels of cTnT were more likely to suffer from heart disease, stroke or other heart conditions. Troponin level testing is inexpensive, and this study demonstrated the potential benefit of testing for future health screening.
Given its size, the GS DNAm resource is well placed to serve as a training dataset for the development of risk predictors. Cheng et al used GS methylation data to develop and validate a model for 10-year risk prediction of type 2 diabetes.25 They combined standard risk prediction information such as age, sex, body mass index and family history of the disease with DNAm data, which improved prediction for likelihood of developing diabetes. The results were tested using a hypothetical screening scenario of 10 000 people, which correctly classified an additional 449 individuals using methylation data compared with traditional risk factors alone.
Green et al investigated the aetiology of MDD among individuals from the STRADL recontact study.26 They reported the associations of serological and methylomic signatures of C reactive protein (considered to represent acute and chronic measures of inflammation, respectively) with depression status/symptoms and structural neuroimaging phenotypes. The study provided evidence for the involvement of peripheral inflammation in brain morphology and depression symptoms and demonstrated the combined use of survey, neuroimaging, serological and methylation data from the GS cohort.
GS facilitated a pilot study to investigate the use of newborn blood spots in longitudinal research. Heel prick blood spots are used routinely to test for treatable neonatal metabolic conditions and have been retained in Scotland for all children born since 1965. Researchers showed that archival blood spots contain enough information to link to the volunteer health records, and samples were of sufficient quality to generate biologically meaningful results.27 For example, epigenetic signatures of perinatal maternal smoking status could be identified. This pilot study confirmed the feasibility of the use of these archived newborn blood spots in a population-level retrospective birth cohort study. It has the potential to scale to a linked collection of 3 million archived blood spots across Scotland, making it one of only two such resources available worldwide.28 Future work is dependent on a Scottish government-led public consultation to review the current pause on research access.29
Strengths and limitations
Important strengths of GS are the breadth of demographic, lifestyle and health factors, and inclusion of participants from a wide range of sociodemographic backgrounds. The cohort is rich in genetic and linkage data. Scotland is ideally suited to a longitudinal cohort study given its comparatively static and stable population and relatively high prevalence of common conditions and adverse lifestyle risk factors.1 2 The family-based recruitment approach delivers increased kinship among participants and pedigree mapping enables measurement of heritability and familial aggregation of traits.
Linkage to a variety of routine NHS datasets creates a wealth of research opportunities, while participants’ consent for future recontact studies provides potential for additional data collections. Using linkage to gather longitudinal data makes the cohort more robust to attrition as passive linkage allows us to link to new data even if a participant does not take part in future data collections such as recontact studies. Planned linkages to routine NHS medical images, radiology reports and administrative data, such as education, income and benefits, will provide uniquely rich information about the participants and its relationship with future health and well-being, further enhancing the research potential of the cohort.
There are some limitations of the GS cohort. The cohort is relatively small, by contemporary standards for population-based cohorts, which can limit the statistical power to address some research questions definitively (eg, to study rare diseases or small effect sizes). However, this issue can often be addressed through joint analyses with other population-based cohorts and participation in genetic data consortia, which GS actively contributes to. The current expansion of the cohort will also help to address this limitation. Many phenotypes are assessed using self-reported measures which may be subject to recall or response bias. These potential biases are minimised in GS by using validated questionnaires applied widely in research and confirmation of outcomes through linkage to medical records. Compared with the Scottish population, individuals in the cohort are generally older, more likely to be female and less socially deprived. This may limit the power of research studies to pick up relationships with health outcomes and factors such as education/deprivation at the lowest ends of the scale. However, it is hoped that increased diversity of the cohort will be achieved with the current expansion to reach a total cohort size of over 40 000 individuals in Scotland. GS aims to be the UK’s largest multigenerational longitudinal life-course study of genetic, epigenetic, clinical, lifestyle and environmental health determinants.
Data access and collaborations
Researchers can submit proposals to access GS data and samples through our website (https://www.ed.ac.uk/generation-scotland/for-researchers). This also includes data from recontact studies which can be accessed through a single application to GS. Research proposals are subject to review by the GS access process, under the guidance of the scientific steering committee, based on criteria set out in the management, access and publications policy. We welcome proposals for data and sample access and for prospective data collections using the NGS online portal. Further information about the cohort, details of the application process and conditions for access is available at the study website.
GS also collaborates with—and makes its data and/or metadata available via—the Dementias Platform UK (DPUK), UK Longitudinal Linkage Collaboration (UK LLC), CLOSER, BC Platforms and Health Data Research UK (HDR UK) Innovation Gateway. Access requests can be made through DPUK and UK LLC using the standard GS access process as well as directly to GS. All applications via these platforms are reviewed by the GS access process. Study metadata is available through CLOSER Discovery, BC Platforms and HDR UK.
GS genetic data have contributed to large-scale consortia including Cohorts for Heart and Aging Research in Genomic Epidemiology,30 Chronic Kidney Disease Genetics,31 Genetic Investigation of ANthropometric Traits,32 SpiroMeta,33 Global Biobank Meta-analysis Initiative,34 COVID-19 Host Genetics Initiative,35 Global Lipids Genetics Consortium36 and The Psychiatric Genomics Consortium.37
This post was originally published on https://bmjopen.bmj.com