Discussion
In this individual-level analysis of a large prospective cohort, we developed a novel ML-based tool to predict the 10-year risk of incident CVD. Starting from a large pool of health-related variables, we applied a series of data-driven selection schemes and identified the 10 most important predictors. The proposed UKCRP model yielded an AUC of 0.762 for CVD, outperforming multiple existing clinical models. The UKCRP was well calibrated, with close agreement between predicted risks and observed event proportions. When applied to the prediction of MI and IS, it achieved comparable performance, but its performance was inferior for HS. Adding genetic information in the form of PRS did not significantly improve model discrimination. Our proposed risk tool is easy to implement in practice and could help optimise the identification of individuals at high risk to aid clinical decision-making.
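For readers less familiar with the discrimination metric quoted above, the AUC can be read as the probability that a randomly chosen participant who developed the outcome received a higher predicted risk than a randomly chosen participant who did not. The following is a minimal sketch of that pairwise definition, not the study's evaluation code; the function name `concordance_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def concordance_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the AUC.

    y_true holds 1 for participants with the event and 0 otherwise;
    y_score holds the predicted risks. Tied scores count as half a win.
    """
    cases = [s for y, s in zip(y_true, y_score) if y == 1]
    controls = [s for y, s in zip(y_true, y_score) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: two controls followed by two cases.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = concordance_auc(toy_labels, toy_scores)
```

The double loop is quadratic in the number of participants, so it serves only to illustrate the definition; for a cohort of this size a rank-based implementation would be needed in practice.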
When the model was applied to diagnostic subgroups, the UKCRP showed a pattern consistent with the existing AHA/ASCVD, FGCRS, QRISK V.3 and SCORE V.2 models: predictive ability was best for MI, followed by IS, and worst for HS. The descending incidence of MI, IS and HS may be partly responsible for this result. Given the heterogeneity of these disorders, distinct models with disease-specific training were indeed superior to the UKCRP model. In subgroup analysis, AUCs decreased gradually with each 5-year increase in age. This suggests that the association between risk factors and incident CVD may be stronger in younger people, as supported by recent publications.29–31 Although prediction accuracy declined as the prediction horizon lengthened, the AUC for 10-year CVD risk remained above 0.76. Accordingly, our model appears robust for predicting both short-term and long-term CVD risk. In line with former studies,32 33 we also found that adding PRS yielded only a trivial improvement in risk discrimination. This further highlights the real-world transportability of our proposed model, which could achieve good predictive performance using only routinely available parameters.
Overcoming the weakness of previous algorithms that incorporated only a few traditional predictors,5 7 10 11 34 our predictor selection pipeline identifies significant predictors from 645 variables. All of the top 10 predictors used for model development can be easily obtained through quick questionnaires or blood sampling, which offers the general population the opportunity for automated and rapid health screening. Advanced age and male sex are the two most critical risk factors, with a combined AUC of around 0.7. As previously reported,7 10 treated hypertension, SBP and the ratio of total cholesterol to high-density lipoprotein cholesterol played important roles in the prediction of CVD risk. In addition, our model included cholesterol medication. Considering that a subset of the study cohort may have already initiated preventive therapies (eg, statins or antihypertensive medication), incorporating drug usage could improve modelling accuracy.7 Taking multiple medications, often driven by the management of multiple comorbidities, is common in older CVD patients and has been linked to increased risk of CVD outcomes and adverse consequences such as disability, hospitalisation and death.35–37 Accordingly, deprescribing has become a growing focus in clinical settings to minimise tangible harm. Ascertaining the predictive value of prior angina or heart attack and of chest pain or discomfort is pivotal, as patients with these symptoms often seek emergency or outpatient assistance and have a greater willingness to engage in proactive risk factor management.38 Cystatin C is a predictor of cardiovascular risk39; however, it has rarely been adopted into previous predictive models.
Current smoking status has been frequently reported as a risk factor; in this study, we instead used pack-years of smoking, a derived variable calculated from the number of cigarettes smoked per day and the number of years smoked, which was found to be more sensitive than a binary smoking-status variable. Overall, the predictors derived from our data-driven pipeline have been validated by numerous studies, supporting the reliability of our model; however, this is the first time that these ten predictors have been combined to establish a CVD risk prediction model.
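As an illustration of the pack-years variable described above, the sketch below applies the conventional definition, in which one pack-year corresponds to smoking one pack per day for one year. The pack size of 20 cigarettes is an assumption taken from that convention rather than from the text, and the function name `pack_years` is hypothetical; the exact derivation used in the study data may differ.

```python
def pack_years(cigarettes_per_day: float,
               years_smoked: float,
               cigarettes_per_pack: int = 20) -> float:
    """Cumulative smoking exposure in pack-years.

    Assumes the conventional pack size of 20 cigarettes; this default
    is an illustrative assumption, not a value stated in the paper.
    """
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("inputs must be non-negative")
    if cigarettes_per_pack <= 0:
        raise ValueError("cigarettes_per_pack must be positive")
    packs_per_day = cigarettes_per_day / cigarettes_per_pack
    return packs_per_day * years_smoked


# Two hypothetical smokers who a binary status variable would treat identically.
light_smoker = pack_years(cigarettes_per_day=5, years_smoked=4)     # 1.0
heavy_smoker = pack_years(cigarettes_per_day=40, years_smoked=30)   # 60.0
```

The two hypothetical smokers share the same binary smoking status yet differ 60-fold in cumulative exposure, which is the kind of gradient a continuous measure can capture and a yes/no indicator cannot.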
The UKCRP model developed in this study can serve as a CVD prediction tool to identify individuals at suspected high risk who may benefit from effective preventive measures. Individuals with a higher CVD risk (eg, a 10-year risk >20%) require more aggressive risk factor interventions. Strategies may include maintaining cholesterol at a reasonable level, intensive blood pressure control, rational and standardised drug use, lowering cystatin C and smoking cessation. Moreover, our study revealed a heightened CVD risk in young adults. Because most young people at high risk tend to ignore potential health hazards, there is a need to raise their awareness of the condition and to encourage more rigorous interventions or treatments as early as possible to reduce the burden of CVD. Pending external validation, the UKCRP Score shows promise not only for helping physicians assess CVD risk and make appropriate clinical decisions but also for monitoring preventive or therapeutic effectiveness.
One notable strength of our study is that the included predictors were carefully selected from a comprehensive, multidimensional variable space, and the predictors used for modelling are easily accessible and have proven reliable. The LGBM algorithm we used fits very large datasets well and handles missingness and potential nonlinear interactions better than traditional Cox regression. The development of the UKCRP model was underpinned by thorough and extensive data of contemporary relevance to European populations, comprising over half a million participants with prolonged follow-up. These characteristics improve the accuracy, versatility and validity of the model.
Several caveats should be noted. First, UK Biobank participants have a lower CVD risk than the general primary care population. Prior to widespread implementation, the model needs to be recalibrated using related datasets such as the UK Clinical Practice Research Datalink. Second, although the UKCRP model was well calibrated across geographically distinct recruitment centres, its value in pragmatic clinical application should be verified in entirely independent prospective cohorts to ensure that such implementation does improve patient outcomes. Third, because the UK Biobank population is predominantly white, assessing the generalisability of the model across ancestrally distinct individuals will help to determine whether more appropriate and ethnically relevant decisions are required.