Development, validation and comparison of multivariable risk scores for prediction of total stroke and stroke types in Chinese adults: a prospective study of 0.5 million adults

Background and purpose: Low-income and middle-income countries have the greatest stroke burden, yet remain understudied. This study compared the utility of Framingham versus novel risk scores for prediction of total stroke and stroke types in Chinese adults.

Methods: China Kadoorie Biobank (CKB) is a prospective study of 512 726 adults, aged 30–79 years, recruited from 10 areas in China in 2004–2008. By 1 January 2018, 43 234 incident first stroke cases (36 310 ischaemic stroke (IS); 8865 haemorrhagic stroke (HS)) were recorded in 503 842 participants with no history of stroke at baseline. We compared the predictive utility of the Framingham Stroke Risk Profile (FSRP) with novel CKB stroke risk scores and included recalibration, refitting, stratifying by study area and addition of other risk factors. Discrimination was assessed using area under the receiver operating characteristic curve (AUC) and calibration was assessed using Greenwood-Nam-D’Agostino χ2 statistics.

Results: Incidence of total stroke varied fivefold by area in China. The FSRP had good discrimination for total stroke (AUC (95% CI); men: 0.78 (0.77 to 0.79), women: 0.77 (0.76 to 0.78)), but poor calibration (χ2; men: 1,825, women: 3,053), substantially underestimating absolute risks. Recalibration reduced χ2 by >80%, but did not improve discrimination. Refitting the FSRP did not materially improve discrimination, but further improved calibration. Stratification by area improved discrimination (AUC; men: 0.82 (0.82 to 0.83); women: 0.82 (0.82 to 0.83)), but not calibration. Adding other risk factors yielded modest, but statistically significant, improvements in the AUCs. The findings for IS and HS were similar to those for total stroke.

Conclusions: The FSRP reliably differentiated Chinese adults with incident stroke, but substantially underestimated the absolute risks of stroke. Novel local risk prediction equations that took account of differences in stroke incidence within China enhanced risk prediction of total stroke and major stroke pathological types.


eMethods II. Definitions of Stroke Types
Ischemic stroke (ICD-10 code I63), including lacunar infarction and non-lacunar infarction, was defined as a focal neurological dysfunction lasting for more than 24 hours with or without neuroimaging evidence of a cerebral infarct.
Hemorrhagic stroke was defined to include intracerebral hemorrhage (ICD-10 code I61) and subarachnoid hemorrhage (ICD-10 code I60). Intracerebral hemorrhage was defined as neurological dysfunction caused by hemorrhage into the brain parenchyma or the ventricular system, excluding those induced by injury, with or without neuroimaging evidence of brain hemorrhage. Subarachnoid hemorrhage was defined as neurological dysfunction caused by hemorrhage into the subarachnoid space, excluding those induced by injury, with or without neuroimaging evidence of such hemorrhage.
All fatal and non-fatal stroke cases were coded using ICD-10 by trained medical staff, who were blinded to other personal information, with further checking and review conducted centrally by trained medical staff. All hospital-reported cases of first stroke also underwent additional clinical adjudication, involving retrieval and review of original medical records and brain imaging reports by clinical specialists in China using a bespoke web-based system. About 92% of the reported first stroke cases had their diagnosis confirmed by brain imaging (CT or MRI). Radiological reports (but not primary brain images) of reported cases of non-fatal stroke were adjudicated by Chinese neurologists using a bespoke online system. 1

eMethods III. Testing of Proportional Hazards Assumption
The Proportional Hazards (PH) assumption for the FSRP inputs was checked using the CoxPHFitter.check_assumptions method implemented by the lifelines package 2 (version 0.21.1) in Python (version 3.7.0). This method performs a statistical test for time-varying coefficients, and provides visual plots of the scaled Schoenfeld residuals presented against four time transformations for any risk factor that violates the PH assumption. In each plot, a fitted lowess line is also presented, along with 10 bootstrapped lowess lines. Deviations of the lowess line from a constant value indicate violations of the PH assumption.
Tests of the PH assumption were performed for each FSRP risk factor and separately by sex for CKB participants in the training set (174,499 men; 253,766 women). In CKB men, anti-hypertensive treatment (p<0.005), systolic blood pressure for individuals without hypertension treatment (p<0.005), and systolic blood pressure for individuals with hypertension treatment (p=0.02) were identified as violating the PH assumption (using a p-value threshold of 0.05). In CKB women, the risk factors identified as violating the PH assumption were age (p<0.005), diabetes if under 65 years (p<0.005), diabetes if 65+ years (p=0.01), anti-hypertensive treatment (p<0.005), systolic blood pressure for individuals without hypertension treatment (p<0.005), and systolic blood pressure for individuals with hypertension treatment (p=0.04).
It is important to note that with a large sample size, such as in CKB, even very small violations of the PH assumption would test as statistically significant. To observe the impact of this effect, we repeated tests of the PH assumption for a randomly selected 10% of CKB men and women (17,449 men; 25,376 women). In men, only systolic blood pressure for individuals without hypertension treatment (p=0.02) and systolic blood pressure for individuals with hypertension treatment (p=0.01) were still identified as violating the PH assumption. However, even among these risk factors, the Schoenfeld residual plots (below) showed a lowess line with very minor deviation from a constant value.
In women, only anti-hypertensive treatment (p=0.02) and systolic blood pressure for individuals without hypertension treatment (p=0.03) were still identified as violating the PH assumption. Once again, the Schoenfeld residual plots for these risk factors (below) showed a lowess line with very minor deviation from a constant value. The results of these tests suggest that while the hazards for every FSRP risk factor may not be perfectly proportional, the PH assumption may still be appropriate in this setting.
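
The dependence of statistical significance on sample size noted above can be illustrated with a generic two-sided z-test. This is a minimal sketch only, not the Schoenfeld-residual test performed by lifelines; the function name `two_sided_p` and the values of `delta` and `sigma` are hypothetical and chosen purely for illustration.

```python
import math

def two_sided_p(delta, sigma, n):
    """Two-sided p-value of a z-test for a mean deviation `delta`
    estimated from n observations with standard deviation `sigma`."""
    z = delta * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical, very small deviation (assumed values, not CKB estimates).
delta, sigma = 0.01, 1.0

# Sample sizes matching the full and 10% sets of CKB women reported above.
for n in (253_766, 25_376):
    print(n, two_sided_p(delta, sigma, n))
```

Under these assumed values the same deviation falls below the 0.05 threshold in the full sample but not in the 10% sample, because the test statistic grows with the square root of the sample size.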

eMethods IV. Model Development and Statistical Analyses
Training Set and Test Set Split
The CKB data was divided into a training set and test set using a random 85%/15% training/test split, stratified by occurrence of the relevant endpoint (i.e., total stroke, IS, or HS) within 9 years of the baseline survey.
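
A stratified random split of this kind can be sketched as follows. This is an illustrative example under stated assumptions rather than the code used in the study: the function name `stratified_split`, the seed, and the toy labels are all hypothetical.

```python
import random

def stratified_split(labels, train_frac=0.85, seed=0):
    """Return (train_idx, test_idx), splitting each outcome class separately
    so both sets keep roughly the same proportion of events."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy data: 1 = endpoint within 9 years of baseline, 0 = no endpoint.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
```

Because each outcome class is shuffled and cut separately, the proportion of events is the same (up to rounding) in the training and test sets.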

Missing Values
Missing values in both the training set and test set were imputed using the means of the non-missing values in the training set. CKB has very few missing values. Of the 133 risk factors considered in CKB (listed with definitions in eWorkbook 1), only 16 risk factors had missing values. Out of all 503,842 individuals included in the present analyses, 521 individuals had missing values for number of siblings and siblings' medical history (stroke, heart attack, diabetes, and cancer); 817 had missing values for mother's medical history (stroke, heart attack, diabetes, and cancer); 1,124 had missing values for father's medical history (stroke, heart attack, diabetes, and cancer); 2 had missing values for weight and BMI; and 228 had missing values for body fat percentage. In addition to mean imputation, 3 binary risk factors were added to represent whether or not an individual was missing medical history for their mother, father, or siblings, respectively. This resulted in a total of 133 potential CKB inputs.
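
The imputation step can be sketched as below. This is a minimal illustration, assuming missing values are represented as `None`; the column names (`body_fat_pct`, `mother_stroke`), the indicator name (`mother_history_missing`), and the helper functions are hypothetical and are not CKB variable names.

```python
def fit_means(train_rows, columns):
    """Column means over the non-missing (not None) training values."""
    means = {}
    for col in columns:
        vals = [r[col] for r in train_rows if r[col] is not None]
        means[col] = sum(vals) / len(vals)
    return means

def impute(rows, means, flags=None):
    """Fill None with training-set means. `flags` maps a new 0/1 indicator
    name to the column whose missingness it records."""
    out = []
    for r in rows:
        new = dict(r)
        for flag, col in (flags or {}).items():
            new[flag] = int(r[col] is None)
        for col, m in means.items():
            if new[col] is None:
                new[col] = m
        out.append(new)
    return out

# Toy rows with hypothetical column names.
train = [
    {"body_fat_pct": 20.0, "mother_stroke": 0},
    {"body_fat_pct": 30.0, "mother_stroke": 1},
    {"body_fat_pct": None, "mother_stroke": None},
]
test = [{"body_fat_pct": None, "mother_stroke": None}]

means = fit_means(train, ["body_fat_pct", "mother_stroke"])
flags = {"mother_history_missing": "mother_stroke"}
train_imputed = impute(train, means, flags)
test_imputed = impute(test, means, flags)
```

The means are computed from the training rows only and then reused for the test rows, so no information from the test set enters the imputation.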

Model Construction
All of the risk equations evaluated in the present analyses were developed using the Cox proportional hazards regression model. Hazard ratios for risk factors were derived from Cox regression, while the baseline hazard function (and corresponding baseline survival function) was derived from the non-parametric Breslow estimator. 3,4 For model construction, recalibration of the baseline survival function and refitting of hazard ratios were conducted in the training set. A summary of the recalibration and refitting methodology is provided in the following diagram, and further details are provided below.

Aggregate Baseline w/ FSRP Inputs:
For the following models, a single "aggregate" baseline survival function was implemented for all individuals in the study population. We describe this as an "aggregate" baseline because the survival function was based on a combination of all individuals regardless of their area.
2017 FSRP ("2017 FSRP"): For these models, the published 2017 FSRP baseline survival function and FSRP hazard ratios were used without performing any recalibration or refitting procedures.
Recalibrated and Refitted FSRP ("+ Refitting"): For these models, an aggregate CKB baseline survival function was derived and new hazard ratios were refitted for the FSRP risk factors.

Area-specific Baselines w/ FSRP Inputs ("+ Area stratification"):
For these models, separate baseline survival functions were developed for each CKB area (10 in total) and new hazard ratios were refitted for the FSRP risk factors using an area-stratified Cox model.
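
The difference between an aggregate baseline and area-specific baselines can be sketched with a small implementation of the Breslow estimator. This is a simplified illustration, assuming the linear predictors have already been fitted; the function names, area labels, and toy data are hypothetical, and the analyses in this study were not necessarily implemented this way.

```python
import math

def breslow_baseline_survival(times, events, lin_pred, t):
    """Breslow estimate of S0(t) = exp(-H0(t)), where H0(t) sums, over event
    times t_i <= t, d_i / sum of exp(lin_pred_j) for those still at risk at t_i."""
    risk = [math.exp(lp) for lp in lin_pred]
    cum_hazard = 0.0
    event_times = sorted({ti for ti, e in zip(times, events) if e and ti <= t})
    for t_i in event_times:
        d_i = sum(1 for tj, e in zip(times, events) if e and tj == t_i)
        at_risk = sum(r for tj, r in zip(times, risk) if tj >= t_i)
        cum_hazard += d_i / at_risk
    return math.exp(-cum_hazard)

def stratified_baselines(times, events, lin_pred, strata, t):
    """One baseline survival value per stratum (for example, study area)."""
    out = {}
    for s in set(strata):
        idx = [i for i, v in enumerate(strata) if v == s]
        out[s] = breslow_baseline_survival(
            [times[i] for i in idx],
            [events[i] for i in idx],
            [lin_pred[i] for i in idx],
            t,
        )
    return out

# Toy follow-up data (years) for two hypothetical areas.
times = [2.0, 5.0, 9.0, 1.0, 3.0, 9.0]
events = [1, 0, 0, 1, 1, 0]
lin_pred = [0.0] * 6
areas = ["area_1"] * 3 + ["area_2"] * 3

aggregate = breslow_baseline_survival(times, events, lin_pred, t=9.0)
by_area = stratified_baselines(times, events, lin_pred, areas, t=9.0)
```

In this toy example the area with more events has a lower baseline survival at 9 years, which a single aggregate baseline would average away.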

Area-specific Baselines w/ CKB Inputs ("+ Additional risk factors"):
For these models, separate baseline survival functions were developed for each CKB area (10 in total). However, rather than limiting the model to FSRP inputs, 10-fold cross-validated LASSO regularization was used (within the training set) for selecting a subset of risk factors from all CKB variables. LASSO regularization was performed separately for each model, yielding slightly different numbers of selected risk factors. The specifics of the variable selection process have been previously described. 5,6 Hazard ratios for the selected risk factors were then fitted using an area-stratified Cox model on the complete training set.

Risk Equations for IS and HS:
For these models, identical procedures were performed as described above, replacing total stroke with IS and HS. However, since the FSRP is only provided as a total stroke model, the 2017 FSRP baseline survival function and hazard ratios for total stroke were also used for the IS and HS risk equations.

Model Evaluation
All models were internally validated using the test set. Risk discrimination was evaluated using the area under the receiver operating characteristic curve (AUC), and calibration was evaluated using the Greenwood-Nam-D'Agostino chi-squared statistic (χ2). Mean values and 95% confidence intervals were determined using 1000 bootstrapped samples from the test set. For comparison, the training set, test set, and bootstrapped samples were designed to be identical for all models predicting the same endpoint (i.e., all total stroke models, all ischemic stroke models, and all hemorrhagic stroke models).
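
The bootstrap evaluation of the AUC can be sketched as follows. This is an illustrative example only: it assumes a binary outcome (endpoint within 9 years or not) and a percentile-based 95% interval, neither of which is specified in detail above, and the function names and seed are hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a random case scores above a random
    non-case (ties count one half). Quadratic in sample size; sketch only."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc(scores, labels, n_boot=1000, seed=0):
    """Mean AUC and percentile 95% interval over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both cases and non-cases
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    mean = sum(stats) / n_boot
    lower = stats[int(0.025 * n_boot)]
    upper = stats[int(0.975 * n_boot) - 1]
    return mean, lower, upper
```

Fixing the seed reproduces the same resampled indices on every call, which is one way to keep the bootstrapped samples identical across models that predict the same endpoint.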

Model Reporting
In order to avoid over-optimism, all reporting about the performance of the risk prediction equations, including AUC and χ2 metrics, was based on the performance of these models in the test set only.
However, in order to capture information from all CKB individuals in our final models, we report hazard ratios in Figure 2 and eFigure II, risk predictions in eFigure III, and full risk equations in eWorkbook I after reconstructing each model using the overall study population.

Calculating a Stroke Risk Prediction for an Individual
We include an interactive tool for exploring the risk equations reported in this study. To determine an individual's 9-year predicted risk of stroke, refer to the attached "Stroke Risk Calculator" (Online supplemental file 2). In the "Model Selection Options" section in columns A and B, use the provided dropdown menus to select the model of interest.
• Select Prediction: Allows the user to select between "9Yr Total Stroke Risk", "9Yr Ischemic Stroke Risk", and "9Yr Hemorrhagic Stroke Risk".
• Select Sex: Allows the user to choose between "Male" and "Female" options.
• Select Model: Allows the user to select the model of interest from the present study.
• Select Geographic Area: Allows the user to select the area of the individual. If the selected model of interest does not require the individual's area, "N/A" will be the only available option.
After completing the "Model Selection Options", refer to columns D and E for the "Individual Risk Factor Values" section. Depending on the selected model, corresponding risk factor prompts will appear in column D. Enter the individual's risk factor values in column E. If any prompts are cut off, you may need to rewrap the text in the cell. This can be done by highlighting the relevant cell and toggling the "wrap text" button in the "Home" tab of Excel. Once all risk factor values are entered, the predicted 9-year risk will be presented in the "Model Output" section in columns G and H.

Additional Information on Reported Models
The worksheet titled "Model Params -HRs" includes a full listing of all CKB variables, the selected risk factors used in each model (displayed in green), and their corresponding hazard ratios.
The worksheet titled "Model Params -Base Surv" includes the baseline survival function, evaluated at 9 years from time-of-prediction, for each model.
The worksheet titled "Model Params -Means" includes, for all CKB variables, the average risk factor values for the CKB men and women included in the present study. If any risk factor values are unknown for an individual, refer to this worksheet and enter the corresponding mean value into the "Stroke Risk Calculator" worksheet.
The worksheet titled "Calculation" pulls in the appropriate hazard ratios, mean risk factor values, individual risk factor values, and baseline survival function values corresponding to the user's input on the "Stroke Risk Calculator" worksheet. It then walks through the calculations performed by the selected model to generate a risk prediction, which is output in cell B16.
The worksheet titled "Risk Factor Definitions" includes the full list of CKB variables and their definitions.
The worksheets titled "Risk Factor Qs" and "Options" are backend sheets used for displaying the appropriate risk factor questions and selection options, respectively, on the interactive "Stroke Risk Calculator" worksheet.
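
For readers who prefer code to a spreadsheet, the way these quantities combine can be sketched with the standard Cox risk equation using mean-centered risk factors, risk = 1 − S0(t)^exp(Σ ln(HR_k) × (x_k − mean_k)). This is a sketch under assumptions: we assume the workbook follows this standard form with hazard ratios expressed per unit of each risk factor, and every name and number below is hypothetical rather than taken from the workbook.

```python
import math

def predicted_risk(baseline_survival, hazard_ratios, values, means):
    """Risk = 1 - S0(t) ** exp(sum(ln(HR_k) * (x_k - mean_k)))."""
    lin_pred = sum(
        math.log(hazard_ratios[k]) * (values[k] - means[k])
        for k in hazard_ratios
    )
    return 1.0 - baseline_survival ** math.exp(lin_pred)

# Hypothetical inputs (not values from the workbook).
s0_9yr = 0.95                                   # baseline survival at 9 years
hrs = {"age_per_year": 1.08, "diabetes": 1.60}  # hazard ratios per unit
means = {"age_per_year": 52.0, "diabetes": 0.06}
person = {"age_per_year": 60.0, "diabetes": 1}

risk = predicted_risk(s0_9yr, hrs, person, means)
```

An individual whose risk factor values all equal the means has a linear predictor of zero, so their predicted risk is simply 1 − S0(t); this is why entering the mean for an unknown risk factor leaves the prediction unaffected by that factor.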

eFigure I. Area-specific and Aggregate CKB Baseline Survival Curves for First Total Stroke, First Ischemic Stroke, and First Hemorrhagic Stroke in Men and Women
Note: Area-specific baseline survival functions are dependent on the risk factors included in the model as well as their corresponding hazard ratios. In this figure, the area-specific baseline survival functions are shown for recalibrated and refitted models with FSRP inputs only ("+ Area stratification" models).
Note: Both Cox and Fine-Gray models were specified with FSRP inputs. Hazard ratios for the Cox models and subdistribution hazard ratios for the Fine-Gray models are reported in eWorkbook 1.