Background and purpose Previous studies, mostly focusing on European populations, have reported that polygenic risk scores (PRSs) might achieve risk stratification of stroke. We aimed to examine the association strengths of PRSs with risks of stroke and its subtypes in the Chinese population.
Methods Participants with genome-wide genotypic data in China Kadoorie Biobank were split into a potential training set (n=22 191) and a population-based testing set (n=72 150). Four previously developed PRSs were included, and new PRSs for stroke and its subtypes were developed. The PRSs showing the strongest association with risks of stroke or its subtypes in the training set were further evaluated in the testing set. Cox proportional hazards regression models were used to estimate the association strengths of different PRSs with risks of stroke and its subtypes (ischaemic stroke (IS), intracerebral haemorrhage (ICH) and subarachnoid haemorrhage (SAH)).
Results In the testing set, during 872 919 person-years of follow-up, 8514 incident stroke events were documented. The PRSs of any stroke (AS) and IS were both positively associated with risks of AS, IS and ICH (p<0.05). The HR per SD increment (HRSD) of PRSAS was 1.10 (95% CI 1.07 to 1.12), 1.10 (95% CI 1.07 to 1.12) and 1.13 (95% CI 1.07 to 1.20) for AS, IS and ICH, respectively. The corresponding HRSD of PRSIS was 1.08 (95% CI 1.06 to 1.11), 1.08 (95% CI 1.06 to 1.11) and 1.09 (95% CI 1.03 to 1.15). PRSICH was positively associated with the risk of ICH (HRSD=1.07, 95% CI 1.01 to 1.14). PRSSAH was not associated with risks of stroke or its subtypes. The addition of current PRSs offered little to no improvement in stroke risk prediction and risk stratification.
Conclusions In this Chinese population, the association strengths of current PRSs with risks of stroke and its subtypes were moderate, suggesting a limited value for improving risk prediction over traditional risk factors, given that current genome-wide association studies under-represent the East Asian population.
Data availability statement
Data are available on reasonable request. Details of how to access China Kadoorie Biobank data and details of the data release schedule are available from www.ckbiobank.org/site/Data+Access.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Polygenic risk scores (PRSs) might achieve risk stratification of stroke.
Evidence from the East Asian population (including Chinese) is lacking.
WHAT THIS STUDY ADDS
The association strengths of current PRSs with risks of stroke and its subtypes were moderate in the Chinese population.
PRS for ischaemic stroke was positively associated with the risk of intracerebral haemorrhage.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
In the Chinese population, current PRSs might have limited value for improving stroke risk prediction over traditional risk factors.
Further studies are warranted to assess whether new PRSs based on larger genome-wide association studies or other developing methods have considerable potential to translate into population health benefits.
Stroke is one of the leading causes of death and disease burden globally.1 Stroke has two main subtypes: ischaemic stroke (IS) and haemorrhagic stroke (HS). The latter can be further divided into intracerebral haemorrhage (ICH) and subarachnoid haemorrhage (SAH). With the accumulation of genomic data worldwide, the genetic background of stroke and its subtypes is gradually being revealed. The polygenic risk score (PRS), a method used to combine minor genetic effects across the whole genome, has been increasingly used in stroke research. Several studies based on European populations have developed PRSs for any stroke (AS) or IS and suggested their potential to improve risk prediction and risk stratification.2–9 The incidence of stroke in China, especially ICH, is higher than in Western countries.1 Recently, a PRS for AS was developed based on the Chinese population and showed similar association strengths in predicting the risks of IS and HS.10 However, IS and HS might have different aetiological mechanisms.11–13 Different stroke subtypes also have their specific genetic loci.14 No study has specifically developed PRSs for subtypes of stroke in the Chinese population.
The present study was based on a subcohort with genomic data from the China Kadoorie Biobank (CKB). We aimed to examine the association strengths of PRSs with risks of stroke and its subtypes in the Chinese population.
CKB is an ongoing prospective study with 512 724 participants aged 30–79 enrolled from five urban and five rural regions in China between 2004 and 2008. Details of the study have been described elsewhere.15
Among all CKB participants, there are 100 639 participants with genome-wide genotypic data. Of them, 24 657 participants were selected based on a case–control design nested within the cohort with the primary aim of studying CVD (‘case–control samples’), which formed four matched-case-control training sets (figure 1A, online supplemental methods, tables 1 and 2). The other 75 982 participants were randomly selected from the entire CKB cohort (‘population-based samples’); after excluding participants with self-reported coronary artery disease or stroke or transient ischaemic attack at baseline (n=3832), the remaining participants were used as a ‘testing set’ (n=72 150) (figure 1A, online supplemental methods).
The current study can be divided into four parts (figure 1B). (1) Validation of previous PRSs. Four previously reported stroke-related PRSs were selected for validation.2 4 5 10 (2) Development of new PRSs. Clumping and thresholding (‘C+T’) and LDpred16 were used to develop new PRSs for stroke and its subtypes based on two genome-wide association studies with large sample sizes.14 17 (3) Identification of the optimal PRS for each outcome. The performances of different PRSs in predicting each outcome were compared in the corresponding training sets. (4) Validation and evaluation of the optimal PRS for each outcome. We prospectively examined the associations between optimal PRSs and risks of stroke and its subtypes. We evaluated the impact of PRSs on the risk prediction improvement by adding the optimal PRS to traditional risk prediction models in the testing set.
Assessment of traditional stroke risk factors
The baseline questionnaire collected information on sociodemographic characteristics, lifestyle behaviours, dietary habits, and personal and family medical history.15 Traditional stroke risk factors considered in the present study included sex, age, systolic and diastolic blood pressure (SBP and DBP), smoking, body mass index (BMI), waist circumference, hypertension, diabetes and family history of stroke. Details on the collection and definition of these variables have been described in our previous work.18 19
At baseline, a 10 mL random blood sample was collected from each participant. Genotyping and imputation in this study were centrally conducted, with details provided in our previous study.19 20 Briefly, two custom-designed single nucleotide polymorphism (SNP) arrays (Affymetrix Axiom CKB array) were used for genotyping. Imputation was performed based on haplotypes derived from the 1000 Genomes Project Phase 3. There were 9.54 million genetic variants with high reliability (online supplemental figure 1).
Polygenic risk scores
We searched the PGS Catalogue,21 PubMed and Embase. Four previous stroke PRSs were selected for validation analyses (online supplemental methods and table 3).2 4 5 10 Meanwhile, we ran gwasfilter to filter genome-wide association studies (GWAS) from the GWAS Catalogue (https://www.ebi.ac.uk/gwas/).22 23 Based on ethnicity, sample size and accessibility of the summary statistics file (SSF), we finally included one AS SSF, two SAH SSFs, two ICH SSFs and two IS SSFs from two large-scale GWAS (online supplemental methods and table 4).14 17 Similar to our latest research,19 we developed new PRSs by using two methods: clumping and thresholding (‘C+T’) and LDpred16 (online supplemental methods).
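As a minimal illustration of what a PRS computes (a sketch, not the pipeline used in this study), the code below applies a p value threshold to hypothetical summary statistics and then sums effect-allele dosages weighted by log odds ratios. The clumping step on linkage disequilibrium is omitted, and the variant IDs, weights, threshold and function names are all made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical GWAS summary statistics: variant -> (log OR of effect allele, p value).
summary_stats = {
    "rs_a": (0.08, 1e-9),
    "rs_b": (-0.05, 3e-6),
    "rs_c": (0.02, 0.20),
}


def select_variants(stats, p_threshold):
    """Thresholding step of C+T: keep variants with p below the threshold.
    Clumping on linkage disequilibrium would normally come first (omitted)."""
    return {v: beta for v, (beta, p) in stats.items() if p < p_threshold}


def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0 to 2) over the selected variants."""
    return sum(beta * dosages.get(v, 0.0) for v, beta in weights.items())


def standardise(scores):
    """Rescale scores to mean 0 and SD 1 so effects can be read per SD."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]


weights = select_variants(summary_stats, p_threshold=1e-5)
people = [
    {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    {"rs_a": 1, "rs_b": 1, "rs_c": 2},
    {"rs_a": 0, "rs_b": 2, "rs_c": 0},
]
raw = [polygenic_score(p, weights) for p in people]
z = standardise(raw)
```

Standardising the score is what allows associations to be reported per SD increment, as in the ORSD and HRSD estimates in this article.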
Ascertainment of stroke outcomes
All participants were followed up for morbidity and mortality since their baseline enrolment. Incident events were identified by linking with local disease and death registries and the national health insurance database and supplemented by active follow-up.15 In the testing set, only 653 (0.91%) were lost to follow-up before censoring on 31 December 2018. Trained staff blinded to baseline information coded all events using the International Classification of Diseases, 10th Revision (ICD-10). Incident stroke events during the follow-up were defined as I60–I64, including SAH (I60), ICH (I61), other nontraumatic intracranial haemorrhage (I62), IS (I63) and unspecified stroke (I64). In the testing set, the events coded as I62 and I64 accounted for only 0.9% (n=76) and 3.5% (n=302) of all incident stroke events.
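The outcome definition above can be written as a small lookup. This is only a sketch of the mapping stated in the text; the function name and the handling of decimal subcodes (eg, I63.9) are assumptions.

```python
# ICD-10 three-character categories used to define incident stroke (I60-I64).
SUBTYPE_BY_ICD10 = {
    "I60": "SAH",
    "I61": "ICH",
    "I62": "other nontraumatic intracranial haemorrhage",
    "I63": "IS",
    "I64": "unspecified stroke",
}


def classify_stroke(icd10_code):
    """Return the stroke subtype for an ICD-10 code, or None if outside I60-I64.
    Only the three-character category is used, so 'I63.9' maps to 'IS'."""
    prefix = icd10_code.strip().upper()[:3]
    return SUBTYPE_BY_ICD10.get(prefix)
```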
Since 2014, medical records of incident stroke cases have been retrieved and reviewed by qualified cardiovascular specialists blinded to baseline information. According to a previous study,24 by October 2018, the reporting accuracy was 91.7%, 90.4% and 82.7% for IS, ICH and SAH, respectively; the corresponding diagnostic accuracy was 93.1% (including silent lacunar infarction), 98.2% and 98.1%.
Identification of the optimal PRS in the training set
In each training set, we used the conditional logistic regression model to measure the association of each PRS with the risk of the corresponding stroke outcome, stratified by the case–control pair, with the top 10 principal components of ancestry (PCA) and array versions as the covariates. We defined the optimal PRS as the PRS with the highest OR per SD, as our previous study did.19
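For 1:1 matched pairs and a single covariate, the conditional likelihood depends only on the within-pair difference in the covariate, which makes the model easy to sketch. The code below is a simplified illustration under that assumption: it fits the log OR per SD by Newton-Raphson and omits the ancestry principal components and array version that the analysis adjusted for. The function name and the example differences are hypothetical.

```python
import math


def fit_matched_pairs(diffs, iterations=25):
    """Conditional logistic regression for 1:1 matched pairs, one covariate.

    diffs[i] is the standardised PRS of the case minus that of its matched
    control. Each pair contributes 1 / (1 + exp(-beta * d)) to the likelihood.
    Returns the log OR per SD and an approximate standard error.
    """
    beta = 0.0
    info = 0.0
    for _ in range(iterations):
        grad = 0.0
        info = 0.0
        for d in diffs:
            p = 1.0 / (1.0 + math.exp(-beta * d))
            grad += d * (1.0 - p)          # score (first derivative)
            info += d * d * p * (1.0 - p)  # observed information
        beta += grad / info                # Newton-Raphson update
    return beta, 1.0 / math.sqrt(info)


# Made-up within-pair PRS differences (case minus control), in SD units.
diffs = [1.2, 0.4, -0.3, 0.8, -0.6, 0.5, 0.1, -0.2]
beta, se = fit_matched_pairs(diffs)
or_per_sd = math.exp(beta)
ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
```

The estimate is finite only when the differences have mixed signs; if every case had a higher score than its control, the likelihood would keep increasing with beta.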
Validation and evaluation of the optimal PRS in the testing set
In the testing set, we used the Cox regression model to measure the association of optimal PRSs with risks of stroke and stroke subtypes. The model was stratified by sex and ten study regions, with age as the time scale and adjusting for the top 10 PCA and array versions. We further adjusted for SBP, BMI and family history of stroke in sensitivity analyses. We evaluated the proportional hazards assumptions by examining Schoenfeld residuals. Either non-existent or minimal deviations were observed. In subgroup analyses, the tests for multiplicative interaction were performed using likelihood ratio tests by comparing models with and without cross-product terms between the stratifying variable and PRS.
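The model described above corresponds to a standard stratified Cox formulation, which can be written as follows. The notation is ours rather than the original authors': s indexes the sex-by-region strata, t is attained age, mu and sigma are the mean and SD of the PRS, and c collects the top 10 principal components of ancestry and the array version.

```latex
h_{s}(t \mid \mathrm{PRS}, \mathbf{c})
  = h_{0s}(t)\,
    \exp\!\left( \beta \,\frac{\mathrm{PRS} - \mu}{\sigma}
                 + \boldsymbol{\gamma}^{\top}\mathbf{c} \right),
\qquad
\mathrm{HR}_{\mathrm{SD}} = e^{\beta}
```

Each stratum has its own baseline hazard h_{0s}(t), while beta is shared across strata, so HRSD is the hazard ratio for a one SD higher PRS within a stratum.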
To evaluate the impact of PRS on risk prediction improvement, we defined the ‘CKB-CVD models’ as the traditional risk prediction models, as our previous study did.19 The ‘CKB-CVD models’ distinguish risks of IS and haemorrhagic stroke and have good discrimination without relying on blood lipids.18 We added the PRS to traditional models to get a ‘PRS-enhanced model’. We assessed the discrimination performance by using Harrell’s C.25 We used the net reclassification improvement (NRI) and integrated discrimination improvement to evaluate model reclassification before and after the addition of PRS.26
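To make the two performance measures concrete, the sketch below computes Harrell's C for right-censored data and a two-category NRI at a single risk threshold. It is a simplified illustration, not the software used in the study: the NRI version treats the outcome as binary and ignores censoring, and all function names and example values are hypothetical.

```python
def harrell_c(times, events, risk):
    """Harrell's C: among usable pairs, the share in which the participant
    with the earlier observed event has the higher predicted risk.
    Ties in predicted risk count as 0.5."""
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Usable pair: i has an observed event strictly before j's time.
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


def categorical_nri(old_risk, new_risk, had_event, threshold=0.10):
    """Two-category NRI: net proportion of events moved above the threshold
    plus net proportion of non-events moved below it."""
    up_e = down_e = up_n = down_n = 0
    n_events = sum(1 for e in had_event if e)
    n_nonevents = len(had_event) - n_events
    for old, new, ev in zip(old_risk, new_risk, had_event):
        moved_up = old < threshold <= new
        moved_down = new < threshold <= old
        if ev:
            up_e += moved_up
            down_e += moved_down
        else:
            up_n += moved_up
            down_n += moved_down
    return (up_e - down_e) / n_events + (down_n - up_n) / n_nonevents


c_index = harrell_c(
    times=[2, 4, 6, 8], events=[1, 1, 0, 1], risk=[0.9, 0.3, 0.5, 0.1]
)
nri = categorical_nri(
    old_risk=[0.08, 0.12, 0.05, 0.11],
    new_risk=[0.11, 0.13, 0.04, 0.09],
    had_event=[1, 1, 0, 0],
)
```

In the toy data, four of the five usable pairs are concordant, and one event is reclassified upwards while one non-event is reclassified downwards across the 10% threshold.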
The study adhered to both the PRS Reporting Standards and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement for cohort studies (online supplemental file 2).27 28 Analyses were done with Stata (V.17.0, StataCorp) and R (V.4.0.3). All statistical tests were two sided with α=0.05.
Selection of the optimal PRSs in the training sets
In this study, four 1:1 matched training sets were defined to identify the optimal PRS for AS (7412 pairs), IS (3844 pairs), ICH (4296 pairs) and SAH (359 pairs) (figure 1, online supplemental methods). Among the training sets, 72.7%, 61.6%, 77.9% and 63.8% of the participants were from rural areas in China; 51.9%, 50.5%, 53.4% and 38.4% of the participants were men, respectively. Among the cases, the median age of disease onset (25th–75th percentile) was 65.3 (57.0–72.0), 64.1 (56.1–70.6), 65.9 (57.7–73.0) and 61.0 (53.8–69.2) years, respectively. Among all training sets, the proportion of the control group using the first version of the SNP array was lower than that of the case group (p<0.001) (online supplemental table 2). The performance of PRS for AS and IS developed in previous studies was not better than that of the newly developed PRS in the present study (table 1, online supplemental table 5). The optimal PRS for AS came from the LDpred method, and the optimal PRS for IS, ICH and SAH came from the C+T method. The ORSD (95% CI) of the optimal PRSs was 1.14 (1.10 to 1.18) for AS, 1.18 (1.13 to 1.24) for IS, 1.10 (1.05 to 1.15) for ICH and 1.25 (1.06 to 1.47) for SAH (table 1, online supplemental table 5).
Associations of PRSs with stroke and its subtypes in the testing set
The testing set included 72 150 Chinese participants, of which 59.8% were women. The median age was 50.6 years in women and 51.9 years in men. During 872 919 person-years of follow-up (over 12 years on average), 8514 incident stroke events were documented, including 7507 IS, 1193 ICH and 132 SAH (table 2). The correlations among the optimal PRSs were weak (all correlation coefficients<0.2) (online supplemental figure 2).
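The follow-up figures above can be checked with simple arithmetic. The short sketch below derives the mean follow-up and a crude incidence rate from the reported totals; the crude rate is our own derived quantity, not a figure reported in the article.

```python
participants = 72_150
person_years = 872_919
stroke_events = 8_514

# Mean follow-up per participant, in years.
mean_follow_up = person_years / participants

# Crude incidence rate per 1000 person-years (not adjusted for age or sex).
crude_rate_per_1000 = stroke_events / person_years * 1000
```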
The PRSAS and PRSIS were both positively associated with risks of AS, IS and ICH (p<0.05). The HRSD (95% CIs) of PRSAS was 1.10 (1.07 to 1.12), 1.10 (1.07 to 1.12) and 1.13 (1.07 to 1.20) for AS, IS and ICH, respectively. The corresponding HRSD (95% CIs) of PRSIS was 1.08 (1.06 to 1.11), 1.08 (1.06 to 1.11) and 1.09 (1.03 to 1.15) (figure 2, online supplemental table 6). PRSICH was positively associated with the risk of ICH in the whole testing set (HRSD=1.07), though it was not statistically significant in women (p for sex interaction=0.056) (figure 2C). PRSSAH was not associated with risks of any outcomes (figure 2). A strong association of PRSAS with the risk of SAH (HRSD=1.38, 95% CI 1.03 to 1.87) was observed in men but not in women (p for sex interaction=0.055) (figure 2D).
In sensitivity analyses, the associations of PRSs with risks of stroke and its subtypes did not change significantly after additional adjustment for SBP, BMI and family history of stroke (online supplemental table 6). In subgroup analyses, there was no strong evidence supporting a different association strength across subgroups for IS and ICH after considering multiple testing (p for interaction>0.05/8) (online supplemental figures 3 and 4).
Addition of the optimal PRS to traditional risk prediction models
Based on the traditional models defined in this study, the addition of the PRS did not improve, or only slightly improved, the discrimination performance of the models. For IS, the addition of PRSAS increased Harrell’s C by 0.0010 in men (p=0.002). For haemorrhagic stroke, the addition of PRSs did not influence Harrell’s C significantly (p>0.05) (figure 3). The addition of the PRS offered little to no improvement in stroke risk stratification. For example, the categorical NRIs at the 10% high-risk threshold for ischaemic and haemorrhagic stroke were not significant in either sex (p>0.05) (online supplemental table 7).
In this study based on the largest biobank in the Chinese population, only moderate associations were observed between PRSs and risks of stroke and its subtypes, with an HRSD of about 1.10. The addition of current PRSs offered little to no improvement in stroke risk prediction and risk stratification. We also found that the PRSs developed from GWAS summary statistics of IS were positively associated with the risk of ICH.
In the present study, the associations of PRSs with risks of stroke and its subtypes were moderate, suggesting a limited value for improving risk prediction over traditional risk factors. The HRSD for PRS was usually greater than 1.20 in previous studies of the general population. A PRS for IS (PGS000039) that was developed with the metaGRS method and combined PRSs of 5 stroke subtypes and 14 stroke-related traits had an HRSD of 1.26 (95% CI 1.22 to 1.31) in the European population.5 Another PRS for stroke (PGS002259) was also developed using the metaGRS method in a Chinese population, with the HRSD for stroke being 1.28 (95% CI 1.21 to 1.36).10 However, these two PRSs showed much weaker associations with the risk of stroke or IS in the present study than in previous studies. Since both PRSs were developed using the elastic-net logistic regression, a machine learning approach, the potential overfitting may undermine their generalisation performance.
The incidence rate of ICH is much higher in Chinese than in European populations. However, non-European populations are under-represented in GWAS, which serve as the basis for PRS development. The largest GWAS for ICH included only 3400 ICH cases, with most of them from European populations.17 The present study attempted to develop a PRS for ICH based on summary statistics from this GWAS. The weak associations observed in the present study may reflect differences in genetic background between ethnic groups or insufficient power of this GWAS. The stronger association estimate between PRS and HS risk reported in the previous study was likely due to the inclusion of PRSs for risk factors of HS (such as blood pressure) in the metaGRS method.10 It is worth mentioning that, in the present study, the PRSs directly developed from GWAS summary statistics of IS were also positively associated with the risk of ICH. Although there are differences in aetiology and risk factor profile between IS and ICH,11–13 they might also have some partially shared aetiological mechanisms, such as cerebral small-vessel disease.29
This study has the following strengths. The large sample size and the large number of stroke events (including IS and ICH) enabled us to separate well-powered training sets from the testing set and to conduct subgroup analyses. The loss to follow-up rate was less than 1% over an average follow-up period of more than 12 years in CKB. The main subtypes of stroke (ie, IS, ICH and SAH) were well classified, and the reporting and diagnostic accuracy of stroke events were high.24 The genotyping and imputation of genetic data in this study were centrally conducted through a standard quality control process. Genetic variants with high reliability covered the whole genome well.
However, several limitations merit consideration. First, we did not further consider the subtypes of IS (eg, large-artery atherosclerotic stroke, cardioembolic stroke and small vessel stroke) as over 75% of the incident IS events were coded as unspecified IS (ICD-10: I63.9), which precluded us from conducting more detailed analyses. Previous studies have suggested that genetic loci differ between IS subtypes.14 30 Subsequent studies can explore whether distinguishing IS subtypes can further improve the predictive ability of PRS for IS. Second, compared with IS and ICH, the number of SAH events was relatively small. Therefore, chance cannot be excluded as an explanation for the positive SAH results observed in the present study. Further studies with more SAH events are warranted to examine our findings. Third, strand-ambiguous SNPs (ie, A/T, C/G) and variants that were not found in CKB or had low imputation quality scores were removed during the standard quality control process of PRSs. This might weaken the associations of previous PRSs with stroke and its subtypes. Fourth, because information on blood lipids was not available for the current study population, we were unable to compare the impacts of blood lipids and PRS on the improvement of traditional stroke risk prediction models. However, the addition of blood lipids may enhance the traditional non-laboratory-based models, as previous studies have shown.31 32 Therefore, adding PRS to a ‘lipid-enhanced model’ might lead to a smaller improvement than we observed in the present study.
In this Chinese population, the associations of optimal PRSs with risks of stroke and its subtypes were moderate, suggesting a limited value for improving risk prediction over traditional risk factors in the context of current GWAS under-representing the East Asian population. As GWAS of stroke and its subtypes progress among East Asians, further studies are warranted to assess whether new PRSs have considerable potential to translate into precision public health and population health benefits and, if so, to determine the appropriate context for their use.
Patient consent for publication
This study involves human participants and CKB had ethical approvals from the Ethical Review Committee of the Chinese Center for Disease Control and Prevention (Beijing, China) (approval notice: 005/2004) and the Oxford Tropical Research Ethics Committee, University of Oxford (UK) (reference: 025-04). Participants gave informed consent to participate in the study before taking part.
The most important acknowledgment is to the participants in the study and the members of the survey teams in each of the 10 regional centres, as well as to the project development and management teams based in Beijing, Oxford and the 10 regional centres.
Collaborators International Steering Committee: Junshi Chen, Zhengming Chen (PI), Robert Clarke, Rory Collins, Yu Guo, Liming Li (PI), Jun Lv, Richard Peto, Robin Walters. International Co-ordinating Centre, Oxford: Daniel Avery, Derrick Bennett, Ruth Boxall, Sue Burgess, Ka Hung Chan, Yumei Chang, Yiping Chen, Zhengming Chen, Johnathan Clarke, Robert Clarke, Huaidong Du, Ahmed Edris Mohamed, Zammy Fairhurst-Hunter, Hannah Fry, Simon Gilbert, Alex Hacker, Mike Hill, Michael Holmes, Pek Kei Im, Andri Iona, Maria Kakkoura, Christiana Kartsonaki, Rene Kerosi, Kuang Lin, Mohsen Mazidi, Iona Millwood, Sam Morris, Qunhua Nie, Alfred Pozarickij, Paul Ryder, Saredo Said, Sam Sansome, Dan Schmidt, Paul Sherliker, Rajani Sohoni, Becky Stevens, Iain Turnbull, Robin Walters, Lin Wang, Neil Wright, Ling Yang, Xiaoming Yang, Pang Yao. National Co-ordinating Centre, Beijing: Yu Guo, Xiao Han, Can Hou, Jun Lv, Pei Pei, Chao Liu, Canqing Yu, Qingmei Xia. 10 Regional Co-ordinating Centres: Qingdao CDC: Zengchang Pang, Ruqin Gao, Shanpeng Li, Haiping Duan, Shaojie Wang, Yongmei Liu, Ranran Du, Yajing Zang, Liang Cheng, Xiaocao Tian, Hua Zhang, Yaoming Zhai, Feng Ning, Xiaohui Sun, Feifei Li. Licang CDC: Silu Lv, Junzheng Wang, Wei Hou. Heilongjiang Provincial CDC: Wei Sun, Shichun Yan, Xiaoming Cui. Nangang CDC: Chi Wang, Zhenyuan Wu, Yanjie Li, Quan Kang. Hainan Provincial CDC: Huiming Luo, Tingting Ou. Meilan CDC: Xiangyang Zheng, Zhendong Guo, Shukuan Wu, Yilei Li, Huimei Li. Jiangsu Provincial CDC: Ming Wu, Yonglin Zhou, Jinyi Zhou, Ran Tao, Jie Yang, Jian Su. Suzhou CDC: Fang Liu, Jun Zhang, Yihe Hu, Yan Lu, Liangcai Ma, Aiyu Tang, Shuo Zhang, Jianrong Jin, Jingchao Liu. Guangxi Provincial CDC: Mei Lin, Zhenzhen Lu. Liuzhou CDC: Lifang Zhou, Changping Xie, Jian Lan, Tingping Zhu, Yun Liu, Liuping Wei, Liyuan Zhou, Ningyu Chen, Yulu Qin, Sisi Wang. Sichuan Provincial CDC: Xianping Wu, Ningmei Zhang, Xiaofang Chen, Xiaoyu Chang. Pengzhou CDC: Mingqiang Yuan, Xia Wu, Xiaofang Chen, Wei Jiang, Jiaqiu Liu, Qiang Sun. Gansu Provincial CDC: Faqing Chen, Xiaolan Ren, Caixia Dong. Maiji CDC: Hui Zhang, Enke Mao, Xiaoping Wang, Tao Wang, Xi Zhang. Henan Provincial CDC: Kai Kang, Shixian Feng, Huizi Tian, Lei Fan. Huixian CDC: XiaoLin Li, Huarong Sun, Pan He, Xukui Zhang. Zhejiang Provincial CDC: Min Yu, Ruying Hu, Hao Wang. Tongxiang CDC: Xiaoyi Zhang, Yuan Cao, Kaixu Xie, Lingli Chen, Dun Shen. Hunan Provincial CDC: Xiaojun Li, Donghui Jin, Li Yin, Huilin Liu, Zhongxi Fu. Liuyang CDC: Xin Xu, Hao Zhang, Jianwei Chen, Yuan Peng, Libo Zhang, Chan Qu.
Contributors JL conceived and designed the study. LL, ZC and JC: members of the China Kadoorie Biobank Steering Committee, designed and supervised the whole study, obtained funding, and, together with CY, YG, DiS, YP, PP, LY, YC, HD, YL, SB, DA, IYM and RGW: acquired the data. SY, ZS and DoS analysed the data. SY drafted the manuscript. CY, YP, DiS and RC helped to interpret the results. JL contributed to the critical revision of the manuscript for important intellectual content. All authors reviewed and approved the final manuscript. JL is the guarantor.
Funding This work was supported by the National Natural Science Foundation of China (82192904, 82192901, 82192900). The CKB baseline survey and the first re-survey were supported by a grant from the Kadoorie Charitable Foundation in Hong Kong. The long-term follow-up is supported by grants from the UK Wellcome Trust (212946/Z/18/Z, 202922/Z/16/Z, 104085/Z/14/Z, 088158/Z/09/Z), grants (2016YFC0900500) from the National Key R&D Program of China, National Natural Science Foundation of China (81390540, 91846303, 81941018) and Chinese Ministry of Science and Technology (2011BAI09B01).
Disclaimer The funders had no role in the study design, data collection, data analysis, data interpretation, or writing of the report.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.