Whole genome sequencing of 10K patients with acute ischaemic stroke or transient ischaemic attack: design, methods and baseline patient characteristics

Si Cheng; Zhe Xu; Yang Liu; Jinxi Lin; Yong Jiang; Yilong Wang; Xia Meng; Anxin Wang; Xinying Huang; Zhimin Wang; Guohua Chen; Songdi Wu; Zhengchang Jia; Yongming Chen; Xuerong Qiu; Jun Wu; Binbin Song; Weizhong Ji; Zhongping An; Wenjun Xue; Lili Zhao; Yu Geng; Hongyan Li; Hao Li; Yongjun Wang

doi:10.1136/svn-2020-000664

Protocol

Si Cheng1,2,3,
Zhe Xu1,2,3,
Yang Liu1,2,3,
Jinxi Lin1,2,
Yong Jiang1,2,
Yilong Wang1,2,
Xia Meng1,2,
Anxin Wang1,2 (http://orcid.org/0000-0003-4351-2877),
Xinying Huang1,2,
Zhimin Wang4,
Guohua Chen5,
Songdi Wu6,
Zhengchang Jia7,
Yongming Chen8,
Xuerong Qiu9,
Jun Wu10,
Binbin Song11,
Weizhong Ji12,
Zhongping An13,
Wenjun Xue14,
Lili Zhao15,
Yu Geng16,
Hongyan Li17,
Hao Li1,2 (http://orcid.org/0000-0002-8591-4105),
Yongjun Wang1,2,3 (http://orcid.org/0000-0002-9976-2341)

¹ Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
² China National Clinical Research Center for Neurological Diseases, Beijing, China
³ Advanced Innovation Center for Human Brain Protection, Capital Medical University, Beijing, China
⁴ Department of Neurology, The First People's Hospital of Taizhou, Taizhou, China
⁵ Department of Neurology, Wuhan First Hospital, Wuhan, China
⁶ Department of Neurology, The First People's Hospital of Xi'an, Xi'an, China
⁷ Department of Neurology, The Second People’s Hospital of Jinzhong, Jinzhong, China
⁸ Department of Neurology, WuYuan County People’s Hospital, Bayannur, China
⁹ Department of Neurology, Qiqihar City Rongjian Stroke Prevention and Treatment Institute, Qiqihar, China
¹⁰ Department of Neurology, Peking University Shenzhen Hospital, Shenzhen, China
¹¹ Department of Neurology, Luoyang Central Hospital, Luoyang, China
¹² Department of Neurology, Qinghai Provincial People's Hospital, Xining, China
¹³ Department of Neurology, Tianjin Huanhu Hospital, Tianjin, China
¹⁴ Department of Neurology, Pingdingshan First People's Hospital, Pingdingshan, China
¹⁵ Department of Neurology, Changzhi People's Hospital, Changzhi, China
¹⁶ Department of Neurology, Zhejiang Provincial People's Hospital, Hangzhou, China
¹⁷ Department of Neurology, Xinjiang Uygur Autonomous Region People's Hospital, Urumqi, China

Correspondence to Dr Yongjun Wang; yongjunwang{at}ncrcnd.org.cn

Abstract

Background and purpose Stroke is the second leading cause of death worldwide and the leading cause of mortality and long-term disability in China, but its underlying risk genes and pathways are far from comprehensively understood. Here we describe the design and methods of whole genome sequencing (WGS) for 10 914 patients with acute ischaemic stroke or transient ischaemic attack from the Third China National Stroke Registry (CNSR-III).

Methods Baseline clinical characteristics of the included patients are reported. DNA was extracted from the white blood cells of participants. Libraries are constructed from qualified DNA, and WGS is conducted on the BGISEQ-500 platform. The average depth is intended to be greater than 30× for each subject. Sentieon software is then applied to process the sequencing data under the Genome Analysis Toolkit best practice guidance to call genotypes of single nucleotide variants (SNVs) and insertion-deletions. For each included subject, 21 fingerprint SNVs are genotyped by MassARRAY assays to verify that the DNA sample and the sequencing data originate from the same individual. Copy number variations and structural variations are also called for each patient. All genetic variants are annotated, and their effects predicted, using bioinformatics software or by reviewing public databases.

Results The average age of the 10 914 included patients was 62.2±11.3 years, and 31.4% of the patients were women. Most baseline clinical characteristics were balanced between the 10 914 included patients and the excluded patients.

Conclusions The WGS data, together with the abundant clinical and imaging data of CNSR-III, could provide an opportunity to elucidate the molecular mechanisms of stroke and to discover novel therapeutic targets.

Keywords: stroke; genetic

Data availability statement

Data in this article are available upon reasonable request.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Introduction

Stroke is the second leading cause of death worldwide, and the leading cause of mortality and long-term disability in China.1 Ischaemic stroke (IS) is the most common type of stroke, accounting for about 80% of all strokes,2 and more than 90% of IS cases are sporadic.3 IS is a complex multifactorial disease arising from complicated gene-environment interactions. Uncovering the genetic contributions to IS could therefore help to identify the genes, pathways and networks involved in IS pathogenesis. Although several novel genetic variants associated with IS susceptibility have been discovered in recent decades,4–9 few studies have explored the correlation between genetic variants and stroke outcomes. Moreover, previous genetic studies on IS were mainly conducted in European and African populations,4 10 and data for the Chinese population are limited. Because of the substantial ancestral differences,11 whether the reported IS-associated genetic variants also contribute to IS pathogenesis in the Chinese population needs verification.

The Third China National Stroke Registry (CNSR-III) is a nationwide prospective registry of 15 166 patients with IS or transient ischaemic attack (TIA) in China.12 A broad and comprehensive spectrum of individual-level data has been collected, including clinical phenotypes, aetiological classification, neuroimaging, biomarkers and clinical outcomes. The aetiological subtyping information was recorded centrally. Taking advantage of these data, we perform whole genome sequencing (WGS) for 10 914 patients in the prespecified genetic substudy of CNSR-III to delineate the genetic landscape of IS and TIA in the Chinese population.

Methods

Patients

The CNSR-III is a nationwide prospective registry of patients who presented to hospitals with acute ischaemic cerebrovascular events between August 2015 and March 2018 in China.12 A total of 15 166 patients with IS (n=14 146, 93.3%) or TIA (n=1020, 6.7%) were enrolled within 7 days of symptom onset. The CNSR-III involved 201 hospitals covering 22 provinces and 4 municipalities in China, including 163 grade III (central hospitals for a certain district or city, usually teaching hospitals) and 38 grade II (hospitals serving several communities) urban hospitals. A total of 12 603 patients participated in the prespecified genetic substudy. White blood cells (WBCs) from 10 914 of these patients are used for WGS (figure 1). Written informed consent was obtained from all patients or their legally authorised representatives before entry into the study.

Figure 1

Flow chart of patient selection for WGS in the prespecified genetic substudy of CNSR-III. CNSR-III, The Third China National Stroke Registry; IS, ischaemic stroke; TIA, transient ischaemic attack; WGS, whole genome sequencing.

DNA extraction

For each sample, genomic DNA was extracted from WBCs, either by automated extraction and purification using the Magnetic Blood Genomic DNA Kit (DP329, TIANGEN Biotech Co Ltd, Beijing, China) on the KingFisher Flex system (Thermo Scientific Co, Massachusetts, USA) at iGeneTech Co Ltd (Beijing, China), or by manual phenol–chloroform extraction at BGI Genomics (BGI-Shenzhen).

Evaluation of DNA quality

The concentration of genomic DNA was quantified using a Qubit 2.0 fluorometer (Thermo Scientific Co, Massachusetts, USA) and a SpectraMax Gemini XPS (Molecular Devices, San Francisco, USA) at BGI Genomics (BGI-Shenzhen). Electrophoresis was conducted on a 1% agarose gel to confirm that the majority of genomic DNA fragments were longer than 20 kb and that the DNA was not substantially degraded. Genomic DNA samples with a concentration ≥12.5 ng/µL and a total amount ≥0.5 µg were considered qualified for further procedures. For each qualified sample, the DNA is used for library construction and the subsequent WGS process, as well as for single nucleotide variant (SNV) genotyping (see details below).
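The two numerical thresholds above can be expressed as a simple check. The following is a minimal sketch only: the record layout (`sample_id`, `volume_ul`) is hypothetical and not part of the study's pipeline, and the gel-based fragment length assessment is not modelled.

```python
from dataclasses import dataclass

MIN_CONCENTRATION_NG_PER_UL = 12.5  # threshold stated in the text
MIN_TOTAL_UG = 0.5                  # threshold stated in the text


@dataclass
class DnaSample:
    sample_id: str        # hypothetical identifier
    concentration: float  # measured concentration in ng/µL
    volume_ul: float      # available volume in µL (hypothetical field)

    @property
    def total_ug(self) -> float:
        # ng/µL multiplied by µL gives ng; divide by 1000 to obtain µg
        return self.concentration * self.volume_ul / 1000.0


def is_qualified(sample: DnaSample) -> bool:
    """Return True when both the concentration and total amount thresholds are met."""
    return (sample.concentration >= MIN_CONCENTRATION_NG_PER_UL
            and sample.total_ug >= MIN_TOTAL_UG)
```

For example, a sample at 20 ng/µL in 50 µL holds 1.0 µg and passes, whereas a sample at 15 ng/µL in 20 µL holds only 0.3 µg and fails despite an adequate concentration.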

Library construction

The qualified genomic DNA is randomly fragmented by ultrasound using a Covaris LE220 (Covaris, Massachusetts, USA) according to the manufacturer’s instructions. DNA fragments in the range of 200 to 400 bp are selected with VAHTS™ DNA Clean Beads (Vazyme Biotech Co, Ltd, Nanjing, China). The fragments are end-repaired, and an ‘A’ nucleotide is added to the 3’ end of each strand. Afterwards, the dTTP-tailed adapters are ligated to both ends of the repaired/dA-tailed DNA fragments. The ligation product is then amplified by PCR, and the products are purified with VAHTS™ DNA Clean Beads (Vazyme Biotech Co, Ltd, Nanjing, China). Purified PCR products with a total mass ≥200 ng and a main peak between 300 and 500 bp are used for the subsequent steps. Single strand separation is conducted by heat-denaturing the PCR product at 95 °C. Circularisation is performed by mixing the single-stranded DNA fragments with splint oligos (sequence: GCCATGTCGTTCTGTGAGCCAAGG) and DNA Rapid Ligase to generate single-stranded DNA circles. The remaining linear molecules are digested with exonuclease. The enzymatic digestion products are purified with Agencourt AMPure XP medium (Beckman Coulter, Indiana, USA). The single-stranded circle DNA (ssCir DNA) constitutes the final library. The purified enzymatic digestion products are quantified with the Qubit ssDNA Assay Kit (Thermo Scientific Co, Massachusetts, USA), and the final yield should be ≥12 ng.

BGISEQ-500 WGS sequencing

Rolling circle amplification is performed on the qualified libraries to produce DNA Nanoballs (DNBs). The DNBs are then loaded into patterned nanoarrays, and 100 bp paired-end reads are sequenced on the BGISEQ-500 platform (BGI Genomics, Shenzhen, China). Sequencing-derived raw image files are processed by the BGISEQ-500 base-calling software (V.1.2.1.21840) under default parameter settings. The sequence data are stored in FASTQ format. The average depth for each subject is intended to be greater than 30×.
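To give a sense of scale for the 30× target, average depth can be approximated as total sequenced bases divided by genome size. The sketch below assumes a genome size of about 3.1 Gb, which is an illustrative round figure and not a value given in this protocol; it also ignores reads lost to filtering, duplication and unmapped sequence.

```python
import math

ASSUMED_GENOME_SIZE = 3_100_000_000  # assumption: approximate human genome length in bp
READ_LENGTH = 100                    # 100 bp paired-end reads, as stated in the text


def mean_depth(read_pairs: int,
               read_length: int = READ_LENGTH,
               genome_size: int = ASSUMED_GENOME_SIZE) -> float:
    """Approximate average depth from the number of read pairs."""
    return 2 * read_pairs * read_length / genome_size


def pairs_for_depth(target_depth: float,
                    read_length: int = READ_LENGTH,
                    genome_size: int = ASSUMED_GENOME_SIZE) -> int:
    """Smallest number of read pairs that reaches the target average depth."""
    return math.ceil(target_depth * genome_size / (2 * read_length))
```

Under these assumptions, `pairs_for_depth(30)` gives 465 million read pairs per subject, before any allowance for reads removed in quality control.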

SNV genotyping

To make sure that the DNA samples are neither mixed up nor contaminated during the WGS process, we selected 21 biallelic fingerprint SNVs and planned to genotype them for each WGS participant. These 21 SNVs are distributed across 15 different autosomes and are at least 13 Mb apart. The minor allele frequencies of these SNVs are between 0.16 and 0.5 in the Han Chinese in Beijing samples of the 1000 Genomes Project.13 The SNV genotyping experiments are performed at BGI Genomics (BGI-Shenzhen) independently of, and simultaneously with, WGS. For each sample, approximately 30 ng of qualified genomic DNA is used. Locus-specific PCR and detection primers are designed using the MassARRAY Assay Design software (Agena Bioscience, California, USA). Multiplex PCR and locus-specific single-nucleotide extension are performed for each DNA sample, and the products are then desalted and transferred to a 384-well SpectroCHIP array. After MALDI-TOF (matrix-assisted laser desorption/ionization-time of flight) mass spectrometry, MassArray Typer software (V.4.1, Agena Bioscience, California, USA) is used to call the genotype for each participant.

After WGS and SNV genotyping are completed, the genotypes of the 21 SNVs obtained from WGS data analyses are compared with those obtained from MALDI-TOF mass spectrometry to verify that the DNA sample and the sequencing data originate from the same individual.
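The comparison of the two genotype sets can be sketched as follows. This is an illustration only: the dictionary layout, the SNV identifiers and the acceptance thresholds (`min_compared`, `max_mismatches`) are assumptions, since the protocol does not state how many discordant or missing calls would be tolerated.

```python
def normalise(genotype: str) -> str:
    """Sort the alleles so that 'AG' and 'GA' compare as equal."""
    return "".join(sorted(genotype.upper()))


def concordance(wgs: dict[str, str], array: dict[str, str]) -> tuple[int, int]:
    """Return (matching calls, compared calls) over SNVs called by both methods."""
    shared = [snv for snv in wgs if snv in array and wgs[snv] and array[snv]]
    matches = sum(normalise(wgs[snv]) == normalise(array[snv]) for snv in shared)
    return matches, len(shared)


def same_individual(wgs: dict[str, str], array: dict[str, str],
                    min_compared: int = 18, max_mismatches: int = 0) -> bool:
    """Hypothetical acceptance rule: enough SNVs compared and no more than
    max_mismatches discordant genotypes."""
    matches, compared = concordance(wgs, array)
    return compared >= min_compared and (compared - matches) <= max_mismatches
```

A sample whose WGS-derived and MassARRAY genotypes disagree at even one of the 21 sites would be flagged under the default settings here, prompting a check for sample swaps or contamination.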

WGS data cleanup

Raw sequence reads are filtered for quality control using an in-house pipeline. Both reads of a pair are removed if either read (1) contains sequencing adapter, (2) has a low-quality base ratio (base quality less than or equal to 12) of more than 50%, or (3) has an unknown base (‘N’ base) ratio of more than 10%. Afterwards, fastp (V.0.20.0) is applied to filter out the remaining low-quality reads and bases,14 and downstream bioinformatics analyses are conducted on the qualified data.
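The three pair-level criteria can be written as a short filter. The sketch below is not the in-house pipeline: it assumes Phred+33 quality encoding, treats adapter detection as an exact substring match, and takes the adapter sequence as a caller-supplied argument because the actual adapter is not given in the text.

```python
LOW_QUALITY_MAX = 12      # base quality <= 12 counts as low quality (from the text)
LOW_QUALITY_RATIO = 0.50  # more than 50% low-quality bases fails the read
N_RATIO = 0.10            # more than 10% 'N' bases fails the read


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 encoding is an assumption)."""
    return [ord(char) - offset for char in quality]


def read_fails(seq: str, qual: str, adapter: str) -> bool:
    """Apply criteria (1) to (3) to a single read."""
    if not seq:
        return True
    if adapter in seq:  # (1) exact-match adapter detection, a simplification
        return True
    low = sum(score <= LOW_QUALITY_MAX for score in phred_scores(qual))
    if low / len(seq) > LOW_QUALITY_RATIO:  # (2)
        return True
    if seq.upper().count("N") / len(seq) > N_RATIO:  # (3)
        return True
    return False


def keep_pair(read1: tuple[str, str], read2: tuple[str, str], adapter: str) -> bool:
    """Keep the pair only if neither read fails; each read is (sequence, quality)."""
    return not (read_fails(*read1, adapter) or read_fails(*read2, adapter))
```

Because the pair is discarded when either mate fails, the two output files stay synchronised, which paired-end aligners require.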

Mapping and variant calling

The paired-end reads are processed under the Genome Analysis Toolkit (GATK) best practice guidance using Sentieon (release 201808.05, https://www.sentieon.com, bioRxiv 115717; doi:10.1101/115717).15 The reads are aligned to the hg38 human reference genome sequence, downloaded from the GATK bundle (ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz), using the Burrows-Wheeler Alignment tool as implemented in Sentieon. SNVs and insertion-deletions (indels) in regions of segmental duplication and on unassigned chromosomes are ignored in the downstream analyses. For each sample, the base quality, sequencing depth, GC (guanine-cytosine) content, mapping rate, mismatch rate, duplication rate and coverage are calculated. After duplicated reads are removed and base quality scores are recalibrated, SNVs and indels are first called for each individual using the Haplotyper of Sentieon and then jointly called across all participants. Variant quality score recalibration and hard filter methods are then applied to obtain high-quality variant calls for SNVs and indels. The ‘*.bam’ and ‘*.vcf’ files generated in the above procedures are retained for future research. Copy number variations (CNVs) and structural variations (SVs) in the genomes of patients are mainly called using GraphTyper2 and Manta.16 17
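Excluding variants that fall in particular regions, such as segmental duplications, amounts to an interval lookup. The following sketch shows one way this could be done; it is not the study's implementation. It assumes the excluded regions are supplied as BED-style (chromosome, 0-based start, end) tuples and that variant positions are 1-based, as in VCF. The exclusion of unassigned chromosomes is not modelled.

```python
from bisect import bisect_right


def build_index(intervals):
    """Merge overlapping intervals per chromosome.

    intervals: iterable of (chrom, start, end), 0-based and half-open.
    Returns a dict mapping chrom to (sorted starts, matching ends).
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        merged = []
        for start, end in spans:
            if merged and start <= merged[-1][1]:
                # Overlaps or touches the previous interval: extend it
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def is_excluded(index, chrom: str, pos: int) -> bool:
    """Return True if the 1-based position lies inside an excluded interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    zero_based = pos - 1
    i = bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]
```

Merging the intervals first means each lookup needs a single binary search, which matters when millions of variant sites are tested.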

Population genetics analysis

To minimise problems arising from hidden family and population structure among the participants, we conduct the following quality control steps. First, kinship is explored by calculating pairwise identity-by-descent estimates for all pairs of individuals using PLINK (V.1.9).18 The existence of first-degree and second-degree relationships is checked using KING (V.2.1.8).19 Second, population structure is investigated using STRUCTURE software and by conducting principal component analysis.20 All of these analyses are conducted using autosomal SNVs and indels.
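As an illustration of how pairwise kinship estimates translate into relationship degrees, the sketch below applies the kinship coefficient cut-offs commonly cited in the KING documentation (0.354, 0.177, 0.0884 and 0.0442). These cut-offs are not stated in this protocol and are used here as an assumption; the pair tuples are a hypothetical data layout.

```python
def relationship_degree(kinship: float) -> str:
    """Classify a pair from its estimated kinship coefficient."""
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "first-degree"
    if kinship > 0.0884:
        return "second-degree"
    if kinship > 0.0442:
        return "third-degree"
    return "unrelated"


def flag_close_pairs(pairs):
    """Return pairs inferred as duplicates or first/second-degree relatives.

    pairs: iterable of (id1, id2, kinship) tuples (hypothetical layout).
    """
    close = {"duplicate/MZ twin", "first-degree", "second-degree"}
    flagged = []
    for id1, id2, kinship in pairs:
        degree = relationship_degree(kinship)
        if degree in close:
            flagged.append((id1, id2, degree))
    return flagged
```

Pairs flagged in this way would typically be reviewed before association analyses, for example by retaining one member of each pair.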

Variant annotation

The impact of variants on protein-coding sequence, including protein-truncating variants, is predicted using the Ensembl Variant Effect Predictor.21 The pathogenicity of SNVs and indels is evaluated using InterVar software (V.2.0.1) under the guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.22 The potential impact of SNVs and indels on gene expression and regulation is investigated by reviewing GTEx, HaploReg and other databases or online tools.23 24 The impact of intronic and exonic variants on pre-messenger RNA splicing is mainly predicted using SpliceAI.25

The biological significance of known or common CNVs and SVs is annotated by reviewing dbVar and the Database of Genomic Variants.26 27 Novel CNVs and SVs are annotated by reviewing the literature in PubMed on the structure and function of the genes affected by the corresponding CNVs and SVs.

Checking and reviewing

During the experimental procedures of this project, all WBC and DNA loading, packaging, transferring and storing operations were conducted by one technician while being checked and supervised by another technician.

For WGS and SNV genotyping data, an MD5 code is generated for each data file before transfer and is checked after the transfer. The commands and code for WGS data mapping and variant calling are written by one bioinformatician and reviewed by another. The log files are also reviewed and retained.
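The MD5 check before and after transfer can be reproduced with the Python standard library. This is a minimal sketch; the function names and the way the expected checksum is supplied are assumptions and do not describe the study's actual tooling.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    large FASTQ or BAM files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path: str, expected_md5: str) -> bool:
    """Compare the digest of the received file with the one recorded before transfer."""
    return md5sum(path) == expected_md5.strip().lower()
```

A mismatch indicates that the file was truncated or corrupted in transit and should be transferred again. MD5 is used here only to detect accidental corruption, not as a security measure.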

All of the genetic information, clinical data and biospecimens are managed following the Regulations of the People’s Republic of China on Administration of Human Genetic Resources 2019.

Research projects

The WGS data of the 10K patients will be used to assess the causality of certain risk factors for stroke outcomes, to investigate pleiotropic effects of genes on multiple phenotypes, and to understand the genetic relationship between particular comorbidities and IS. The accurate sequencing data obtained at an average depth greater than 30× also allow us to obtain a panoramic view of individual-specific variation and the genetic structure of Chinese patients with IS or TIA. Some prespecified research topics are described below:

To draw a comprehensive genetic landscape of Chinese patients with IS or TIA, and to characterise their geographical and lifestyle differences and demographic origins;
To evaluate the genetic contribution to IS and its recurrent outcomes, especially the contribution of rare variants, CNVs and variants in certain regions of the genome (eg, telomeres and mitochondrial DNA);
To determine the causality of serum biomarkers for IS outcomes using association analyses and Mendelian randomisation;
To investigate the relationship between genetic features and brain imaging changes in IS;
To conduct pharmacogenomic analyses of certain secondary prevention strategies for IS;
To better understand the genetic mechanisms of IS with particular comorbidities (eg, chronic kidney disease, diabetes mellitus and hypertension).

Results

Among the 15 166 patients with IS or TIA in CNSR-III, 12 603 patients participated in the prespecified genetic substudy. Among them, 1308 participants did not provide enough WBCs. After DNA extraction and quality evaluation, the DNA of 381 participants was insufficient or unqualified. Therefore, a total of 1689 participants were excluded, and WGS is conducted for 10 914 participants of CNSR-III (figure 1). The workflow of WGS and downstream bioinformatics analyses is shown in figure 2.
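The patient flow can be verified with simple arithmetic. The short check below uses only the counts reported in the paragraph above; the variable names are illustrative.

```python
registry_total = 15_166    # patients with IS or TIA in CNSR-III
substudy = 12_603          # participants in the prespecified genetic substudy
insufficient_wbc = 1_308   # did not provide enough WBCs
unqualified_dna = 381      # DNA insufficient or unqualified after evaluation

excluded = insufficient_wbc + unqualified_dna
sequenced = substudy - excluded

# Share of the whole registry that undergoes WGS, in percent
sequenced_share = round(100 * sequenced / registry_total, 1)
```

Here `excluded` equals 1689 and `sequenced` equals 10 914, in agreement with the reported figures.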

Figure 2

Workflow of WGS and bioinformatics analyses. The first two rows show the process of DNA extraction, quality control, library construction and WGS. The third row shows the downstream bioinformatics analyses of WGS data. Some of the images are retrieved or adapted from Servier Medical Art (https://smart.servier.com/), which is licensed under a Creative Commons Attribution 3.0 Unported License. The photos of instruments are downloaded from the websites of BGI Genomics (https://www.bgi.com/), Thermo Fisher (https://www.thermofisher.com/) and Agena Bioscience (https://agenabio.com/), respectively. DNB, DNA Nanoball; IS, ischaemic stroke; ssCir DNA, single-stranded circle DNA; TIA, transient ischaemic attack; WGS, whole genome sequencing.

Baseline clinical characteristics of the 10 914 included patients and of the excluded patients are presented in table 1. The average age of the included patients was 62.2±11.3 years, and 31.4% were women. A total of 10 166 patients (93.2%) were diagnosed with IS, of whom 50.4% had minor stroke (National Institutes of Health Stroke Scale (NIHSS) score ≤3). A total of 31.8% of the included patients were current smokers, and 14.5% were heavy drinkers (defined as ≥2 standard alcoholic drinks per day). A total of 21.3% of the included patients had a history of IS, and 10.8%, 7.0% and 62.8% had a history of coronary heart disease, atrial fibrillation and hypertension, respectively. The included and excluded patients were balanced with regard to baseline characteristics (table 1).

Table 1

Baseline characteristics of the patients who underwent whole genome sequencing and of the remaining patients in CNSR-III

Discussion

Stroke is a complex disease with multiple aetiologies. Genetic and genomic studies in populations of diverse ancestry could refine our understanding of the molecular mechanisms of stroke. We therefore conduct WGS for 10 914 patients from CNSR-III. The WGS procedures and the baseline characteristics of the patients are reported in this study. The WGS of CNSR-III constructs a genomic data set that facilitates large-scale genetic analyses of IS in the Chinese population. The CNSR-III collected a comprehensive spectrum of phenotypic information under consistent and standardised criteria, which could increase the power and credibility of the genetic analyses. In addition, all of the patients are followed up for clinical outcomes,12 which provides an opportunity to discover genetic variants associated with patients’ outcomes after stroke.

In contrast to the DNA microarrays that were mainly used in previous genetic association studies of IS,4 10 the WGS technology applied in this study could provide nearly all of the SNVs and indels, and simultaneously capture genetic information on CNVs and SVs for each patient. WGS therefore enables a systematic evaluation of the genetic effect of rare variants (allele frequencies <1% in the population) on IS and TIA. As the contribution of rare variants remains one of the top challenges in stroke genetics, the WGS study would provide a better understanding of IS and TIA pathophysiology.10 The average depth for WGS is intended to be greater than 30× in this project because, at this depth, both accurate variant calling and cost-effectiveness could be achieved.28 29 Moreover, >95% of the genome could be covered by at least 10 sequencing reads, and >95% of the heterozygous variation could be accurately identified under this design.30 Therefore, the WGS could provide high-quality genetic data for further investigations of IS.
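The relationship between average depth and the fraction of sites covered by at least 10 reads can be illustrated with an idealised model. The sketch below assumes that per-site coverage follows a Poisson distribution with mean equal to the average depth. This is an assumption for illustration and is not the basis of the figures cited above; real coverage is more variable because of GC bias and regions of low mappability, so the model gives an optimistic estimate.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for a Poisson variable with the given mean."""
    if k < 0:
        return 0.0
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        total += term
    return total


def fraction_covered(min_reads: int, mean_depth: float) -> float:
    """Expected fraction of sites covered by at least min_reads reads."""
    return 1.0 - poisson_cdf(min_reads - 1, mean_depth)
```

Under this model, `fraction_covered(10, 30)` is well above 0.95, whereas the same threshold at an average depth of 5× would be met at only a small minority of sites.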

In conclusion, the WGS and genome-wide analyses of CNSR-III would help to refine our understanding of the genetic contribution to IS/TIA and stroke outcomes, and possibly lead to the discovery of novel therapeutic targets for secondary prevention.

Ethics statements

Ethics approval

The study was approved by the ethics committees of Beijing Tiantan Hospital and all other research centres according to the principles expressed in the Declaration of Helsinki.

References

1. GBD 2017 Causes of Death Collaborators. Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980-2017: a systematic analysis for the global burden of disease study 2017. Lancet 2018;392:1736–88. doi:10.1016/S0140-6736(18)32203-7 pmid:30496103
2. Wang Y, Li Z, Wang Y, et al. Chinese stroke center alliance: a national effort to improve healthcare quality for acute stroke and transient ischaemic attack: rationale, design and preliminary findings. Stroke Vasc Neurol 2018;3:256–62. doi:10.1136/svn-2018-000154 pmid:30637133
3. Bersano A, Markus HS, Quaglini S, et al. Clinical Pregenetic screening for stroke monogenic diseases: results from Lombardia GENS registry. Stroke 2016;47:1702–9. doi:10.1161/STROKEAHA.115.012281 pmid:27245348
4. Malik R, Chauhan G, Traylor M, et al. Multiancestry genome-wide association study of 520,000 subjects identifies 32 loci associated with stroke and stroke subtypes. Nat Genet 2018;50:524–37. doi:10.1038/s41588-018-0058-3 pmid:29531354
5. NINDS Stroke Genetics Network (SiGN), International Stroke Genetics Consortium (ISGC), Pulit SL, McArdle PF, Wong Q. Loci associated with ischaemic stroke and its subtypes (SiGN): a genome-wide association study. Lancet Neurol 2016;15:174–84. doi:10.1016/S1474-4422(15)00338-5 pmid:26708676
6. Neurology Working Group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium, the Stroke Genetics Network (SiGN), and the International Stroke Genetics Consortium (ISGC). Identification of additional risk loci for stroke and small vessel disease: a meta-analysis of genome-wide association studies. Lancet Neurol 2016;15:695–707. doi:10.1016/S1474-4422(16)00102-2 pmid:27068588
7. Traylor M, Farrall M, Holliday EG, et al. Genetic risk factors for ischaemic stroke and its subtypes (the METASTROKE collaboration): a meta-analysis of genome-wide association studies. Lancet Neurol 2012;11:951–62. doi:10.1016/S1474-4422(12)70234-X pmid:23041239
8. Holliday EG, Maguire JM, Evans T-J, et al. Common variants at 6p21.1 are associated with large artery atherosclerotic stroke. Nat Genet 2012;44:1147–51. doi:10.1038/ng.2397 pmid:22941190
9. International Stroke Genetics Consortium (ISGC), Wellcome Trust Case Control Consortium 2 (WTCCC2), Bellenguez C, et al. Genome-wide association study identifies a variant in HDAC9 associated with large vessel ischemic stroke. Nat Genet 2012;44:328–33. doi:10.1038/ng.1081 pmid:22306652
10. Dichgans M, Pulit SL, Rosand J. Stroke genetics: discovery, biology, and clinical applications. Lancet Neurol 2019;18:587–99. doi:10.1016/S1474-4422(19)30043-2 pmid:30975520
11. Sirugo G, Williams SM, Tishkoff SA. The missing diversity in human genetic studies. Cell 2019;177:26–31. doi:10.1016/j.cell.2019.02.048 pmid:30901543
12. Wang Y, Jing J, Meng X, et al. The third China national stroke registry (CNSR-III) for patients with acute ischaemic stroke or transient ischaemic attack: design, rationale and baseline patient characteristics. Stroke Vasc Neurol 2019;4:158–64. doi:10.1136/svn-2019-000242 pmid:31709123
13. Auton A, Brooks LD, 1000 Genomes Project Consortium, et al. A global reference for human genetic variation. Nature 2015;526:68–74. doi:10.1038/nature15393 pmid:26432245
14. Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018;34:i884–90. doi:10.1093/bioinformatics/bty560 pmid:30423086
15. Van der Auwera GA, Carneiro MO, Hartl C, et al. From FastQ data to high confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinformatics 2013;43:11.10.1-11.10.33. doi:10.1002/0471250953.bi1110s43 pmid:25431634
16. Eggertsson HP, Kristmundsdottir S, Beyter D, et al. GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat Commun 2019;10:5402. doi:10.1038/s41467-019-13341-9 pmid:31776332
17. Chen X, Schulz-Trieglaff O, Shaw R, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics 2016;32:1220–2. doi:10.1093/bioinformatics/btv710 pmid:26647377
18. Chang CC, Chow CC, Tellier LC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 2015;4:7. doi:10.1186/s13742-015-0047-8 pmid:25722852
19. Manichaikul A, Mychaleckyj JC, Rich SS, et al. Robust relationship inference in genome-wide association studies. Bioinformatics 2010;26:2867–73. doi:10.1093/bioinformatics/btq559 pmid:20926424
20. Hubisz MJ, Falush D, Stephens M, et al. Inferring weak population structure with the assistance of sample group information. Mol Ecol Resour 2009;9:1322–32. doi:10.1111/j.1755-0998.2009.02591.x pmid:21564903
21. McLaren W, Gil L, Hunt SE, et al. The Ensembl variant effect predictor. Genome Biol 2016;17:122. doi:10.1186/s13059-016-0974-4 pmid:27268795
22. Li Q, Wang K. InterVar: clinical interpretation of genetic variants by the 2015 ACMG-AMP guidelines. Am J Hum Genet 2017;100:267–80. doi:10.1016/j.ajhg.2017.01.004 pmid:28132688
23. GTEx Consortium. Human genomics. The Genotype-Tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 2015;348:648–60. doi:10.1126/science.1262110 pmid:25954001
24. Ward LD, Kellis M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res 2012;40:D930–4. doi:10.1093/nar/gkr917 pmid:22064851
25. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, et al. Predicting splicing from primary sequence with deep learning. Cell 2019;176:535–48. doi:10.1016/j.cell.2018.12.015 pmid:30661751
26. Lappalainen I, Lopez J, Skipper L, et al. DbVar and DGVa: public Archives for genomic structural variation. Nucleic Acids Res 2013;41:D936–41. doi:10.1093/nar/gks1213 pmid:23193291
27. MacDonald JR, Ziman R, Yuen RKC, et al. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res 2014;42:D986–92. doi:10.1093/nar/gkt958 pmid:24174537
28. Kishikawa T, Momozawa Y, Ozeki T, et al. Empirical evaluation of variant calling accuracy using ultra-deep whole-genome sequencing data. Sci Rep 2019;9:1784. doi:10.1038/s41598-018-38346-0 pmid:30741997
29. Rashkin S, Jun G, Chen S, et al. Optimal sequencing strategies for identifying disease-associated singletons. PLoS Genet 2017;13:e1006811. doi:10.1371/journal.pgen.1006811 pmid:28640830
30. Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456:53–9. doi:10.1038/nature07517 pmid:18987734

Footnotes

SC, ZX and YL are joint first authors.
Twitter @yilong
Contributors Study concept and design: SC, HaL and YoW. Drafting of the manuscript: SC, ZX and YL. Statistical analysis: AW, XH and ZX. Study supervision and organisation of the project: JL, YJ, XM, HaL, YiW and YoW. Supplying patients: ZW, GC, SW, ZJ, YC, XQ, JW, BS, WJ, ZA, WX, LZ, YG and HoL.
Funding This study was supported by grants from the Ministry of Science and Technology of the People’s Republic of China (2016YFC0901002, 2016YFC0901001), Beijing Municipal Science & Technology Commission (D171100003017002), Beijing Municipal Administration of Hospitals’ Mission Plan (SML20150502) and National Science and Technology Major Project (2017ZX09304018). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Competing interests None declared.
Provenance and peer review Not commissioned; internally peer reviewed.

