- Yue Yang - Lead
- Annie Nadkarni - Writer, Tech support
- Poojalakshmi Sreedhar - Liaison
- ChunHsuan LO - Writer, Tech support
- David Enoma - Tech support
- Alex Guo - SysAdmin
Our goal is to calculate disease-specific, patient-level polygenic risk scores (PRS) from GWAS summary statistics for different diseases.
A polygenic risk score (PRS) is a widely used way to model how the collective effect of many SNPs contributes to a phenotype. Because complex polygenic traits are associated with many SNPs, PRS is a powerful approach for linking individual genotype information to phenotype information.
PRS can be used to estimate how likely an individual is to develop a disease given their genotype data, which makes it applicable to clinical prognosis and genetic testing. With GWAS summary statistics available for many complex traits, polygenic risk scores can be readily calculated from individual variant-calling data.
In this manuscript, we built a pipeline that downloads GWAS summary statistics, computes PRS, constructs a predictive model, and visualizes the results. This analysis flow can be applied to various clinical outcomes and could potentially be used in clinical prognosis or genetic testing.
- GWAS summary statistics
- Genotype data
- Phenotype
- Patient-level:
  - PRS scores (.csv)
- Cohort-level:
  - Predictive model (.sav)
  - Probability of being affected by the disease (.csv + visualization)
  - Importance of each feature (PRS) in the phenotype prediction (.csv + visualization)
I. Data Acquisition and Preprocessing:
- Download the GWAS summary statistics from GWAS Atlas
- Edit the headers to be consistent with the nomenclature established by PRSice
- Filter out irrelevant features
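The header editing and feature filtering steps could look roughly like the sketch below. It assumes the target names are PRSice's default column names (`SNP`, `CHR`, `BP`, `A1`, `A2`, `BETA`, `P`); the source column names in `HEADER_MAP`, the `harmonize` helper, and the one-row example table are hypothetical and should be adapted to the header of the file actually downloaded.

```python
import csv
import io

# Hypothetical mapping from source column names to PRSice-style names.
# The keys are assumptions; check the header of the downloaded file.
HEADER_MAP = {
    "rsid": "SNP",
    "chromosome": "CHR",
    "position": "BP",
    "effect_allele": "A1",
    "other_allele": "A2",
    "beta": "BETA",
    "p_value": "P",
}


def harmonize(in_handle, out_handle, header_map=HEADER_MAP, delimiter="\t"):
    """Rename the mapped columns, drop all other columns, return the row count."""
    reader = csv.DictReader(in_handle, delimiter=delimiter)
    missing = [col for col in header_map if col not in reader.fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=list(header_map.values()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    n_rows = 0
    for row in reader:
        # Keep only the mapped columns, under their new names
        writer.writerow({new: row[old] for old, new in header_map.items()})
        n_rows += 1
    return n_rows


# Made-up one-row summary statistics table; the "se" column is dropped.
raw = (
    "rsid\tchromosome\tposition\teffect_allele\tother_allele\tbeta\tse\tp_value\n"
    "rs1\t1\t1000\tA\tG\t0.12\t0.03\t1e-6\n"
)
out = io.StringIO()
n_rows = harmonize(io.StringIO(raw), out)
harmonized_text = out.getvalue()
```

For real files, the same function can be called with handles from `open(...)` instead of the in-memory `io.StringIO` buffers used here.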
II. PRS Score Calculation:
- Match the summary statistics to the variants called in the genotype data
- Select variants using different clumping thresholds
- Calculate the PRS using the beta values from the summary statistics
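The core of the scoring step can be illustrated with a minimal sketch: for each sample, sum beta times effect-allele dosage over the selected variants that are present in both the summary statistics and the genotype data. This is a simplification, not the pipeline's implementation. It selects variants by p-value threshold only and omits the LD clumping that the real workflow performs, and the `compute_prs` function, variant IDs, betas, and dosages are all made up for illustration.

```python
def compute_prs(summary_stats, genotypes, p_threshold):
    """Sum beta * dosage over variants that pass the p-value threshold.

    summary_stats: {snp_id: (beta, p_value)}
    genotypes:     {sample_id: {snp_id: effect-allele dosage (0, 1 or 2)}}
    """
    # Variant selection: p-value thresholding only (no LD clumping here)
    selected = {
        snp: beta for snp, (beta, p) in summary_stats.items() if p <= p_threshold
    }
    scores = {}
    for sample, dosages in genotypes.items():
        # Only variants found in both data sets contribute to the score
        scores[sample] = sum(
            beta * dosages[snp] for snp, beta in selected.items() if snp in dosages
        )
    return scores


# Made-up example data; rs4 is absent from the genotype data and is skipped.
summary_stats = {
    "rs1": (0.5, 1e-8),
    "rs2": (-0.25, 1e-3),
    "rs3": (0.125, 0.2),
    "rs4": (1.0, 1e-9),
}
genotypes = {
    "Person_01": {"rs1": 2, "rs2": 1, "rs3": 0},
    "Person_02": {"rs1": 0, "rs2": 2, "rs3": 1},
}

prs_strict = compute_prs(summary_stats, genotypes, p_threshold=5e-8)
prs_lenient = compute_prs(summary_stats, genotypes, p_threshold=0.05)
```

Running the function at several thresholds yields one PRS per threshold for each sample, which is one way a single trait could contribute multiple features to the downstream model.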
III. Clinical Outcome Prediction:
- Randomly generate a phenotype label for each sample
- Train a random forest classifier to predict the clinical outcome from the calculated cohort PRS scores, and save the trained model
- Predict the probability of each individual being affected by the disease from their PRS scores, and compute the impurity-based importance of each feature (each PRS score) to the prediction
- Visualize the disease probabilities and PRS score importances for explainability
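A minimal sketch of the prediction stage with scikit-learn is shown below. The PRS matrix is synthetic, the labels are random as described above, and the cohort size, number of PRS features, file name `prs_model.sav`, and use of `pickle` for saving are assumptions made for illustration. Plotting is left out.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_prs = 100, 4                  # assumed cohort size and PRS count
X = rng.normal(size=(n_samples, n_prs))    # synthetic PRS matrix (samples x scores)
y = rng.integers(0, 2, size=n_samples)     # randomly generated phenotype labels

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Probability of the "affected" class (label 1) for each sample
affected_col = list(model.classes_).index(1)
disease_prob = model.predict_proba(X)[:, affected_col]

# Impurity-based importance of each PRS feature
importances = model.feature_importances_

# Save and reload the trained model (pickle is an assumption for the .sav file)
model_path = os.path.join(tempfile.mkdtemp(), "prs_model.sav")
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)
```

Because the labels here are random, the probabilities and importances carry no biological meaning; the sketch only shows how the pieces fit together.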
To use this tool, please run the DNAnexus workflows for PRS Computing and Phenotype Predictions.
- Example input (for the working pipelines): PLINK BED files and the corresponding clinical outcome for each individual
- Example output: PRS scores for each individual and an outcome prediction model trained on the PRS scores
Disease probability predicted for each sample:
Distribution of the probability of each sample being affected by the disease:
Distribution of the importance of each PRS score to the prediction:
PRS_value of each feature for the example Person_01:
PRS_value of each feature for the example Person_02:
PRS_value of each feature for the example Person_03:
PRS_value of each feature for the example Person_04:
Parameters of the prediction model (PRS weighting parameters for calculating the disease probability):
PRS_feature_importance as weighting-parameters of PRS_value for the predictive model:
- DNAnexus documentation: https://documentation.dnanexus.com/developer/apps/execution-environment/connecting-to-jobs
- bigsnpR: https://cran.rstudio.com/web/packages/bigsnpr/bigsnpr.pdf
- Scikit-learn: https://scikit-learn.org/stable/