TIGAR stands for Transcriptome-Integrated Genetic Association Resource, which is developed using Python and BASH scripts for conducting Transcriptome-Wide Association Studies (TWAS) by training gene expression imputation models by nonparametric Bayesian Dirichlet Process Regression (DPR) and Elastic-Net (PrediXcan) methods with reference transcriptomic panels.
- TIGAR can train gene expression imputation models by both nonparametric Bayesian DPR and Elastic-Net (PrediXcan) methods with reference dataset that contain transcriptomic and genetic data of the same samples.
- Impute Genetically Regulated gene eXpression (GReX) from Individual-level genetic data.
- Conduct TWAS using both Individual-level and Summary-level GWAS data for studying Univariate and Multivariate phenotypes. Both FUSION and S-PrediXcan Z-score test statistics will calculated in the TWAS results.
- Conduct VC-TWAS using Individual-level and Summary-level GWAS data using cis-eQTL effect sizes estimated from reference panels for Univariate phenotypes.
- Note: please use our most recently updated scripts on Github to conduct TWAS.
- Cis-eQTL Effect Sizes (i.e., weights) estimated from GTEx V8 reference data of 48 tissue types by Nonparametric Bayesian DPR method, and Reference LD Files from GTEx V8 EUR samples, are available from Synapse:syn16804296.
- Please cite our TIGAR papers:
- Example Usage
- Updates
- Reference
- Installation:
# create the environment tigarenv
conda create --name tigarenv python=3.5 pandas numpy scipy scikit-learn statsmodels
# deactivate the conda environment
conda deactivate
- Activate Environment:
# activate the environment
conda activate tigarenv
# set the PYTHONPATH
export PYTHONPATH=${CONDA_PREFIX}/lib/python3.5/site-packages/:$PYTHONPATH
After running TIGAR use conda deactivate
to deactivate the conda environment.
- Installation:
# install pip
# install virtualenv
python3 -m pip install --user virtualenv
# cd to preferred install_directory
cd ${install_dir}
# create the virtual environment tigarenv in the current directory
python3 -m virtualenv tigarenv --python=python3.5
# activate the environment
source ${install_dir}/tigarenv/bin/activate
# install the packages
python3 -m pip install numpy==1.15.2 pandas==0.23.4 scikit-learn==0.20.0 scipy==1.1.0 statsmodels==0.9.0
# deactivate the environment
deactivate
- Activate Environment:
# activate the environment
source ${install_dir}/tigarenv/bin/activate
# set the PYTHONPATH
PYTHONPATH=${install_dir}/tigarenv/lib/python3.5/site-packages/:$PYTHONPATH
After running TIGAR use deactivate
to deactivate the environment.
- Make
*.sh
files in the TIGAR directory executable
TIGAR_dir="/home/jyang/GIT/TIGAR"
chmod 755 ${TIGAR_dir}/*.sh
- Make
${TIGAR_dir}/Model_Train_Pred/DPR
file executable
chmod 755 ${TIGAR_dir}/Model_Train_Pred/DPR
Example input files provided under ./ExampleData/
are generated artificially. All input files are Tab Delimited Text Files.
VCF (Variant Call Format)
--genofile_type vcf
--format GT
--format DS
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | Sample1 | Sample... |
---|---|---|---|---|---|---|---|---|---|---|
1 | 100 | rs1 | C | T | . | PASS | . | GT:DS | 0/0:0.01 | ... |
- Sorted by chromosome and base pair position, bgzipped by
bgzip
, and tabixed bytabix
- The values in the CHROM column should not start with "chr"
- Example tabix command for a VCF file:
tabix -f -p vcf *.vcf.gz
. - First 9 columns are Chromosome number, Base pair position, Variant ID, Reference allele, Alternative allele, Quality score, Filter status, Variant information, Genotype data format
- Genotype data starts from the 10th column. Each column denotes the genotype data per sample.
- Example:
./ExampleData/example.vcf.gz
--genofile_type dosage
CHROM | POS | ID | REF | ALT | Sample1 | Sample... |
---|---|---|---|---|---|---|
1 | 100 | rs** | C | T | 0.01 | ... |
- First 5 columns have the same format as the first 5 columns of a VCF file.
- Dosage genotype data start from the 6th column that are values ranging from 0 to 2, denoting the number of minor alleles. Each column denotes the genotype data per sample.
--genofile
--genofile_type
--format
- Train Gene Expression Imputation Models
- Predict GReX
- Generate Reference LD Genotype Covariance Files
- VC-TWAS
- Headerless, single-column file containing sampleIDs to use.
- Examples:
./ExampleData/sampleID.txt
./ExampleData/test_sampleID.txt
--train_sampleID
- Train Gene Expression Imputation Models
--test_sampleID
- Predict GReX
- VC-TWAS
--sampleID
- Generate Reference LD Genotype Covariance Files
CHROM | GeneStart | GeneEnd | TargetID | GeneName | Sample1 | Sample... |
---|---|---|---|---|---|---|
1 | 100 | 200 | ENSG0000 | X | 0.2 | ... |
- First 5 columns are Chromosome number, Gene start position, Gene end position, Target gene ID, Gene name (optional, could be the same as Target gene ID).
- Gene expression data start from the 6th column. Each column denotes the corresponding gene expression value per sample.
- Example:
./ExampleData/gene_exp.txt
--gene_exp
- Train Gene Expression Imputation Models
- TWAS (with individual-level GWAS data)
CHROM | GeneStart | GeneEnd | TargetID | GeneName |
---|---|---|---|---|
1 | 100 | 200 | ENSG0000 | X |
- Provides a list of genes for TWAS
- Same format as the first five columns of the Gene Expression File.
- Example:
./ExampleData/gene_anno.txt
--gene_anno
- Predict GReX
- TWAS (with summary-level GWAS data)
- VC-TWAS
CHROM | POS | REF | ALT | TargetID | ES |
---|---|---|---|---|---|
1 | 100 | C | T | ENSG0000 | 0.2 |
- Contain cis-eQTL effect sizes (i.e., SNP weights) estimated from reference data for TWAS. The output
*_eQTLweights.txt
file by./TIGAR_Model_Train.sh
can be used here as the variant weight file. - First 5 columns must be be: Chromosome number, Base pair position, Reference allele, Alternative allele, and Target gene ID.
- Must contain a column named
ES
to denote variant weights (estimated eQTL effect sizes from a reference dataset) with for the gene (TargetID). - Each row denotes the variant weight (eQTL effect size) for testing the association of the gene (TargetID) with a phenotype
- Variants will be matched with those from the Zscore file by their unique
CHROM:POS:REF:ALT
snpID - Example:
./ExampleData/eQTLweights.txt
--weight
- Predict GReX
- TWAS (with summary-level GWAS data)
- VC-TWAS
FAM_ID | IND_ID | FAT_ID | MOT_ID | SEX | PHENO | COV1 | COV... |
---|---|---|---|---|---|---|---|
11A | 11A | X | X | 1 | 0.2 | 0.3 | ... |
- Phenotype PED File
- First five columns are Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1=male; 2=female; other=unknown). The combination of family and individual ID should uniquely identify a person.
- Phenotype is the 6th column. A PED file must have 1 and only 1 phenotype in the 6th column.
- Other covariates start from the 7th column
- Example:
./ExampleData/example_PED.ped
--PED
- TWAS (with individual-level GWAS data)
- VC-TWAS
P | PHENO |
C | COV1 |
C | COV2 |
C | SEX |
- Phenotype and Covariate Information File
- Used to specify phenotype columns and covariate columns in the
PED
file to use for the TWAS. - Headerless, two-column, tab-delimited file. Specify one column per row. The first column of each row is the type (
P
for phenotype,C
for covariate) and the second column in each row is the name of the column in thePED
file. - Example:
./ExampleData/PED_Info_*.txt
--PED_info
- TWAS (with individual-level GWAS data)
- VC-TWAS
CHROM | Start | End |
---|---|---|
1 | 100 | 20000 |
- Genome block annotation file used for generating the reference LD covariance files (the LD file is required for TWAS with summary-level GWAS statistics)
- A tab-delimited text file with 3 columns
CHROM Start End
, denoting the chromosome number, block start position, and block ending position - Block annotation files can be created from genome segmentation files generated by LDetect
- Example:
./ExampleData/example_genome_block_CHR1.txt
--genom_block
- Generate Reference LD Genotype Covariance Files
- Contains LD covariance coefficients generated from reference data.
- Used for TWAS with GWAS summary statistics
- LD files for TWAS should be made with the same reference data used for model training.
--LD
- TWAS (with summary-level GWAS data)
CHROM | POS | REF | ALT | Zscore |
---|---|---|---|---|
1 | 100 | C | T | 0.01 |
- Contains GWAS summary statistics (Zscore) for TWAS
- First 4 columns have the same format as the first 4 columns of a VCF file.
- The name of the column containing the Zscore statistic values to use must be
Zscore
. - The file must be sorted by chromosome and base pair position, bgzipped by
bgzip
, and tabixed bytabix
. Example tabix command:tabix -f -p vcf *_Zscore.txt.gz
. - Example:
./ExampleData/CHR1_GWAS_Zscore.txt.gz
--Zscore
- TWAS (with summary-level GWAS data)
| CHROM | POS | REF | ALT | BETA | SE | |:-----:|:---:|:---:|:---:|:------:| | 1 | 100 | C | T | 0.01 | 0.52 |
- Contains GWAS summary statistics (BETA) and (SE) for TWAS
- First 4 columns have the same format as the first 4 columns of a VCF file.
- The name of the column containing the BETA and SE to use must be
BETA
andSE
. - The file must be sorted by chromosome and base pair position, bgzipped by
bgzip
, and tabixed bytabix
. Example tabix command:tabix -f -p vcf *_GWAS.txt.gz
. - Example:
./ExampleData/sample_GWAS_Result.txt.gz
--GWAS_result
- VC-TWAS (with summary-level GWAS data)
--model
: Gene expression imputation model:elastic_net
orDPR
--gene_exp
: Path to Gene annotation and Expression file--train_sampleID
: Path to a file with sampleIDs that will be used for training--chr
: Chromosome number need to be specified with respect to the genotype input data--genofile
: Path to the training genotype file (bgzipped and tabixed)--genofile_type
: Genotype file type:vcf
ordosage
--format
: (Required ifgenofile_type
isvcf
) Genotype data format that should be used:GT
orDS
GT
: genotype dataDS
: dosage data
--missing rate
: Missing rate threshold. If the rate of missing values for a SNP exceeds this value the SNP will be excluded. Otherwise, missing values for non-excluded SNPs will be imputed as the mean. (default:0.2
)--maf
: Minor Allele Frequency threshold (ranges from 0 to 1) to exclude rare variants (default:0.01
--hwe
: Hardy Weinberg Equilibrium p-value threshold to exclude variants that violated HWE (default:0.00001
)--window
: Window size (in base pairs) around gene region from which to include SNPs (default:1000000
[+- 1MB
region around gene region])--cvR2
: Whether to perform 5-fold cross validation by average R2:0
or1
(default:1
)0
: Skip 5-fold cross validation evaluation1
: Do 5-fold cross validation and only do final training if the average R2 of the 5-fold validations is at least >= 0.005
--thread
: Number of simultaneous processes to use for parallel computation (default:1
)--out_dir
: Output directory (will be created if it does not exist)--TIGAR_dir
: Specify the directory of TIGAR source code
--dpr
: Bayesian inference algorithm used by DPR:1
or2
(default:1
)1
: Variational Bayesian, faster but may be less accurate2
: MCMC, slower but more accurate
--ES
: Output effect size type:fixed
oradditive
(default:fixed
)fixed
: use fixed effects onlyadditive
: use the sum of fixed and random effects
# Setup input file paths
Gene_Exp_train_file="${TIGAR_dir}/ExampleData/gene_exp.txt"
train_sample_ID_file="${TIGAR_dir}/ExampleData/sampleID.txt"
genofile="${TIGAR_dir}/ExampleData/example.vcf.gz"
out_dir="${TIGAR_dir}/ExampleData/output"
# Call TIGAR model training shell script
${TIGAR_dir}/TIGAR_Model_Train.sh \
--model DPR \
--gene_exp ${Gene_Exp_train_file} \
--train_sampleID ${train_sample_ID_file} \
--chr 1 \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--maf 0.01 \
--hwe 0.0001 \
--cvR2 1 \
--dpr 1 \
--ES fixed \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
--cv
: Number of cross validation folds for tuning elastic-net penalty parameter (default:5
)--alpha
: Fixed L1 & L2 penalty ratio for elastic-net model (default:0.5
)- alpha=0: equivalent to lasso regression
- alpha=1: equivalent to ridge regression
Gene_Exp_train_file="${TIGAR_dir}/ExampleData/gene_exp.txt"
train_sample_ID_file="${TIGAR_dir}/ExampleData/sampleID.txt"
genofile="${TIGAR_dir}/ExampleData/example.vcf.gz"
out_dir="${TIGAR_dir}/ExampleData/output"
${TIGAR_dir}/TIGAR_Model_Train.sh \
--model elastic_net \
--gene_exp ${Gene_Exp_train_file} \
--train_sampleID ${train_sample_ID_file} \
--chr 1 \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--maf 0.01 \
--hwe 0.0001 \
--cvR2 1 \
--alpha 0.5 \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
CHR${chr}_${model}_eQTLweights.txt
is the file storing all eQTL effect sizes (ES
) estimated from the gene expression imputation model per gene (TargetID
)CHR${chr}_${model}_GeneInfo.txt
is the file storing information about the fitted gene expression imputation model per gene (per row), including gene annotation (CHROM GeneStart GeneEnd TargetID GeneName
), sample size, number of SNPs used in the model training (n_snp
), number of SNPs with non-zero eQTL effect sizes (n_effect_snp
), imputation R2 by 5-fold cross validation (CVR2
), imputation R2 using all given training samples (TrainR2
)CHR${chr}_${model}_train_log.txt
is the file storing all log messages for model training.
Predict GReX value with given variant weights (eQTL effect sizes) from trained gene expression imputation models and individual-level genotype data of test samples
--gene_anno
: Gene annotation file to specify the list of genes, which is of the same format as the first five columsn of gene expression file--test_sampleID
: Path to a file with sampleIDs that should be contained in the genotype file which should be used for the prediction--chr
: Chromosome number need to be specified with respect to the genotype input data--weight
: Path to SNP weight (eQTL effect size) file--genofile
: Path to the training genotype file (bgzipped and tabixed)--genofile_type
: Genotype file type:vcf
ordosage
--format
: (Required ifgenofile_type
isvcf
) Genotype data format that should be used:GT
orDS
GT
: genotype dataDS
: dosage data
--window
: Window size (in base pairs) around gene region from which to include SNPs (default:1000000
[+- 1MB
region around gene region])--missing rate
: Missing rate threshold. If the rate of missing values for a SNP exceeds this value the SNP will be excluded. Otherwise, missing values for non-excluded SNPs will be imputed as the mean. (default:0.2
)--maf_diff
: MAF difference threshold. If the difference in MAF between a matching SNP in the eQTL weight file and test genotype file is greater thanmaf_diff
, that SNP will be excluded. (default:0.2
)--thread
: Number of simultaneous processes to use for parallel computation (default:1
)--out_dir
: Output directory (will be created if it does not exist)--TIGAR_dir
: Specify the directory of TIGAR source code
CHR${chr}_Pred_GReX.txt
: File containing predicted GReX values in the same format as the input gene expression fileCHR${chr}_Pred_log.txt
: File containing log messages
gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
test_sample_ID_file="${TIGAR_dir}/ExampleData/test_sampleID.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"
${TIGAR_dir}/TIGAR_GReX_Pred.sh \
--gene_anno ${gene_anno_file} \
--test_sampleID ${test_sample_ID_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
--asso
: Specify TWAS type:1
or2
1
: individual-level TWAS using individual-level phenotype data and predicted GReX (generated byTIGAR_GReX_Pred.sh
)2
: summary-level TWAS using GWAS summary Z-score statistics and reference LD (generated byTIGAR_LD.sh
)
--thread
: Number of simultaneous processes to use for parallel computation (default:1
)--out_dir
: Output directory (will be created if it does not exist)--TIGAR_dir
: Specify the directory of TIGAR source code
--gene_exp
: Path to predicted GReX file--PED
: Path to PED file that contains phenotype and covariate data--PED_info
: A two-column, tab-delimited file specifying phenotype columns and covariate columns in thePED
file to use for the TWAS. Specify one column per row. Each row should start with the type of column (P
for phenotype,C
for covariate) and the column name.--method
: Method to use based on type of phenotype(s):OLS
orLogit
(default:OLS
)OLS
: quantitative phenotype or multivariate phenotypesLogit
: dichotomous univariate phenotype
indv_${method}_assoc.txt
indv_OLS_TWAS_log.txt
: File containing log messages
GReX_pred_file="${TIGAR_dir}/ExampleData/CHR1_Pred_GReX.txt"
PED="${TIGAR_dir}/ExampleData/example_PED.ped"
PED_info="${TIGAR_dir}/ExampleData/PED_Info_SinglePheno.txt"
${TIGAR_dir}/TIGAR_TWAS.sh \
--asso 1 \
--gene_exp ${GReX_pred_file} \
--PED ${PED} \
--PED_info ${PED_info} \
--method OLS \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
--gene_anno
: Path to gene annotation file to specify a list of gene for TWAS. The first six columns of the gene expression file can be used here.--chr
: Chromosome number need to be specified to conduct TWAS per chromosome--weight
: Path to SNP weight (eQTL effect size) file. Output file_eQTLweights.txt
from model training can be used here.--Zscore
: Path to GWAS summary Zscore statistics.--LD
: Path to reference LD (SNP genotype covariance matrix) that should be bgzipped and tabixed. Can be generated by using our${TIGAR_dir}/TWAS/Get_LD.py
script with individual-level genotype data from a reference sample.- LD files for TWAS should be made with the same reference data used for model training.
- The tool assumes that the reference/alternate allele orientation of SNPs in the weight file match the LD file. Zscore SNPs are "flipped" to match the weight file.
--window
: Window size (in base pairs) around gene region from which to include SNPs (default:1000000
[+- 1MB
region around gene region])--weight_threshold
: include only SNPs with magnitude of weight greater than this value when conducting TWAS (default:0
)--test_stat
: burden Z test statistic to calculate:FUSION
,SPrediXcan
, orboth
(defaultboth
)FUSION
: calculate output Zscore and Pvalue results based on method used by FUSIONSPrediXcan
: calculate output Zscore and Pvalue results based on method used by S-PrediXcanboth
: calculate and output Zscore and Pvalue results for both methods
CHR${chr}_sumstat_assoc.txt
CHR${chr}_TWAS_log.txt
gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"
Zscore_file="${TIGAR_dir}/ExampleData/CHR1_GWAS_Zscore.txt.gz"
LD_file="${TIGAR_dir}/ExampleData/CHR1_reference_cov.txt.gz"
${TIGAR_dir}/TIGAR_TWAS.sh \
--asso 2 \
--gene_anno ${gene_anno_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--Zscore ${Zscore_file} \
--LD ${LD_file} \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
Reference LD genotype covaraince file can be generated using reference genotype data, which should be generated per chromosome with corresponding genome segmentation block anntoation file.
LD files for TWAS should be made with the same reference data used for model training.
--genom_block
: Path to genome segmentation block annotation based on LD--sampleID
: Path to a file with sampleIDs to be used for generating reference LD files.--chr
: Specify chromosome number of the given reference genotype file--genofile
: Path to the reference genotype file (bgzipped and tabixed). It is recommended that genotype files with data for multiple chromosomes be split into per-chromosome files.--genofile_type
: Genotype file type:vcf
ordosage
--format
: (Required ifgenofile_type
isvcf
) Genotype data format that should be used:GT
orDS
GT
: genotype dataDS
: dosage data
--maf
: Minor Allele Frequency threshold (ranges from 0 to 1) to exclude rare variants (default:0.01
)--thread
: Number of simultaneous processes to use for parallel computation (default:1
)--out_dir
: Output directory (will be created if it does not exist)--TIGAR_dir
: Specify the directory of TIGAR source code
CHR${chr}_reference_cov.txt.gz
CHR${chr}_reference_cov.txt.gz.tbi
CHR${chr}_LD_log.txt
block_annotation="${TIGAR_dir}/ExampleData/example_genome_block_CHR1.txt"
sample_id="${TIGAR_dir}/sampleID.txt"
${TIGAR_dir}/TIGAR_LD.sh \
--genome_block ${block_annotation} \
--sampleID ${sample_id}
--chr 1 \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--maf 0.01 \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
VC-TWAS with cis-eQTL effect sizes estimated from nonparametric Bayesian DPR method for Univariate phenotypes.
--gene_anno
: Gene annotation file to specify the list of genes, which is of the same format as the first five columns of gene expression file--weight
: Path to SNP weight (eQTL effect size) file. The weight file must be generated using the DPR model.--weight_threshold
: Threshold for estimated cis-eQTL effect sizes, filter SNPs with absolute cis-eQTL effect sizes smaller than threshold--chr
: Chromosome number need to be specified with respect to the genotype input data--window
: Window size around gene region from which to include SNPs (default1000000
for+- 1MB
region around gene region)--thread
: Number of simultaneous processes to use for parallel computation (default:1
)--out_dir
: Output directory (will be created if it does not exist)--TIGAR_dir
: Specify the directory of TIGAR source code
--PED
: Path to PED file that contains phenotype and covariate data--PED_info
: A two-column, tab-delimited file specifying phenotype columns and covariate columns in thePED
file to use for the TWAS. Specify one column per row. Each row should start with the type of column (P
for phenotype,C
for covariate) and the column name.--test_sampleID
: Path to a file with sampleIDs that should be contained in the genotype file--genofile
: Path to the training genotype file (bgzipped and tabixed)--genofile_type
: Genotype file type:vcf
ordosage
--format
: (Required ifgenofile_type
isvcf
) Genotype data format that should be used:GT
orDS
GT
: genotype dataDS
: dosage data
--maf
: Minor Allele Frequency threshold (ranges from 0 to 1) to exclude rare variants (default:0.01
))--phenotype_type
: phenotype type:C
orD
C
: continous phenotypeD
: binomial (dichotomous) phenotype
CHR${chr}_indv_VC_TWAS.txt
: File containing VC_TWAS results with TargetID gene informationCHR${chr}_VC_TWAS_log.txt
: File containing log messages
gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
PED_info="${TIGAR_dir}/ExampleData/PED_Info_SinglePheno.txt"
test_sample_ID_file="${TIGAR_dir}/ExampleData/test_sampleID.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt"
genofile="${TIGAR_dir}/ExampleData/example.vcf.gz"
# continuous phenotype
PED="${TIGAR_dir}/ExampleData/example_PED.ped"
${TIGAR_dir}/VC_TWAS.sh \
--gene_anno ${gene_anno_file} \
--PED ${PED} \
--PED_info ${PED_info} \
--test_sampleID ${test_sample_ID_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--weight_threshold 0.0001 \
--phenotype_type C \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
# dichotomous phenotype
PED="${TIGAR_dir}/ExampleData/example_PED_binary.ped"
${TIGAR_dir}/VC_TWAS.sh \
--gene_anno ${gene_anno_file} \
--PED ${PED} \
--PED_info ${PED_info} \
--test_sampleID ${test_sample_ID_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--weight_threshold 0.0001 \
--phenotype_type D \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}
--GWAS_result
: Path to GWAS result file--sample_size
: Sample size of summary-level GWAS data--LD
: Path to Reference LD genotype covariance file
CHR${chr}_sum_VC_TWAS.txt
: File containing VC_TWAS results with TargetID gene informationCHR${chr}_sum_VC_TWAS_log.txt
: File containing log messages
gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"
ld_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"
gwas_file="${TIGAR_dir}/ExampleData/sample_GWAS_Result.txt.gz"
${TIGAR_dir}/VC_TWAS_summary.sh \
--gene_anno ${gene_anno_file} \
--GWAS_result ${gwas_file} \
--Weight ${eQTL_ES_file} \
--sample_size ${sample_size} \
--weight_threshold 0.0001 \
--LD ${ld_file} \
--chr 1 \
--window 1000000 \
--thread 1 \
--out_dir ${out_dir}
Association study using cis- and trans-eQTL effect sizes estimate by BGW-TWAS.
--gene_list
: Gene list file to specify the names of genes--weight_prefix
: Start of path to SNP weight (eQTL effect size) file generated by BGTWAS. This string is the path (including directory) before the gene name.--weight_suffix
: End of the path to SNP weight (eQTL effect size) file generated by BGTWAS. This string is the path (including extension) after the gene name.--LD_prefix
: Start of path to LD file generated by TIGAR. This string is the path (including directory) before the chromosome number.--LD_suffix
: End of the path to LD file generated by TIGAR. This string is the path (including extension) after the chromosome number.--Zscore_prefix
: Start of path to Zscore summary statistics file. This string is the path (including directory) before the chromosome number.--Zscore_suffix
: End of the path to Zscore summary statistics file. This string is the path (including extension) after the chromosome number.--weight_threshold
: Threshold for estimated cis-eQTL effect sizes, filter SNPs with absolute cis-eQTL effect sizes smaller than threshold. (default:0
; Note that BGWTWAS already filters weights.)--thread
: Number of simultaneous processes to use for parallel computation (default:1
)--out_dir
: Output directory (will be created if it does not exist)--out_prefix
: Start of log and output file names. (default:BGWTWAS
)--out_twas_file
: Output file name. (default:${out_prefix}_asso.txt
,BGWTWAS_asso.txt
)--out_log_file
: Output log file. (default:${out_prefix}_log.txt
,BGWTWAS_log.txt
)--TIGAR_dir
: Specify the directory of TIGAR source code
out_dir=${TIGAR_dir}/ExampleData/
gene_list=${TIGAR_dir}/ExampleData/gene_list.txt
weight_prefix='${TIGAR_dir}/ExampleData/BGWTWASweights/'
weight_suffix='_BGW_eQTL_weights.txt'
LD_prefix='${TIGAR_dir}/ExampleData/CHR'
LD_suffix='_reference_cov.txt.gz'
Zscore_prefix='${TIGAR_dir}/ExampleData/CHR'
Zscore_suffix='_Zscore.txt.gz'
python ${TIGAR_dir}/TWAS/BGWTWAS_Asso_Study.py \
--gene_list ${gene_list} \
--weight_prefix ${weight_prefix} \
--weight_suffix ${weight_suffix} \
--Zscore_prefix ${Zscore_prefix} \
--Zscore_suffix ${Zscore_suffix} \
--LD_prefix ${LD_prefix} \
--LD_suffix ${LD_suffix} \
--thread 1 \
--TIGAR_dir ${TIGAR_dir} \
--out_dir ${out_dir}
- added
--missing_rate
option (for excluding SNPs with many missing values) to model training and prediction scripts(default:0.2
) - weight files now automatically sorted and bgzipped/tabixed during training step
- Removed 'dfply' Python package dependency
- Added VC-TWAS method
- Added
SPrediXcan
test statistic calculation to TWAS with summary-level GWAS data (in addition to existingFUSION
test statistic); added option (--test_stat
) to select which test statistic to use (default:both
) - Added
--weight-threshold
option (for excluding SNPs with small effect-sizes) to summary-level TWAS (default:0
) - Added
--sampleID
argument toTIGAR_LD.sh
- Reduced size of output LD files
- Improved memory usage of Python scripts (especially for model-training and TWAS with summary-level GWAS data)
- Improved speed (especially for model-training)
- Improved error handling for parallelized functions
- Added/improved log output in all Python scripts
- Added time elapsed calculation for logging
- Removed intermediate
call_DPR.sh
script - Added script to do summary-level association test with BGWTWAS weights.