This repository contains information and scripts for characterizing uncertainty in the predictions of genomic sequence-to-activity models. It uses scripts from the Basenji repository to train a deep ensemble of Basenji2 models and to generate and evaluate their predictions.
Command used to train each model replicate:
basenji_train.py -k -o ${out_dir}/train/rep_${replicate_model}/ ${out_dir}/models/params_human.json ${data_dir}/human
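Because the ensemble is built by repeating this command once per replicate, the loop can be sketched as below. This is a minimal dry-run sketch, not the authors' launcher: the ensemble size (`n_replicates`), the example paths, and the `run` helper are assumptions introduced for illustration. With `DRY_RUN=1` the commands are only printed.

```shell
# Hypothetical values: the ensemble size and paths are not stated in this README.
out_dir="./output"
data_dir="./data"
n_replicates=3
DRY_RUN=1   # set to 0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

n_cmds=0
replicate_model=1
while [ "${replicate_model}" -le "${n_replicates}" ]; do
  # Same command as above, once per replicate.
  run basenji_train.py -k \
    -o "${out_dir}/train/rep_${replicate_model}/" \
    "${out_dir}/models/params_human.json" \
    "${data_dir}/human"
  n_cmds=$((n_cmds + 1))
  replicate_model=$((replicate_model + 1))
done
```

Each replicate writes to its own `rep_${replicate_model}` directory, so the trained models stay separate for the later prediction steps.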
Necessary data and resources:
- Basenji2 training, validation and test data can be downloaded from Google Cloud (link). Note: the data is ~320 GB and is stored in a requester-pays bucket.
- params_human.json can be found here
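Downloading from a requester-pays bucket requires naming a billing project. The sketch below shows one way to do that with `gsutil`, whose top-level `-u` option sets the project to bill. The project ID, the bucket path, and the `run` helper are placeholders (assumptions); substitute the real bucket from the link above. With `DRY_RUN=1` the command is only printed.

```shell
# Placeholders (assumptions): replace with your own project and the real bucket path.
billing_project="my-gcp-project"
bucket_path="gs://BUCKET/PATH"
data_dir="./data"
DRY_RUN=1   # set to 0 to actually start the ~320 GB download

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# -u: project billed for the requester-pays access; -m: parallel transfer.
run gsutil -u "${billing_project}" -m cp -r "${bucket_path}" "${data_dir}/human"
```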
Command used to generate test set predictions for each model replicate:
basenji_test.py --save --rc --shifts "1,0,-1" -t ${data_dir}/human/targets.txt \
-o ${out_dir}/test/rep_${replicate_model}/ \
${out_dir}/models/params_human.json ${out_dir}/train/rep_${replicate_model}/model_best.h5 ${data_dir}/human
This step requires the same data and resources as the training step above.
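Since the test command depends on files produced or downloaded in earlier steps, it can help to confirm they exist before launching it. The helper below is a hypothetical sketch (`check_test_inputs` is not part of Basenji); it checks only the paths that appear in the command above.

```shell
# Sketch: list the inputs of the test step that are missing for one replicate.
# Arguments: $1 = out_dir, $2 = data_dir, $3 = replicate_model
# Prints one missing path per line; prints nothing when all inputs are present.
check_test_inputs() {
  for f in \
    "$1/models/params_human.json" \
    "$1/train/rep_$3/model_best.h5" \
    "$2/human/targets.txt"; do
    [ -e "${f}" ] || echo "${f}"
  done
}

# Example usage (paths are hypothetical):
# missing="$(check_test_inputs ./output ./data 1)"
# [ -z "${missing}" ] || { echo "Missing inputs:"; echo "${missing}"; }
```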
Command used to generate eQTL variant predictions (positive and negative sets) for each model replicate and each tissue:
for eqtl_set in pos neg;
do
basenji_sad.py --rc --shifts -1,0,1 --stats SAD,REF,ALT \
-o ${out_dir}/preds/basenji2_${replicate_model}_${tissue}_${eqtl_set} \
-t ${data_dir}/human/targets.txt -f ${hg38_fasta} \
${out_dir}/models/params_human.json \
${out_dir}/train/rep_${replicate_model}/model_best.h5 \
${gtex_dir}/vcf/${tissue}_${eqtl_set}.vcf
done
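The loop above covers one replicate and one tissue. Running it over the whole ensemble could be sketched as follows. The tissue names, ensemble size, paths, and the `run` helper are hypothetical placeholders, and with `DRY_RUN=1` the commands are only printed.

```shell
# Hypothetical values: none of these are stated in this README.
out_dir="./output"
data_dir="./data"
gtex_dir="./gtex"
hg38_fasta="./hg38.fa"
n_replicates=3
tissues="tissue_a tissue_b"   # placeholder names; must match the VCF file prefixes
DRY_RUN=1                     # set to 0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

n_cmds=0
replicate_model=1
while [ "${replicate_model}" -le "${n_replicates}" ]; do
  for tissue in ${tissues}; do
    for eqtl_set in pos neg; do
      # Same command as above, for one replicate, tissue and eQTL set.
      run basenji_sad.py --rc --shifts -1,0,1 --stats SAD,REF,ALT \
        -o "${out_dir}/preds/basenji2_${replicate_model}_${tissue}_${eqtl_set}" \
        -t "${data_dir}/human/targets.txt" -f "${hg38_fasta}" \
        "${out_dir}/models/params_human.json" \
        "${out_dir}/train/rep_${replicate_model}/model_best.h5" \
        "${gtex_dir}/vcf/${tissue}_${eqtl_set}.vcf"
      n_cmds=$((n_cmds + 1))
    done
  done
  replicate_model=$((replicate_model + 1))
done
```

With these placeholder values the sketch issues one command per replicate, tissue and eQTL set, each writing to its own output directory.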
Necessary data and resources:
- GTEx SuSiE fine-mapped eQTL data from Wang et al. (2021) and Avsec et al. (2021) can be downloaded from Google Cloud (link)