Run setup_python_environment.sh to create and set up an Anaconda/Miniconda environment named collaborative-learning-with-syn-data.
The code is organized as follows:
code/
- data_processing_scripts/ # contains scripts to convert, filter and preprocess data obtained from the UK Biobank
- experiment_scripts/ # contains scripts to produce results of the paper experiments
- plotting_scripts/ # contains scripts to create the plots shown in the paper's figures
code/ also contains the following driver scripts:
- paths_template.sh: A template file to set a number of environment variables. See the next section in this file for further information.
- create_param_files.py: Run to create the list of parameters required by the following scripts.
- run_01_infer_models.sh: Run to infer parameters of the generative models for all paper experiments.
- run_02_generative_twin_data.sh: Run to generate synthetic twin data from the generative models for all paper experiments.
- run_03_lls_over_num_shared.sh: Run to perform the analysis task on combined local and synthetic data for the paper experiments for figures 1-4 and 6.
- run_04_lls_over_num_shared.sh: Run to perform the analysis task on combined local and synthetic data for the paper experiments for figure 5.
- run_05_plotting.sh: Run to create the plots in the paper.
If you have access to a system equipped with the SLURM workload manager, use the files prefixed slurm_ instead of run_.
All scripts require the following environment variables to be set:
- UKB_BASE_FOLDER: Path to the directory that contains all data
- UKB_DATA_ID: Numerical data identifier designated by UKB
- UKB_PROJECT_ID: Numerical project identifier designated by UKB
Please set the corresponding values in code/paths_template.sh and rename or copy the file to code/paths.sh.
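Before running any script, it can help to confirm that the three variables are actually visible to the process. The following is a minimal sketch, not part of the repository; the helper name missing_ukb_variables is hypothetical.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARIABLES = ("UKB_BASE_FOLDER", "UKB_DATA_ID", "UKB_PROJECT_ID")


def missing_ukb_variables(environ=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of the repository.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]
```

Once the variables are exported in the current environment, missing_ukb_variables() should return an empty list.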
All scripts make the following assumptions about the layout of the data directory:
$UKB_BASE_FOLDER/
- original_data/ # contains files directly obtained from UK Biobank and files derived using UKB conversion programs
- processed_data/ # contains filtered and preprocessed UK Biobank data, prepared for use in the experiments
- synthetic_data/ # contains learned generative models and synthetic twin data
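The layout above can be checked programmatically. This is an illustrative sketch only: the helper name missing_subdirs is hypothetical, and the repository scripts may create some of these directories themselves.

```python
from pathlib import Path

# Subdirectory names taken from the layout above.
EXPECTED_SUBDIRS = ("original_data", "processed_data", "synthetic_data")


def missing_subdirs(base_folder):
    """Return the expected subdirectories that do not exist under base_folder.

    Hypothetical helper for illustration; not part of the repository.
    """
    base = Path(base_folder)
    return [name for name in EXPECTED_SUBDIRS if not (base / name).is_dir()]
```

A typical call would pass the value of UKB_BASE_FOLDER, for example missing_subdirs(os.environ["UKB_BASE_FOLDER"]) after importing os.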
$UKB_BASE_FOLDER/original_data/ contains the following files, where XYZ is a placeholder for your UKB-designated data identifier:
- ukbXYZ.enc: Encrypted raw UKB main data set. How to get: Apply to UKB with data fields listed in ....
- ukbXYZ.enc_ukb: Decrypted raw UKB data in UKB's native format. How to get: ukbunpack ukbXYZ.enc <key>, where <key> is provided by UKB via e-mail.
- ukbXYZ.csv: Raw UKB data in CSV format. How to get: ukbconv ukbXYZ.enc_ukb csv
- encoding.ukb: UKB encoding specification file. How to get: Download from UKB: wget -nd biobank.ctsu.ox.ac.uk/crystal/util/encoding.ukb
- columns.pickle: Mapping of encoded values for columns in the data in our own Python format. How to get: Run code/data_processing_scripts/extract_encodings.py
- wXYZ_<YYYYMMDD>.csv: List of participants who have withdrawn since the original data set was obtained, as of the given date. How to get: Regularly provided by UKB via e-mail.
- latest_withdrawals.csv: Last complete list of all withdrawn participants. How to get: Symlink to the latest wXYZ_*.csv file.
- covid19_results.tsv: Covid19 test data downloaded from the UKB data portal. How to get: Apply for access to UKB and download using the data portal (cf. Sec. 5 in Data Access Guide: https://biobank.ctsu.ox.ac.uk/~bbdatan/Accessing_UKB_data_v2.3.pdf )
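The latest_withdrawals.csv symlink can be created by hand, or with a few lines of Python. The sketch below is not part of the repository. The helper name link_latest_withdrawals is hypothetical, and it assumes that the date in each file name is an eight-digit YYYYMMDD string, so that sorting by name is the same as sorting by date.

```python
from pathlib import Path


def link_latest_withdrawals(original_data_dir, identifier):
    """Point latest_withdrawals.csv at the newest w<identifier>_<YYYYMMDD>.csv.

    Hypothetical helper for illustration. `identifier` is the value that
    replaces the XYZ placeholder. Returns the chosen file, or None if no
    withdrawal file is present.
    """
    folder = Path(original_data_dir)
    # YYYYMMDD dates sort chronologically when compared as strings.
    candidates = sorted(folder.glob(f"w{identifier}_*.csv"))
    if not candidates:
        return None
    latest = candidates[-1]
    link = folder / "latest_withdrawals.csv"
    # Replace an existing link (or a stale copy) instead of failing.
    if link.is_symlink() or link.exists():
        link.unlink()
    # A relative target keeps the link valid if the data directory is moved.
    link.symlink_to(latest.name)
    return latest
```

Rerunning the function after a new withdrawal file arrives repoints the link at the newer file.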
$UKB_BASE_FOLDER/processed_data/ contains the following files:
- model_one_full_span_data.csv: UK Biobank data reduced to relevant fields and converted to interpretable fields, for all individuals. How to get: Run code/data_processing_scripts/preprocess_data_entire_span.py
- model_one_covid_tested_data.csv: Same as above, but only individuals for whom at least one SARS-CoV-2 test result was present. How to get: Run code/data_processing_scripts/preprocess_data_entire_span.py
- model_one_covid_tested_data_[train080|test020].csv: The previous, split into train and test sets. How to get: Run code/data_processing_scripts/split_train_test.py
- model_one_covid_tested_data_train080_maxRR.csv: The train data split, subsampled with a given ratio. How to get: Run code/data_processing_scripts/subsample_center_data.py $UKB_BASE_FOLDER/processed_data/model_one_covid_tested_data_train080.csv $UKB_BASE_FOLDER/processed_data/ <subsample ratio, e.g., 0.2>
To populate the directories with the files initially required, follow the steps below:
- Download ukbXYZ.enc and covid19_results.tsv following the instructions provided by UKB.
- Download encoding.ukb and the UKB file format conversion programs from https://biobank.ndph.ox.ac.uk/ukb/download.cgi (cf. https://biobank.ctsu.ox.ac.uk/crystal/exinfo.cgi?src=accessing_data_guide)
- If you were provided with a withdrawal file, create the latest_withdrawals.csv symbolic link (or copy and rename the original file).
- Ensure the environment variables listed in the beginning of this file are set and the UKB conversion programs are in PATH.
- Run code/run_00_process_data.sh <key>, where <key> is provided by UKB via e-mail.
  - This script bundles operations to decrypt, convert and filter UKB data to create all data files in original_data and processed_data that are required for running the experiments.
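A short pre-flight check can catch a missing download or an incomplete PATH before the processing script is started. The sketch below is not part of the repository. The helper name preflight_problems is hypothetical, and it assumes that the downloaded files are placed in original_data/ and that ukbunpack and ukbconv (the two programs named in this file) are the conversion programs needed.

```python
import shutil
from pathlib import Path

# Program names taken from the "How to get" notes in this file.
UKB_PROGRAMS = ("ukbunpack", "ukbconv")


def preflight_problems(base_folder, identifier):
    """List what is still missing before the data processing script is run.

    Hypothetical helper for illustration. `identifier` is the value that
    replaces the XYZ placeholder in ukbXYZ.enc.
    """
    original = Path(base_folder) / "original_data"
    initial_files = (f"ukb{identifier}.enc", "covid19_results.tsv", "encoding.ukb")
    # Files that have to be downloaded before processing can start.
    problems = [
        f"missing file: {original / name}"
        for name in initial_files
        if not (original / name).is_file()
    ]
    # Conversion programs that must be discoverable through PATH.
    problems += [
        f"not in PATH: {program}"
        for program in UKB_PROGRAMS
        if shutil.which(program) is None
    ]
    return problems
```

An empty result means the checked prerequisites are in place; it does not verify the contents of the files or the validity of the key.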