This dataset consists of 2521 samples with genetic data based on 1000 Genomes data, plus synthetic subject attributes and phenotypic data derived from UK Biobank. The synthetic data were initially generated with the TOFU tool, which produces random values based on the UK Biobank data dictionary. Categorical values were drawn at random from the data dictionary, continuous variables were generated from the distribution of values reported by the UK Biobank showcase, and date/time values were random. We also split the phenotypes and attributes into four main classes: general, cancer, diabetes mellitus, and cardiac. The general attributes were assigned to all samples, and the cardiac, diabetes mellitus, and cancer attributes to a proportion of the samples. Once the initial set of phenotypes and attributes had been generated, the data were checked for consistency and, where possible, dependent attributes were calculated from the independent variables generated by TOFU. For example, BMI was calculated from height and weight, and age at death from date of death and date of birth. These data were then loaded into the development instance of BioSamples, which accessioned each of the samples.
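The derivation of dependent attributes described above can be illustrated with a minimal sketch. The function names, the units (kilograms and centimetres), and the choice to report age at death in completed years are assumptions made for illustration; they are not taken from the pipeline that produced this dataset.

```python
from datetime import date


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2 from weight (kg) and height (cm)."""
    if height_cm <= 0:
        raise ValueError("height must be positive")
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def age_at_death(date_of_birth: date, date_of_death: date) -> int:
    """Completed years between date of birth and date of death."""
    if date_of_death < date_of_birth:
        raise ValueError("date of death precedes date of birth")
    years = date_of_death.year - date_of_birth.year
    # Subtract one year if the birthday had not been reached in the year of death.
    if (date_of_death.month, date_of_death.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years
```

Deriving these values from the independent fields, rather than generating them separately, keeps each record internally consistent.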
The genetic data are derived from the 1000 Genomes Phase 3 release. The genotype data consist of a single joint-call VCF file with called genotypes for all 2504 samples, plus bed, bim, fam, and nosex files generated with PLINK for these samples and genotypes. A variety of errors have been introduced into the genotype data to mimic real data and to serve as a test for quality control pipelines. These include gender mismatches, ethnic background mislabelling, and low call rates for a randomly chosen subset of samples, as well as deviations from Hardy-Weinberg equilibrium. Additionally, 40 samples have raw genetic data available as both BAM and CRAM files, including unmapped data. The gender of the samples in the 1000 Genomes data has been matched to the synthetic phenotypic data generated for these samples. The genetic data were then linked to the synthetic data in BioSamples and submitted to EGA.
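As an illustration of the kind of check a quality control pipeline might run against these introduced errors, the sketch below computes a per-sample call rate and a chi-square statistic for deviation from Hardy-Weinberg proportions at a biallelic site. The function names and the use of `None` for a missing call are assumptions; the chi-square statistic is a simple approximation, not the exact test that dedicated tools often use.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) comparing observed genotype
    counts with Hardy-Weinberg expectations at a biallelic site."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype counts supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expected count (monomorphic sites).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

A site with counts exactly in Hardy-Weinberg proportions gives a statistic of zero, while a complete absence of heterozygotes at intermediate allele frequency gives a large value.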
Phenotypic data for CINECA_synthetic_cohort_EUROPE_UK1 are publicly available at the dev instance of the BioSamples database. At the moment there are 2504 synthetic sample entries adhering to the UKB data distribution. They all have a project attribute set to UKB_SYNTHETIC_DATA and can therefore be easily selected by applying the filter attr:project:UKB_SYNTHETIC_DATA in BioSamples.
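The filter above can also be passed programmatically. The following is a minimal sketch that only builds a query URL; it does not contact any server. The base URL is a placeholder, and the `/samples` path and the `filter` and `size` query parameters are assumed to follow the BioSamples REST API conventions, so check them against the instance you are using.

```python
from urllib.parse import urlencode

# Placeholder base URL (an assumption); substitute the address of the
# BioSamples instance you are querying.
BASE_URL = "https://example.org/biosamples"


def samples_query_url(base_url, attribute, value, size=20):
    """Build a samples query URL that filters on a single attribute value."""
    params = {"filter": f"attr:{attribute}:{value}", "size": size}
    return f"{base_url.rstrip('/')}/samples?{urlencode(params)}"


url = samples_query_url(BASE_URL, "project", "UKB_SYNTHETIC_DATA")
```

The resulting URL can then be fetched with any HTTP client.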
- Description: doc, ppt
- Google Drive
- BioSamples
- EGA
The H3ABioNet synthetic data was created using a modified version of tofu (https://github.com/spiros/tofu), using a database of fields and values/encodings.
We used the H3Africa Core phenotype data dictionary to generate new database fields and encodings to mimic African data.
We used a modified version of TOFU to generate the metadata for 100 samples. All sample identifiers are prefaced with 'fake' to avoid confusion with real datasets. We constructed a database of fields and values using the H3Africa Core phenotype, a set of recommended questions or variables that H3Africa projects should consider when designing their data collection forms. We selected 170 of the 255 H3Africa Core phenotype variables to provide a good overlap with the CINECA core metadata model. Categorical values were chosen at random from the field choices in the H3Africa Core phenotype data dictionary. Continuous values, such as age and date of birth, were selected at random from within the field ranges in the same data dictionary. The phenotypic data can be accessed from the public Google Drive folder.
More information on the phenotypic data variables (description, data type, examples) can be found in this document that describes the coverage with the CINECA minimal metadata model.
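The generation approach described above can be sketched as follows. This is not the modified TOFU code: the dictionary layout, field names, value ranges, and the `fake<N>` identifier format are all hypothetical and chosen only to show how categorical, numeric, and date fields might be drawn independently from a data dictionary.

```python
import random
from datetime import date, timedelta

# Hypothetical data dictionary; field names, choices, and ranges are illustrative only.
DATA_DICTIONARY = {
    "sex": {"type": "categorical", "choices": ["male", "female"]},
    "age": {"type": "integer", "min": 18, "max": 90},
    "date_of_birth": {"type": "date", "min": date(1910, 1, 1), "max": date(1990, 12, 31)},
}


def random_value(spec, rng):
    """Draw one random value according to a field specification."""
    if spec["type"] == "categorical":
        return rng.choice(spec["choices"])
    if spec["type"] == "integer":
        return rng.randint(spec["min"], spec["max"])
    if spec["type"] == "date":
        span_days = (spec["max"] - spec["min"]).days
        return spec["min"] + timedelta(days=rng.randint(0, span_days))
    raise ValueError(f"unsupported field type: {spec['type']}")


def generate_samples(n, dictionary, seed=None):
    """Generate n records, drawing each field independently of the others."""
    rng = random.Random(seed)
    samples = []
    for i in range(1, n + 1):
        record = {"sample_id": f"fake{i}"}  # identifier format is an assumption
        for field, spec in dictionary.items():
            record[field] = random_value(spec, rng)
        samples.append(record)
    return samples


samples = generate_samples(100, DATA_DICTIONARY, seed=1)
```

Because every field is drawn independently, a record's age and date of birth need not agree, which is why values produced this way can be unrealistic.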
We used the 1000 Genomes Project Phase 3 data to generate the genetic data. From the 2504 samples included in the 1000 Genomes Project, we randomly selected 100 samples of African ancestries. We then used BCFTools to replace the 100 sample identifiers with the ones in the metadata file. We also used BCFTools to randomly select 2 million variants on chromosome 22. The genetic data can be accessed from the public Google Drive folder.
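The sample selection and renaming step can be sketched as below. This is an illustration under assumptions, not the script that was actually used: the function names are hypothetical, and the output uses the "old_name new_name" pair layout that `bcftools reheader --samples` accepts for renaming samples.

```python
import random


def build_rename_map(source_ids, new_ids, n, seed=None):
    """Randomly pick n source sample IDs and pair each with a new ID."""
    if n > len(source_ids) or n > len(new_ids):
        raise ValueError("not enough identifiers for the requested sample count")
    rng = random.Random(seed)
    chosen = rng.sample(list(source_ids), n)
    return list(zip(chosen, new_ids[:n]))


def format_rename_lines(pairs):
    """Render the pairs as one 'old_name new_name' line per sample."""
    return "".join(f"{old} {new}\n" for old, new in pairs)
```

Writing the result of `format_rename_lines` to a text file gives a mapping that can be supplied to the renaming step, with the new identifiers taken from the metadata file.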
- The CINECA_synthetic_cohort_Africa_H3ABioNet1 data are currently accessible via the CINECA synthetic data project's Google Drive folder.
- This data will also be made available through the H3ABioNet Beacon test instance at https://beacon2.h3abionet.org
Both the genetic data and metadata are fully accessible under the Creative Commons Licence (CC-BY). Any use of this data should, however, thoroughly consider the following:
- The metadata conforms to the structure and schema of the H3Africa Core phenotype, but it is otherwise nonsensical: no checks have been implemented across fields, and values may be completely unrealistic.
- We did not model any correlation between fields. We do, however, plan to model correlations between a few variables, such as weight and height.
- This dataset should not be used to make any inference whatsoever as the values of the fields do not entirely reflect reality.
- Randomly generated dates fall between 1910 and 1990 to avoid confusion with real data.
- This synthetic dataset (with cohort "participants" / "subjects" marked with 'fake') contains no identifiable data and cannot be used to make any inference about H3Africa cohort data or results. This dataset aims to aid the development of technical implementations for cohort data discovery, harmonisation, access, and federated analysis. In support of FAIRness in data sharing, this dataset is made freely available under the Creative Commons Licence (CC-BY). Please ensure this preamble is included with this dataset and that the H3Africa project and the CINECA project (funding: EC H2020 grant 825775 and CIHR grant 404896) are acknowledged. If you have any questions about this dataset, please contact Mamana Mbiyavanga ([email protected]) or Nicola Mulder ([email protected]).