YOUth pilot privacy-friendly synthetic data

This repository implements a pilot for creating privacy-friendly questionnaire datasets from the YOUth cohort. It is built on metasyn with the disclosure control plugin.

Installation

To install the dependencies of this project, follow these steps:

We use uv to manage dependencies and environments. Install it first.
Clone this repository.
Instantiate the environment by running uv sync from this folder.

Synthesizing data

Obtain the following datasets from the YOUth study and put them in the raw_data folder: CECPAQ_2.csv, M_DEMOGRAFY_1.csv, P_DEMOGRAFY_1.csv, P_LIFSTYLE_1_MED_STOREY.csv, P_LIFSTYLE_1_MEDICATIONY.csv, P_LIFSTYLE_1.csv, Q_1.csv.
Obtain the following metadata files and put them in the raw_data/metadata folder: YOUth_baby_en_kind-metadata.csv, YOUth_baby_en_kind-valuelabels.csv.
Create the synthetic data by running uv run synthesize.py from this folder.

Now, the folders output/csv and output/gmf should be populated with synthetic data and metadata, respectively:

📁 synthetic_youth_pilot/
├── 📖 README.md
├── 📄 test_analysis.py
├── 📄 synthesize.py
├── pyproject.toml
├── uv.lock
├── 📁 raw_data/
│   ├── 📜 CECPAQ_2.csv
│   ├── 📜 M_DEMOGRAFY_1.csv
│   ├── 📜 P_DEMOGRAFY_1.csv
│   ├── 📜 P_LIFSTYLE_1.csv
│   ├── 📜 P_LIFSTYLE_1_MEDICATIONY.csv
│   ├── 📜 P_LIFSTYLE_1_MED_STOREY.csv
│   ├── 📜 Q_1.csv
│   └── 📁 metadata/
│       ├── 📜 YOUth_baby_en_kind-metadata.csv
│       └── 📜 YOUth_baby_en_kind-valuelabels.csv
└── 📁 output/
    ├── 📁 csv/
    │   ├── 📜 CECPAQ_2.csv
    │   ├── 📜 M_DEMOGRAFY_1.csv
    │   ├── 📜 P_DEMOGRAFY_1.csv
    │   ├── 📜 P_LIFSTYLE_1.csv
    │   ├── 📜 P_LIFSTYLE_1_MEDICATIONY.csv
    │   ├── 📜 P_LIFSTYLE_1_MED_STOREY.csv
    │   └── 📜 Q_1.csv
    └── 📁 gmf/
        ├── 📜 CECPAQ_2.json
        ├── 📜 M_DEMOGRAFY_1.json
        ├── 📜 P_DEMOGRAFY_1.json
        ├── 📜 P_LIFSTYLE_1.json
        ├── 📜 P_LIFSTYLE_1_MEDICATIONY.json
        ├── 📜 P_LIFSTYLE_1_MED_STOREY.json
        └── 📜 Q_1.json

5 directories, 28 files
📖README 📜Data 📄Code 📁Folder

(Made with scitree)
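
Before running synthesize.py, it can help to confirm that the input files from the tree above are in place. The following is a minimal sketch, not part of this repository: the function name missing_inputs is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the instructions above.
RAW_FILES = [
    "CECPAQ_2.csv",
    "M_DEMOGRAFY_1.csv",
    "P_DEMOGRAFY_1.csv",
    "P_LIFSTYLE_1.csv",
    "P_LIFSTYLE_1_MEDICATIONY.csv",
    "P_LIFSTYLE_1_MED_STOREY.csv",
    "Q_1.csv",
]
METADATA_FILES = [
    "YOUth_baby_en_kind-metadata.csv",
    "YOUth_baby_en_kind-valuelabels.csv",
]


def missing_inputs(root="."):
    """Return the expected input files (relative to root) that do not exist.

    Hypothetical helper for illustration; it only checks for presence,
    not for file contents.
    """
    root = Path(root)
    expected = [root / "raw_data" / name for name in RAW_FILES]
    expected += [root / "raw_data" / "metadata" / name for name in METADATA_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


# Report what is still missing in the current folder.
missing = missing_inputs(".")
if missing:
    print("Missing input files:")
    for name in missing:
        print("  " + name)
else:
    print("All input files found.")
```

An empty result means all seven datasets and both metadata files were found under raw_data.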

Test analysis

This repo includes a test analysis that displays medication use by age bracket for both the real and the synthetic data. The analysis is in the file test_analysis.py. To run it, run uv run test_analysis.py. The output will look something like the following:

Paracetamol use in real data:

Age: 10 - 19 | ____ ███████████████████
Age: 20 - 29 | ____ ████████████████
Age: 30 - 39 | ____ ███████████████
Age: 40 - 49 | ____ ██████████████
Age: 50 - 59 | ____ █████████


Paracetamol use in synthetic data:

Age: 10 - 19 | 0.81 ████████████████████
Age: 20 - 29 | 0.81 ████████████████████
Age: 30 - 39 | 0.78 ███████████████████
Age: 40 - 49 | 0.74 ██████████████████
Age: 50 - 59 | 0.81 ████████████████████

(numbers redacted & bars fuzzed in real data analysis for privacy)

Three things are noteworthy here:

The analysis code is exactly the same for the synthetic and the real data.
The ranges of the individual variables (age and paracetamol use) are similar.
The relation between age and paracetamol use (use declines with age in the real data) is absent from the synthetic data.
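
The kind of per-bracket summary shown above can be sketched in a few lines of plain Python. This is an illustration only and not the contents of test_analysis.py: the function names, the bracket width, the bar scale, and the toy values are all assumptions made for this example.

```python
from collections import defaultdict


def use_by_age_bracket(ages, uses, width=10):
    """Return {bracket_start: proportion of users} per age bracket.

    ages and uses are parallel sequences; uses holds 0/1 (or False/True).
    """
    totals = defaultdict(int)
    users = defaultdict(int)
    for age, used in zip(ages, uses):
        start = (age // width) * width  # e.g. 27 -> 20
        totals[start] += 1
        if used:
            users[start] += 1
    return {start: users[start] / totals[start] for start in sorted(totals)}


def render(proportions, width=10, scale=20):
    """Format each bracket as a text line with a proportional bar."""
    lines = []
    for start, prop in proportions.items():
        bar = "█" * round(prop * scale)
        lines.append(f"Age: {start} - {start + width - 1} | {prop:.2f} {bar}")
    return lines


# Toy values for illustration only; not taken from the YOUth data.
ages = [12, 15, 18, 23, 27, 34, 38, 45]
uses = [1, 1, 0, 1, 0, 1, 1, 0]

proportions = use_by_age_bracket(ages, uses)
for line in render(proportions):
    print(line)
```

Because the functions only need an age column and a 0/1 use column, the same code can be pointed at a real or a synthetic table, which is the first point in the list above.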

Contact

This is a project by the ODISSEI Social Data Science team. Do you have questions, suggestions, or remarks on the technical implementation? Create an issue in the issue tracker or feel free to contact Erik-Jan van Kesteren.
