A Systematic Clustering Analysis for Binary Multimorbidity Data

This is the code repository corresponding to the systematic clustering analysis in our article, "Investigating associations between physical multimorbidity clusters and subsequent depression: cluster and survival analysis of UK Biobank data".

Introduction

Perform a systematic clustering analysis that applies four different approaches to your binary data: agglomerative hierarchical clustering, latent class analysis (LCA), k-medoids, and k-modes.

You can also visualize your results using adjusted relative frequencies on a bubble heatmap:

➡️ Jump to Visualization Section

Installation

To begin, create a Python environment and install the package.

Clone and navigate to the repository:

git clone https://github.com/laurendelong21/clusterMed.git
cd clusterMed

Create and activate a Python environment (replace $ENV_NAME$ with your desired environment name):

either with venv:

python3 -m venv $ENV_NAME$
source $ENV_NAME$/bin/activate

or with conda (choose python>=3.7):

conda create --name $ENV_NAME$ python=3.7
conda activate $ENV_NAME$

and install the clustr package:

pip install .

Usage

File format

This package operates upon tab-separated files (.tsv).

The column names should indicate the conditions.
The rows should denote participants. The first column should be the patient ID (or some generic index number).
- In pandas, such a file can be read as follows:
```
pd.read_csv(input_file, sep='\t', index_col=0)
```
The multimorbidity data should be binary (zero or one) indicating whether each participant (row) has a condition (column).
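As a quick check before running any clustering, the format described above can be verified with a few lines of pandas. This is a minimal sketch, not part of the clustr package: the helper name `load_binary_tsv` and the tiny in-memory example are assumptions made for illustration.

```python
import io

import pandas as pd


def load_binary_tsv(source):
    """Read a tab-separated file and check that every value is 0 or 1.

    Hypothetical helper for illustration; it is not part of clustr.
    """
    df = pd.read_csv(source, sep="\t", index_col=0)
    # Every cell must be 0 or 1; missing values fail this check as well.
    if not df.isin([0, 1]).all().all():
        raise ValueError("all values must be 0 or 1")
    return df.astype(int)


# Tiny in-memory example: three participants, two made-up conditions.
example = io.StringIO(
    "id\tdisease_1\tdisease_2\n"
    "p1\t1\t0\n"
    "p2\t0\t0\n"
    "p3\t1\t1\n"
)
data = load_binary_tsv(example)
print(data.shape)  # (3, 2)
```

Passing a file path instead of the `io.StringIO` object works the same way, since `pd.read_csv` accepts both.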

Dummy File

If you wish to test out the functionality of the clustr package, you can create a dummy file using generate_dummy_data.sh.

First, make sure you are in the repository's directory:

cd clusterMed

Then run the script in the terminal. It will prompt you to enter the number of columns (conditions) and rows (participants) you want. If you just press enter, it will generate a dummy file with 10,000 faux participants and 50 condition columns.

./generate_dummy_data.sh
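If you would rather build a comparable file from Python, the sketch below generates random binary data in the same shape. It is not a port of generate_dummy_data.sh: the function name, the `disease_N` column naming (borrowed from the CLI examples in this README), the index name, and the 10% prevalence are all assumptions.

```python
import numpy as np
import pandas as pd


def make_dummy_data(n_rows=10_000, n_cols=50, prevalence=0.1, seed=0):
    """Return a random binary participants-by-conditions table.

    Hypothetical stand-in for the shell script; every cell is 1 with
    probability `prevalence` (an assumed value), independently.
    """
    rng = np.random.default_rng(seed)
    values = (rng.random((n_rows, n_cols)) < prevalence).astype(int)
    columns = [f"disease_{i}" for i in range(1, n_cols + 1)]
    return pd.DataFrame(values, columns=columns).rename_axis("participant_id")


dummy = make_dummy_data(n_rows=100, n_cols=5)
# Serialise as tab-separated text; pass a path to to_csv to write a file.
tsv_text = dummy.to_csv(sep="\t")
print(tsv_text.splitlines()[0])
```

Because the cells are independent, data generated this way contains no real cluster structure; it is only useful for checking that the pipeline runs.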

Tutorial

tutorial.ipynb gives an overview of how you can use the clustr package within your Python code.

To run the notebook, you need to do a few things first.

First, create a dummy file by running generate_dummy_data.sh. Directions are above.
Register your virtual environment as a Jupyter kernel. The exact steps depend on the tool you use (Jupyter Notebook, VS Code, etc.), and instructions are available online, but typically you activate the environment and then run:
```
 python -m ipykernel install --user --name=$ENV_NAME$
```
With your environment as the kernel, use the notebook. Be careful when changing the proportion of the file used: these methods can be computationally expensive with large files, so consider whether your machine can handle working with the whole dummy file.

Command Line Interface (CLI)

These clustering methods can be computationally expensive for many participants and conditions. You may want to run these on a remote server with a CLI.

To use the CLI, you only need the clustr package installed and your data files ready. In the examples given, we use the dummy file, as above.

Invoke each of the commands below as a subcommand of clustr (for example, clustr agg).

Read further for options and examples.

Commands Available:

command	description
agg	Performs agglomerative hierarchical clustering on an input file.
lcaselect	Helps facilitate model selection for LCA using the Bayesian information criterion (BIC).
lca	Performs Latent Class Analysis on an input file.
kmeselect	Helps facilitate model selection for k-medoids using a scree plot.
kmedoids	Performs k-medoids clustering on an input file.
kmoselect	Helps facilitate model selection for k-modes using a scree plot.
kmodes	Performs k-modes clustering on an input file.

Below, we describe how to use each of these commands with examples.

agg

Performs agglomerative hierarchical clustering on an input file. The input file must be in the specified format. Use clustr agg --help for more details.

option	description
-i / --infile	the input filepath; recommended to store within the 'data' directory
-b / --subdir	denotes a subdirectory to create and write to, such as 'women'
-m / --metric	the metric to be used for clustering (default is hamming distance)
-l / --linkage	the type of linkage to be used for clustering (default is complete)
-s / --sample_frac	the fraction of the dataset to use (default is 1, so 100%)
-dh / --drop_healthy	whether to drop those who have no conditions (default is False)
-c / --coi	the name of some condition of interest (e.g. 'Depression') which is taken out of the analysis

For example,

clustr agg -i ./data/dummy_data.tsv -dh True -s 0.05 -c disease_3

In the above, we will do agglomerative hierarchical clustering upon 5% of the rows in the dummy data file, after dropping those with no conditions (all zeroes). Before those steps, however, we are taking out disease_3 and saving the labels for this condition separately.
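The order of operations described above (set aside the condition of interest, drop participants with no conditions, subsample, then cluster) can be sketched with pandas and SciPy. This is an illustration under stated assumptions, not the clustr implementation: the `preprocess` helper, the toy data, and the choice to cut the tree into at most three clusters are all hypothetical. Hamming distance and complete linkage mirror the CLI defaults listed in the table.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def preprocess(df, coi=None, drop_healthy=False, sample_frac=1.0, seed=0):
    """Hypothetical helper mirroring the -c, -dh and -s options."""
    coi_labels = None
    if coi is not None:
        # Set the condition of interest aside before anything else.
        coi_labels = df[coi]
        df = df.drop(columns=coi)
    if drop_healthy:
        # Keep only participants with at least one remaining condition.
        df = df[df.sum(axis=1) > 0]
    if sample_frac < 1:
        df = df.sample(frac=sample_frac, random_state=seed)
    if coi_labels is not None:
        # Align the saved labels with the rows that survived.
        coi_labels = coi_labels.loc[df.index]
    return df, coi_labels


# Toy data standing in for the dummy file.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 2, size=(200, 6)),
    columns=[f"disease_{i}" for i in range(1, 7)],
)
features, coi_labels = preprocess(
    toy, coi="disease_3", drop_healthy=True, sample_frac=0.5
)

# Pairwise Hamming distances, then complete-linkage clustering.
distances = pdist(features.to_numpy(), metric="hamming")
tree = linkage(distances, method="complete")
# Cutting into at most 3 clusters is an arbitrary choice for this example.
labels = fcluster(tree, t=3, criterion="maxclust")
print(pd.Series(labels).value_counts().sort_index())
```

Note that `pdist` stores one distance per pair of rows, so memory grows quadratically with the number of participants; this is one reason subsampling can be necessary.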

lcaselect

Helps facilitate model selection for LCA using the Bayesian information criterion (BIC). The input file must be in the specified format. Use clustr lcaselect --help for more details.

option	description
-i / --infile	the input filepath; recommended to store within the 'data' directory
-b / --subdir	denotes a subdirectory to create and write to, such as 'women'
-mi / --min_k	the minimum number k clusters to investigate (default is 2)
-ma / --max_k	the maximum number k clusters to investigate (default is 10)
-s / --sample_frac	the fraction of the dataset to use (default is 1, so 100%)
-dh / --drop_healthy	whether to drop those who have no conditions (default is False)
-c / --coi	the name of some condition of interest (e.g. 'Depression') which is taken out of the analysis

For example,

clustr lcaselect -i ./data/dummy_data.tsv -dh True -s 0.05 -c disease_3 -mi 2 -ma 5

In the above, we will investigate what number of classes, k, works best with latent class analysis upon 5% of the rows in the dummy data file. First, we are taking out disease_3 and saving the labels for this condition separately, and we are dropping those with no conditions (all zeroes).

Specifically, we are investigating k within the range of [2, 5].
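To make the idea of BIC-based selection concrete, here is a minimal sketch that fits a latent class model (a mixture of independent Bernoulli variables) with a hand-written EM loop and compares BIC across k. It is not the code behind clustr lcaselect; the function names, the initialization, the iteration limits, and the simulated two-class data are assumptions for illustration. Lower BIC is better.

```python
import numpy as np


def _log_joint(X, weights, probs):
    # log P(row, class) for every row and class, shape (n_rows, k).
    return X @ np.log(probs).T + (1 - X) @ np.log(1 - probs).T + np.log(weights)


def fit_lca(X, k, n_iter=200, tol=1e-6, seed=0):
    """Fit a k-class Bernoulli mixture by EM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(k, 1.0 / k)
    probs = rng.uniform(0.25, 0.75, size=(k, d))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log-likelihood and class responsibilities per row.
        log_joint = _log_joint(X, weights, probs)
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        resp = np.exp(log_joint - log_norm[:, None])
        # M-step: update class sizes and per-class condition probabilities.
        nk = resp.sum(axis=0) + 1e-10  # small floor avoids division by zero
        weights = nk / nk.sum()
        probs = np.clip(resp.T @ X / nk[:, None], 1e-6, 1 - 1e-6)
    # ll is the log-likelihood evaluated at the last E-step.
    return ll, weights, probs


def bic(X, k, **kwargs):
    ll, _, _ = fit_lca(X, k, **kwargs)
    n, d = X.shape
    n_params = (k - 1) + k * d  # class weights plus Bernoulli probabilities
    return -2 * ll + n_params * np.log(n)


# Simulated data with two latent classes (assumed for the example).
rng = np.random.default_rng(1)
true_class = rng.integers(0, 2, size=400)
class_probs = np.where(true_class[:, None] == 0, 0.9, 0.1)
X = (rng.random((400, 8)) < class_probs).astype(int)

bics = {k: bic(X, k) for k in range(2, 6)}  # k in [2, 5], as in the example
best_k = min(bics, key=bics.get)
print(bics, best_k)
```

EM is sensitive to its starting point, so in practice each k would usually be fitted from several initializations before comparing BIC values.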

lca

Performs Latent Class Analysis on an input file. The input file must be in the specified format. Use clustr lca --help for more details.

option	description
-i / --infile	the input filepath; recommended to store within the 'data' directory
-b / --subdir	denotes a subdirectory to create and write to, such as 'women'
-r / --repetitions	number of times to run the clustering method; this is due to the sensitivity of initialization (default is 1)
-k / --kclusters	k number of clusters (default is 10)
-s / --sample_frac	the fraction of the dataset to use (default is 1, so 100%)
-dh / --drop_healthy	whether to drop those who have no conditions (default is False)
-c / --coi	the name of some condition of interest (e.g. 'Depression') which is taken out of the analysis

For example,

clustr lca -i ./data/dummy_data.tsv -dh True -s 0.05 -c disease_3 -k 10 -r 5

In the above, we will do latent class analysis upon 5% of the rows in the dummy data file, after dropping those with no conditions (all zeroes). Before those steps, however, we are taking out disease_3 and saving the labels for this condition separately.

Here, we are looking for 10 classes, and repeating this analysis 5 times. Execution will automatically make five subdirectories within the LCA results folder, and each subdirectory will comprise individual results.

kmeselect

Helps facilitate model selection for k-medoids using a scree plot. The input file must be in the specified format. Use clustr kmeselect --help for more details.

option	description
-i / --infile	the input filepath; recommended to store within the 'data' directory
-b / --subdir	denotes a subdirectory to create and write to, such as 'women'
-mi / --min_k	the minimum number k clusters to investigate (default is 2)
-ma / --max_k	the maximum number k clusters to investigate (default is 10)
-s / --sample_frac	the fraction of the dataset to use (default is 1, so 100%)
-dh / --drop_healthy	whether to drop those who have no conditions (default is False)
-c / --coi	the name of some condition of interest (e.g. 'Depression') which is taken out of the analysis

For example,

clustr kmeselect -i ./data/dummy_data.tsv -dh True -s 0.05 -c disease_3 -mi 2 -ma 5

In the above, we will investigate what number of clusters, k, works best with k-medoids clustering upon 5% of the rows in the dummy data file. First, we are taking out disease_3 and saving the labels for this condition separately, and we are dropping those with no conditions (all zeroes).

Specifically, we are investigating k within the range of [2, 5].
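The values behind a scree plot are simply the clustering cost at each k. The sketch below computes them with a small alternating k-medoids routine on a precomputed Hamming distance matrix. It is an illustration, not the algorithm used by clustr kmeselect: the function names, the random initialization, and the toy data are assumptions, and the alternating update is a simpler heuristic than the classic PAM swap search.

```python
import numpy as np


def hamming_matrix(X):
    # Fraction of conditions on which each pair of rows differs.
    X = np.asarray(X)
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)


def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating k-medoids on a distance matrix (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    # Cost: total distance from each point to its assigned medoid.
    cost = D[np.arange(n), medoids[labels]].sum()
    return medoids, labels, cost


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(150, 8))
D = hamming_matrix(X)

# Costs for k in [2, 5]; plotting cost against k gives the scree plot.
costs = {k: k_medoids(D, k)[2] for k in range(2, 6)}
for k, cost in costs.items():
    print(k, round(float(cost), 3))
```

When reading such a plot, one typically looks for the k beyond which the cost stops dropping sharply (the "elbow").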

kmedoids

Performs k-medoids clustering on an input file. The input file must be in the specified format. Use clustr kmedoids --help for more details.

option	description
-i / --infile	the input filepath; recommended to store within the 'data' directory
-b / --subdir	denotes a subdirectory to create and write to, such as 'women'
-k / --kclusters	k number of clusters (default is 10)
-s / --sample_frac	the fraction of the dataset to use (default is 1, so 100%)
-dh / --drop_healthy	whether to drop those who have no conditions (default is False)
-c / --coi	the name of some condition of interest (e.g. 'Depression') which is taken out of the analysis

For example,

clustr kmedoids -i ./data/dummy_data.tsv -dh True -s 0.05 -c disease_3 -k 10 -b 02_05_2024

In the above, we will do k-medoids clustering upon 5% of the rows in the dummy data file, after dropping those with no conditions (all zeroes). Before those steps, however, we are taking out disease_3 and saving the labels for this condition separately.

Here, we are looking for 10 clusters, and we have requested that results are written into a subdirectory of the k-medoids results folder, called 02_05_2024.

kmoselect

Helps facilitate model selection for k-modes using a scree plot. The input file must be in the specified format. Use clustr kmoselect --help for more details.

option	description
-i / --infile	the input filepath; recommended to store within the 'data' directory
-b / --subdir	denotes a subdirectory to create and write to, such as 'women'
-mi / --min_k	the minimum number k clusters to investigate (default is 2)
-ma / --max_k	the maximum number k clusters to investigate (default is 10)
-s / --sample_frac	the fraction of the dataset to use (default is 1, so 100%)
-dh / --drop_healthy	whether to drop those who have no conditions (default is False)
-c / --coi	the name of some condition of interest (e.g. 'Depression') which is taken out of the analysis

For example,

clustr kmoselect -i ./data/dummy_data.tsv -dh True -s 0.05 -c disease_3 -mi 2 -ma 5

In the above, we will investigate what number of clusters, k, works best with k-modes clustering upon 5% of the rows in the dummy data file. First, we are taking out disease_3 and saving the labels for this condition separately, and we are dropping those with no conditions (all zeroes).

Specifically, we are investigating k within the range of [2, 5].

kmodes

Performs k-modes clustering on an input file. The input file must be in the specified format. Use clustr kmodes --help for more details.

option	description
-i / --infile	the input filepath; recommended to store within the 'data' directory
-b / --subdir	denotes a subdirectory to create and write to, such as 'women'
-r / --repetitions	number of times to run the clustering method; this is due to the sensitivity of initialization (default is 1)
-k / --kclusters	k number of clusters (default is 10)
-s / --sample_frac	the fraction of the dataset to use (default is 1, so 100%)
-dh / --drop_healthy	whether to drop those who have no conditions (default is False)
-c / --coi	the name of some condition of interest (e.g. 'Depression') which is taken out of the analysis

For example,

clustr kmodes -i ./data/dummy_data.tsv -dh True -s 0.05 -c disease_3 -k 10 -r 5

In the above, we will do k-modes clustering upon 5% of the rows in the dummy data file, after dropping those with no conditions (all zeroes). Before those steps, however, we are taking out disease_3 and saving the labels for this condition separately.

Here, we are looking for 10 clusters, and repeating this analysis 5 times. Execution will automatically make five subdirectories within the k-modes results folder, and each subdirectory will comprise individual results.
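To show why repetitions matter, the sketch below runs a small k-modes routine from several random initializations and compares the resulting costs. It is not the clustr implementation: the function name, the tie-breaking rule, the toy data, and the choice of 4 clusters are assumptions. The CLI keeps each repetition's results in its own subdirectory, whereas this example simply picks the lowest-cost run.

```python
import numpy as np


def k_modes(X, k, n_iter=100, seed=0):
    """k-modes for binary data (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    # Start from k distinct rows chosen at random.
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Number of mismatched conditions between each row and each mode.
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_modes = modes.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                # Column-wise mode of a binary cluster; ties go to 1.
                new_modes[j] = (members.mean(axis=0) >= 0.5).astype(X.dtype)
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
    labels = dist.argmin(axis=1)
    cost = dist.min(axis=1).sum()  # total mismatches to the assigned mode
    return modes, labels, cost


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))

# Five repetitions with different seeds, mirroring "-r 5".
runs = [k_modes(X, k=4, seed=s) for s in range(5)]
costs = [int(cost) for _, _, cost in runs]
best_modes, best_labels, best_cost = min(runs, key=lambda run: run[2])
print(costs, int(best_cost))
```

Differences between the printed costs reflect the sensitivity to initialization mentioned in the options table.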

Visualization

Visualize your clustering results with a bubble heatmap! See bubble_heatmap_tutorial.ipynb for a guide to visualizing your results.

As in our paper, a bubble heatmap places adjusted relative frequency (ARF) values on a grid in which:

- the y-axis contains conditions,
- the x-axis contains clusters,
- and data points are colored blue (under-representation) or red (over-representation) at each intersection.

How to read a bubble heatmap:

- The magnitude of under- or over-representation is indicated by the size of the data point, or bubble.
- Points that are not statistically significant are omitted.
  - Consequently, conditions with no significant values are omitted from the y-axis entirely.
- For visualization purposes, the ARF values are adjusted so that values denoting under-representation (between zero and one) are mapped to a scale similar to that of values denoting over-representation (greater than one).
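The rescaling in the last point can be illustrated with one possible mapping. The exact adjustment used in the paper and in bubble_heatmap_tutorial.ipynb is not reproduced here; the negative-reciprocal rule below and the function name `symmetrise_ratio` are assumptions chosen only to show the idea that a ratio of 0.5 (half as frequent) should be as prominent as a ratio of 2 (twice as frequent).

```python
import numpy as np


def symmetrise_ratio(values):
    """Map ratios in (0, 1) to negative values of comparable magnitude.

    Assumed mapping for illustration: x -> -1/x for 0 < x < 1, while
    values of 1 or more are left unchanged. A ratio of exactly 0 is also
    left unchanged, because its reciprocal is undefined.
    """
    values = np.asarray(values, dtype=float)
    adjusted = values.copy()
    under = (values > 0) & (values < 1)
    adjusted[under] = -1.0 / values[under]
    return adjusted


ratios = [0.25, 0.5, 1.0, 2.0, 4.0]
print(symmetrise_ratio(ratios).tolist())  # [-4.0, -2.0, 1.0, 2.0, 4.0]
```

With a mapping like this, the sign can drive the color (blue or red) and the absolute value can drive the bubble size.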
