This repository contains comprehensive equity bias analyses across three major biomedical research domains: neuropsychiatric disorders (PsychAD), cancer research (HTAN), and cell atlas studies (HCA).
- R (version 4.0 or higher)
- RStudio (recommended)
Run the following commands in R or RStudio to install all dependencies:
# Install required packages
required_packages <- c(
  "ggplot2",
  "dplyr",
  "tidyr",
  "stringr",
  "scales",
  "readxl",
  "car"
)
# Check which packages need to be installed
missing_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
# Install missing packages
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
# Load all packages to verify installation
lapply(required_packages, library, character.only = TRUE)
# Quick test to ensure all packages load correctly
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(readxl)
library(car)
cat("All packages installed successfully!\n")
Ensure your data files are placed in the data/ directory:
- data/psych-AD_media-1.csv (for PsychAD analysis)
- data/HTAN scrnaseq data final.xlsx (for HTAN analysis)
- data/HCA scrnaseq data final.xlsx (for HCA analysis)
The repository provides one analysis script per dataset:
- equity_bias_analysis_PsychAD.R - Analysis of the PsychAD media dataset
- equity_bias_analysis_HTAN.R - Analysis of the HTAN cancer dataset
- equity_bias_analysis_HCA.R - Analysis of the Human Cell Atlas dataset
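Before running the scripts, it can help to confirm that the three input files are where the scripts expect them. The following is a minimal sketch, not part of the repository: `check_inputs()` is a hypothetical helper, and the paths are assumed to be relative to the directory the scripts are run from.

```r
# Expected input files, as listed above (assumed relative to the working directory)
expected_files <- c(
  PsychAD = "data/psych-AD_media-1.csv",
  HTAN    = "data/HTAN scrnaseq data final.xlsx",
  HCA     = "data/HCA scrnaseq data final.xlsx"
)

# check_inputs() is a hypothetical helper, not part of the analysis scripts.
# It returns the names of the datasets whose input file is missing.
check_inputs <- function(files = expected_files) {
  present <- file.exists(files)
  names(files)[!present]
}

missing_inputs <- check_inputs()
if (length(missing_inputs) > 0) {
  message("Missing input files for: ", paste(missing_inputs, collapse = ", "))
} else {
  message("All input files found.")
}
```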
All three analyses standardize ancestry/race variables to consistent categories:
PsychAD Dataset:
# Maps original ancestry codes to standardized categories
AFR → African
AMR → Latino
EAS, SAS → Asian (combined East Asian and South Asian)
EUR → European
EAS_SAS → Asian (mixed category)
Unknown/NA → Unknown
Other values → Other
HTAN Dataset:
# Uses race and ethnicity columns with Latino precedence
Hispanic/Latino ethnicity → Latino (regardless of race)
White (non-Latino) → European
Black/African American → African
Asian → Asian
Other → Other
Not reported/Unknown/NA → Unknown
HCA Dataset:
# Uses ethnicity ontology aggregated field
european → European
african → African
asian → Asian
hispanic or latino → Latino
mixed → Other
unknown/not provided/NA → Unknown
Sex variables are standardized per dataset as follows:
PsychAD Dataset:
# Uses 'sex' column (already clean)
female → female
male → male
# No missing values in this dataset
HTAN Dataset:
# Uses 'Gender' column
Female → female
Male → male
Unknown/Not Reported/NA → NA
HCA Dataset:
# Uses 'donor_organism.sex' column
female → female
male → male
unknown/yes/homo sapiens/NA → NA
Disease and tissue categories are derived per dataset as follows:
PsychAD Dataset:
- Combines cross-disorder and single diagnosis variables
- Creates unified disease categories (AD, SCZ, DLBD, etc.)
- Adds control category for samples without any listed diseases
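The control-category rule above can be illustrated with a small sketch. It assumes one 0/1 indicator column per diagnosis, which may not match the real PsychAD columns. The toy data, the `unify_disease()` helper, and the choice to join multiple diagnoses with "+" are all assumptions made for illustration.

```r
# Hypothetical input: one 0/1 indicator column per diagnosis.
# The real PsychAD column names and coding may differ.
samples <- data.frame(
  id   = c("s1", "s2", "s3", "s4"),
  AD   = c(1, 0, 0, 1),
  SCZ  = c(0, 1, 0, 0),
  DLBD = c(0, 0, 0, 1)
)
disease_cols <- c("AD", "SCZ", "DLBD")

# Collapse the indicators into one label per sample. Samples with no
# listed diagnosis are labelled "Control"; several diagnoses are joined
# with "+" (an assumption, not necessarily what the script does).
unify_disease <- function(df, cols) {
  flags <- as.matrix(df[, cols, drop = FALSE]) == 1
  flags[is.na(flags)] <- FALSE  # treat missing indicators as "not diagnosed"
  labels <- apply(flags, 1, function(row) {
    if (!any(row)) "Control" else paste(cols[row], collapse = "+")
  })
  unname(labels)
}

samples$disease <- unify_disease(samples, disease_cols)
print(samples[, c("id", "disease")])
```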
HTAN Dataset:
- Extracts cancer types from Primary Diagnosis field
- Maps to standardized cancer categories (Breast, Lung, Colorectal, etc.)
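One way to map free-text diagnoses to categories is keyword matching. The sketch below assumes that approach. The patterns, the example strings, and the "Other" fallback are hypothetical and are not taken from the HTAN script or the actual Primary Diagnosis vocabulary.

```r
# Hypothetical keyword patterns, checked in order; the real category list
# and matching rules may differ.
cancer_patterns <- c(
  Breast     = "breast",
  Lung       = "lung",
  Colorectal = "colon|rectal|colorectal"
)

# map_cancer_type() is a hypothetical helper. The first matching pattern
# wins; non-missing values that match nothing become "Other" (assumed).
map_cancer_type <- function(primary_diagnosis, patterns = cancer_patterns) {
  out <- rep(NA_character_, length(primary_diagnosis))
  for (category in names(patterns)) {
    hit <- is.na(out) &
      grepl(patterns[[category]], primary_diagnosis, ignore.case = TRUE)
    out[hit] <- category
  }
  out[is.na(out) & !is.na(primary_diagnosis)] <- "Other"
  out
}

# Invented example strings
map_cancer_type(c("Breast carcinoma", "Adenocarcinoma of the colon", "Melanoma", NA))
```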
HCA Dataset:
- Uses tissue sheet names as primary tissue type classification
- Maintains original tissue categorization from HCA
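The ancestry mappings listed earlier amount to lookup tables plus, for HTAN, a precedence rule. The sketch below shows one possible implementation under stated assumptions. The function names are hypothetical, and the exact HTAN race and ethnicity strings (for example "black or african american") are guesses that should be checked against the data.

```r
# PsychAD: lookup table taken from the mapping listed above.
psychad_map <- c(
  AFR = "African", AMR = "Latino", EAS = "Asian", SAS = "Asian",
  EUR = "European", EAS_SAS = "Asian"
)

standardize_psychad <- function(code) {
  code <- as.character(code)
  out <- unname(psychad_map[code])
  out[is.na(out)] <- "Other"                          # unmapped values
  out[is.na(code) | code == "Unknown"] <- "Unknown"   # missing or unknown
  out
}

# HTAN: race first, then Latino ethnicity overrides any race value.
# The literal strings matched here are assumptions.
standardize_htan <- function(race, ethnicity) {
  race_l <- tolower(trimws(race))
  eth_l  <- tolower(trimws(ethnicity))
  out <- rep("Unknown", length(race_l))
  out[race_l %in% "white"] <- "European"
  out[race_l %in% c("black or african american", "black", "african american")] <- "African"
  out[race_l %in% "asian"] <- "Asian"
  out[race_l %in% "other"] <- "Other"
  # Exact matching avoids treating "not hispanic or latino" as Latino
  out[eth_l %in% c("hispanic or latino", "hispanic", "latino")] <- "Latino"
  out
}

standardize_psychad(c("AFR", "EAS_SAS", "Unknown", NA, "XYZ"))
standardize_htan(c("White", "White"), c("Not Hispanic or Latino", "Hispanic or Latino"))
```

Applying the ethnicity rule last is what gives Latino precedence regardless of race.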
# Run individual analyses
Rscript equity_bias_analysis_PsychAD.R
Rscript equity_bias_analysis_HTAN.R
Rscript equity_bias_analysis_HCA.R
All analyses generate outputs in the out/ directory:
- PDF visualizations
- Statistical analysis logs
- R workspace files for further analysis
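For readers adapting the scripts, the three output types can be produced with base R alone. The sketch below uses toy data to show one way of doing so. `write_outputs()`, the file names, and the toy counts are hypothetical and do not reflect what the scripts actually write.

```r
# write_outputs() is a hypothetical helper; file names are placeholders.
write_outputs <- function(out_dir = "out", prefix = "example") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  pdf_path <- file.path(out_dir, paste0(prefix, "_plot.pdf"))
  log_path <- file.path(out_dir, paste0(prefix, "_log.txt"))
  rda_path <- file.path(out_dir, paste0(prefix, "_workspace.RData"))

  counts <- c(European = 10, African = 3, Asian = 2)  # toy data

  # PDF visualization
  pdf(pdf_path)
  barplot(counts, main = "Toy ancestry counts")
  invisible(dev.off())

  # Statistical analysis log: redirect printed output to a text file
  sink(log_path)
  print(chisq.test(counts))
  sink()

  # R objects saved for further analysis
  save(counts, file = rda_path)

  c(pdf = pdf_path, log = log_path, workspace = rda_path)
}

# Written to a temporary directory here to avoid touching a real out/ folder
paths <- write_outputs(out_dir = file.path(tempdir(), "out"))
```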