This repository contains scripts for analyzing light modulation simulations and generating figures for publication.
```bash
# Install all dependencies
pip install -r requirements.txt

# Or install manually
pip install huggingface_hub pandas numpy matplotlib scipy scikit-learn plotly tqdm pyyaml shap
```

Note: The `shap` package is required for running `shap_feature_ranking.py`. All other dependencies are required for the core analysis pipeline.
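A quick optional way to confirm the installation is to try importing each package (note that scikit-learn imports as `sklearn` and pyyaml as `yaml`):

```python
# Optional sanity check: shap is only needed for shap_feature_ranking.py.
import importlib

for pkg in ["huggingface_hub", "pandas", "numpy", "matplotlib", "scipy",
            "sklearn", "plotly", "tqdm", "yaml", "shap"]:
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError:
        print(f"{pkg}: MISSING")
```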
The data folder (containing JT curves, features, and analysis outputs) is hosted on Hugging Face Hub. This allows for version-controlled, accessible data storage and easy sharing of the dataset.
For public repositories, no authentication is required. For private repositories, you need to authenticate:
```bash
# Option 1: Use the Hugging Face CLI (recommended)
huggingface-cli login

# Option 2: Set an environment variable
export HF_TOKEN=your_token_here

# Option 3: Pass the token directly to the script
python download_data_from_hf.py --repo-id <username/dataset-name> --token your_token_here
```
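The same authentication can also be done programmatically; `huggingface_hub` exposes a `login` helper (the token value below is a placeholder):

```python
# Programmatic equivalent of `huggingface-cli login`; token is a placeholder.
from huggingface_hub import login

login(token="your_token_here")  # caches the token for subsequent Hub calls
```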
```bash
# Download data from Hugging Face
python download_data_from_hf.py --repo-id <username/dataset-name>

# Example:
# python download_data_from_hf.py --repo-id your-username/light-modulation-data

# Additional options:
# --data-dir: Specify target directory (default: 'data')
# --filename: Specify archive filename (default: 'data.tar.gz')
# --force: Overwrite existing data directory without prompting
# --keep-archive: Keep the downloaded archive file after extraction
```

The script will:
- Download the compressed data archive (`data.tar.gz`) from Hugging Face Hub
- Extract it to the `data/` folder in the git root
- Verify the extraction and check for expected subdirectories
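For reference, the download-and-extract flow corresponds to a couple of `huggingface_hub` and `tarfile` calls. A minimal sketch (the repo id is a placeholder, and the script adds the verification and prompting described above):

```python
# Minimal download-and-extract sketch; repo id is a placeholder.
import tarfile
from huggingface_hub import hf_hub_download

archive = hf_hub_download(
    repo_id="your-username/light-modulation-data",
    filename="data.tar.gz",
    repo_type="dataset",
)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(".")  # assumes the archive unpacks to a top-level data/ folder
```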
Note: The data folder should be located at the git root level (`data/`) for all scripts to work correctly. The download script automatically places it in the correct location.
The downloaded data archive typically contains:
- `modulation_profile/`: Light modulation waveform files (sinusoidal, square)
- `all_jt_curves/`: Complete set of JT (current-time) curve files
- `jt_features_out/`: Extracted JT cycle features (peak values, rise/fall times, etc.)
- `fft_features_out/`: FFT frequency analysis features (fundamental frequency, harmonics, THD, etc.)
- `figures/`: Pre-generated analysis plots and figures
- `graspi_features.csv`: Morphology features extracted from simulation images
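The tabular outputs can be inspected directly with pandas. The path below assumes the default `data/` layout; the column names are not documented here, so print them rather than assuming any:

```python
# Peek at the morphology feature table (path assumes the default data/ layout).
import pandas as pd

graspi = pd.read_csv("data/graspi_features.csv")
print(graspi.shape)
print(graspi.columns.tolist())
```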
To upload the data folder to Hugging Face Hub for sharing and version control:
- Create a Hugging Face account at huggingface.co
- Create a dataset repository on Hugging Face Hub:
  - Go to your profile → "New dataset"
  - Choose a name and visibility (public/private)
  - The repository type should be "dataset"
- Authenticate using one of the methods described in the Downloading section above
```bash
# Upload data to Hugging Face
python upload_data_to_hf.py --repo-id <username/dataset-name>

# Example:
# python upload_data_to_hf.py --repo-id your-username/light-modulation-data

# Additional options:
# --data-dir: Specify source data directory (default: 'data')
# --output-archive: Specify output archive path (default: 'data.tar.gz' in git root)
# --skip-archive: Use an existing archive file instead of creating a new one
# --keep-archive: Keep the archive file after upload
# --token: Hugging Face token (or set the HF_TOKEN env var)
```

The script will:
- Create a compressed tar.gz archive of the `data/` folder
- Upload the archive to the specified Hugging Face dataset repository
- Create the repository if it doesn't exist (requires appropriate permissions)
Note: The repository will be created automatically if it doesn't exist. Make sure you have the necessary permissions on your Hugging Face account.
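Under the hood this maps onto a couple of `huggingface_hub` calls. A minimal sketch (the repo id is a placeholder, and `upload_data_to_hf.py` additionally handles archiving and the options above):

```python
# Minimal upload sketch using huggingface_hub; repo id is a placeholder.
from huggingface_hub import HfApi

api = HfApi()  # uses HF_TOKEN or the cached CLI login
api.create_repo("your-username/light-modulation-data",
                repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="data.tar.gz",
    path_in_repo="data.tar.gz",
    repo_id="your-username/light-modulation-data",
    repo_type="dataset",
)
```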
- Version control: Each upload creates a new version. Use descriptive commit messages when uploading new versions.
- File size: Large datasets (>5GB) may require Git LFS. Hugging Face Hub handles this automatically for large files.
- Privacy: Use private repositories for sensitive data. Public repositories are accessible to everyone.
- Documentation: Add a README.md in your Hugging Face dataset repository to document the data structure and usage.
Scripts for generating publication-quality figures and analysis:
- `ablation_study_and_parity_plot.py`: Generates ablation study plots showing feature importance, and parity plots comparing model predictions against actual values for various target variables (interfacial area, domain size, connectivity, donor fraction).
- `jt_fft_single_plot.py`: Creates annotated JT (current-time) cycle plots and FFT spectrum plots for both sinusoidal and square light modulation waveforms, showing peak detection, rise/fall times, and harmonic analysis.
- `shap_feature_ranking.py`: Performs SHAP (SHapley Additive exPlanations) feature ranking to interpret machine learning model predictions and identify the most important features for each target variable; a minimal sketch appears after this list.
- `config.yaml`: Configuration file containing paths, JT analysis parameters, FFT analysis parameters, and plotting settings.
- `utils.py`: Utility functions for data loading, JT cycle analysis, FFT analysis, plotting, and feature extraction.
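For orientation, SHAP feature ranking typically boils down to fitting a model, computing SHAP values, and sorting features by mean absolute attribution. The sketch below uses synthetic data and a random forest as placeholders; the real pipeline lives in `shap_feature_ranking.py`:

```python
# Minimal SHAP ranking sketch (synthetic data and model are assumptions).
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                     # stand-in feature matrix
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=200)  # stand-in target

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)  # (n_samples, n_features)

ranking = np.argsort(np.abs(shap_values).mean(axis=0))[::-1]
print("features ranked by mean |SHAP|:", ranking)
```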
Scripts for running simulations and collecting data:
- `coreset_selection.py`: Selects a representative coreset (subset) of the full dataset using PCA dimensionality reduction and distance-based selection to minimize redundancy while preserving the data distribution; see the sketch after this list.
- `setup_run.py`: Sets up simulation run directories from coreset-selected files, creating the necessary directory structure (checkpoint_creation, sinusoidal, square) and copying the required input files for each run.
- `submit_jobs.sh`: Submits simulation jobs to a cluster scheduler (e.g., SLURM) for batch processing. Supports specifying job ranges and manages job submission limits.
- `collect_jt_curves.sh`: Collects JT curve data files from all simulation run directories and organizes them into a central location for analysis. Can process a specified range of runs.
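The coreset idea can be illustrated with a short, self-contained sketch: PCA compresses the feature space, then a greedy farthest-point pass picks samples that are maximally spread out. This shows the general technique, not `coreset_selection.py`'s exact logic:

```python
# Illustrative PCA + farthest-point coreset selection (a sketch, not the
# repository's exact algorithm; data here is synthetic).
import numpy as np
from sklearn.decomposition import PCA

def select_coreset(X, k):
    """Greedily pick k rows of X that are well spread out in PCA space."""
    Z = PCA(n_components=min(10, X.shape[1])).fit_transform(X)
    chosen = [0]                             # seed with the first sample
    dist = np.linalg.norm(Z - Z[0], axis=1)  # distance to nearest chosen point
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))           # farthest remaining sample
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(Z - Z[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
print(select_coreset(rng.normal(size=(100, 20)), k=10))
```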