├── wsi_data_pipeline
│   ├── README.md
│   ├── HistoGPT                <- git clone https://github.com/marrlab/HistoGPT.git
│   ├── assets                  <- result image files (.png)
│   ├── histogpt_install_setup  <- files that overwrite some config files in the HistoGPT folder
│   ├── requirements.txt        <- external Python packages (with pinned versions) that the project depends on
│   ├── .dvc                    <- (optional) DVC-internal files, automatically generated by `dvc init` (more details in the following)
│   ├── config                  <- .yml configuration files related to MLOps
│   ├── notebooks               <- Jupyter notebooks for demo experiments
│   ├── logs                    <- log files
│   ├── model_checkpoints       <- checkpoints (.pth format) and related config files
│   │   ├── biogpt
│   │   ├── ctranspath
│   │   └── histogpt
│   ├── data
│   │   ├── interim             <- data in an intermediate processing stage
│   │   │   └── patches_and_embeds <- generated patches and embeddings for the WSIs
│   │   ├── processed           <- data after all preprocessing is done
│   │   │   ├── wsi_texts       <- generated clinical reports for the WSIs
│   │   │   └── result.csv      <- all generated clinical reports aggregated into a single .csv file
│   │   └── raw                 <- original, unmodified WSI images (.svs or .ndpi format)
│   ├── src
│   │   ├── data                <- scripts for data preparation and/or preprocessing
│   │   ├── logger              <- logging configuration scripts
│   │   ├── evaluate            <- model evaluation scripts (#TODO)
│   │   ├── pipelines           <- pipeline scripts (#TODO)
│   │   ├── report              <- visualization scripts (often used in notebooks) (#TODO)
│   │   ├── train               <- model training scripts (#TODO)
│   │   └── utils.py            <- auxiliary functions and classes (#TODO)
│   └── dvc.yaml (optional)     <- all pipeline and stage configs under DVC management (more details in the following)
conda create -n wsi_data_pipeline python=3.10
conda activate wsi_data_pipeline
cd wsi_data_pipeline
git clone https://github.com/marrlab/HistoGPT.git
cp -f ./histogpt_install_setup/setup.py ./HistoGPT/setup.py
## install histogpt
cd HistoGPT
pip install .
cd ..  # back to wsi_data_pipeline
pip install -r requirements.txt
- Download the WSI image zip files, unzip them, and get the WSI images in `.svs` format.
- Change the `.svs` suffix into `.ndpi` (see the rename sketch below).
- Store them in the `wsi_data_pipeline/data/raw/` folder.
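A minimal sketch of the rename step in Python, assuming the zip files have already been extracted into `data/raw/`; the folder path comes from the directory tree above and the suffix change mirrors the steps listed here:

```python
from pathlib import Path

RAW_DIR = Path("data/raw")  # wsi_data_pipeline/data/raw/, as in the directory tree above

# Rename every extracted .svs file so that it carries the .ndpi suffix instead.
for svs_file in sorted(RAW_DIR.glob("*.svs")):
    target = svs_file.with_suffix(".ndpi")
    print(f"{svs_file.name} -> {target.name}")
    svs_file.rename(target)
```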
Go to the following links to download the related model and config files (a quick completeness check is sketched after this list).

- Link for ctranspath and histogpt
  - Download `ctranspath.pth` and move it to the `model_checkpoints/ctranspath/` folder.
  - Download `histogpt-1b-6k-pruned.pth` and move it to the `model_checkpoints/histogpt/` folder.
  - Download the other `.json` files related to the configs of `histogpt` and move them to the `model_checkpoints/histogpt/` folder.
- Link for biogpt
  - Download the `config.json`, `merges.txt`, and `vocab.json` files related to the tokenizer for `prompt` encoding and move them to the `model_checkpoints/biogpt/` folder.
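A small sanity-check sketch for the downloads above. It only checks the file names spelled out in this section; the extra `histogpt` `.json` config files are not checked because their exact names are not listed here:

```python
from pathlib import Path

CKPT_DIR = Path("model_checkpoints")

# File names taken from the download steps above.
expected = [
    CKPT_DIR / "ctranspath" / "ctranspath.pth",
    CKPT_DIR / "histogpt" / "histogpt-1b-6k-pruned.pth",
    CKPT_DIR / "biogpt" / "config.json",
    CKPT_DIR / "biogpt" / "merges.txt",
    CKPT_DIR / "biogpt" / "vocab.json",
]

missing = [str(p) for p in expected if not p.is_file()]
if missing:
    print("Missing files:\n  " + "\n  ".join(missing))
else:
    print("All expected checkpoint and config files are in place.")
```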
If you do not use `DVC` for data management, you can skip this step.
1) Install DVC
pip install dvc
Link for installation instructions
2) Initialize DVC
!! Before you initialize DVC, you must run `git init` to initialize the Git repository (the one you push to GitHub) that you want to use with DVC.
Initialize DVC
cd wsi_data_pipeline
dvc init
Commit dvc init
git commit -m "Initialize DVC"
3) Add remote storage for DVC (any cloud or local folder)
# cd wsi_data_pipeline
dvc config cache.type copy
dvc remote add -d my_storage /tmp/dvc_storage # local-storage
4) Add data under DVC management (any cloud or local folder)
!! You can put any other data you want under DVC management, but once it is under DVC management, you must remove its path from the `.gitignore` in the root of the project (DVC adds its own `.gitignore` entries for the data it tracks).
## commit raw data to Git
dvc add data/raw/
git add data/raw.dvc .gitignore
git commit -m "Add raw data under DVC management"
git push -u origin main
## push data to remote storage
dvc push
1) Set the configs in the `config/wsi_data_pipe_config.yml` file (a minimal loading sketch follows).
- `save_patch_image: bool` controls whether the process saves the generated patches for each WSI file in the `data/interim/patches_and_embeds/patches/` folder.
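A minimal sketch of reading that config from Python, assuming the file is plain YAML; only the `save_patch_image` key comes from the description above, any other keys in your config may differ:

```python
import yaml  # PyYAML

with open("config/wsi_data_pipe_config.yml", "r") as f:
    cfg = yaml.safe_load(f)

# save_patch_image controls whether per-WSI patches are written to
# data/interim/patches_and_embeds/patches/ (see the bullet above).
print("save_patch_image =", cfg.get("save_patch_image", False))
```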
2) Two ways to run
- Run with the following cmd:
cd wsi_data_pipeline
python src/pipelines/wsi_data_pipe.py --config config/wsi_data_pipe_config.yml
- Run the Jupyter notebook step by step:
cd wsi_data_pipeline/notebooks
3) The final generated clinical reports are in the `data/processed/` folder; the (on-demand) generated patches and patch embeddings are in the `data/interim/patches_and_embeds/` folder.
1) Generate patches and embeddings
dvc stage add -n generate_wsi_patches_and_embed \
-d src/data/generate_wsi_patches_and_embed.py \
-d config/wsi_data_pipe_config.yml \
-d data/raw \
python src/data/generate_wsi_patches_and_embed.py --config config/wsi_data_pipe_config.yml
This stage:
- Generates embeddings and (optionally) patches for each WSI using the `CTransPath` model.
- Stores the embeddings for the text-generation step in the `data/interim/patches_and_embeds/h5_files/` folder and the patches (on demand) in the `data/interim/patches_and_embeds/patches/` folder. A quick way to inspect the generated `.h5` files is sketched after the reproduce command below.
Reproduce stage: dvc repro --single-item generate_wsi_patches_and_embed
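A short inspection sketch for the generated embedding files; the dataset keys (typically `feats` for the patch embeddings and `coords` for the patch coordinates) are an assumption about the HistoGPT patching output and may differ in your run:

```python
from pathlib import Path

import h5py

H5_DIR = Path("data/interim/patches_and_embeds/h5_files")

# Pick the first generated embedding file and print its datasets and shapes.
h5_path = next(H5_DIR.rglob("*.h5"))
with h5py.File(h5_path, "r") as f:
    print(h5_path.name)
    for key in f.keys():  # e.g. 'feats' (patch embeddings), 'coords' (patch coordinates)
        print(f"  {key}: shape={f[key].shape}, dtype={f[key].dtype}")
```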
2) Generate clinical report texts from the generated WSI embeddings
dvc stage add -n generate_wsi_text_with_patches_and_prompt_embeds \
-d src/data/generate_wsi_texts.py \
-d config/wsi_data_pipe_config.yml \
python src/data/generate_wsi_texts.py --config config/wsi_data_pipe_config.yml
This stage:
- Generates a clinical report for each WSI image using the `prompt` embeddings and the `patch` embeddings. A hedged sketch of this step is shown after the reproduce command below.
- Stores each clinical report in the `data/processed/` folder under the name `$wsi_name.txt`.
Reproduce stage: dvc repro --single-item generate_wsi_text_with_patches_and_prompt_embeds
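For orientation only, a heavily hedged sketch of what this stage does, modeled on the usage example in the HistoGPT README. The `generate(...)` helper, its argument names, the `feats` key, the example prompt, and the `example_wsi.h5` file name are all assumptions here; the authoritative implementation is `src/data/generate_wsi_texts.py`:

```python
import h5py
import torch
from transformers import BioGptConfig, BioGptTokenizer
from histogpt.models import HistoGPTForCausalLM, PerceiverResamplerConfig
from histogpt.helpers.inference import generate  # assumed helper, as shown in the HistoGPT README

device = "cuda" if torch.cuda.is_available() else "cpu"

# HistoGPT weights downloaded into model_checkpoints/histogpt/ (see the model download step).
model = HistoGPTForCausalLM(BioGptConfig(), PerceiverResamplerConfig()).to(device)
state_dict = torch.load("model_checkpoints/histogpt/histogpt-1b-6k-pruned.pth", map_location=device)
model.load_state_dict(state_dict, strict=True)

# BioGPT tokenizer files downloaded into model_checkpoints/biogpt/.
tokenizer = BioGptTokenizer.from_pretrained("model_checkpoints/biogpt")
prompt = tokenizer.encode("Final diagnosis:", return_tensors="pt").to(device)  # example prompt

# Patch embeddings produced by the previous stage (hypothetical file name).
with h5py.File("data/interim/patches_and_embeds/h5_files/example_wsi.h5", "r") as f:
    features = torch.tensor(f["feats"][:]).unsqueeze(0).to(device)

output = generate(model=model, prompt=prompt, image=features,
                  length=256, top_k=40, top_p=0.95, temp=0.7, device=device)
report_text = tokenizer.decode(output[0, 1:])
print(report_text)  # the pipeline stores this as $wsi_name.txt under data/processed/
```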
3) Aggregate all generated clinical report texts
dvc stage add -n aggregate_all_wsi_texts \
-d src/data/aggregate_all_wsi_texts.py \
-d config/wsi_data_pipe_config.yml \
-d data/processed/clinical_reports \
python src/data/aggregate_all_wsi_texts.py --config config/wsi_data_pipe_config.yml
This stage aggregates all clinical report texts into a final summarized output. The results are persisted as the final processed data (a minimal aggregation sketch follows the reproduce command below).
Reproduce stage: dvc repro --single-item aggregate_all_wsi_texts
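A minimal sketch of the aggregation idea, assuming the per-WSI reports are plain `.txt` files under `data/processed/` and that the combined table is the `result.csv` listed in the directory tree; the column names are illustrative, and the actual stage lives in `src/data/aggregate_all_wsi_texts.py`:

```python
import csv
from pathlib import Path

PROCESSED_DIR = Path("data/processed")
OUT_CSV = PROCESSED_DIR / "result.csv"  # as listed in the directory tree above

rows = []
for txt_file in sorted(PROCESSED_DIR.rglob("*.txt")):
    # One row per WSI: the file stem is the WSI name, the content is the generated report.
    rows.append({
        "wsi_name": txt_file.stem,
        "clinical_report": txt_file.read_text(encoding="utf-8").strip(),
    })

with open(OUT_CSV, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["wsi_name", "clinical_report"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Aggregated {len(rows)} reports into {OUT_CSV}")
```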
1) View the pipeline structure and dependencies
dvc dag
2) Run the entire pipeline
dvc repro
1) Commit pipeline metadata to GitHub
git add dvc.yaml dvc.lock
git commit -m "Add full WSI data pipeline with DVC"
git push -u origin main
2) Push data to remote storage
dvc push
- Run everything step by step in the Jupyter notebooks in the `wsi_data_pipeline/notebooks/` folder.
- You can customize different pipelines within different modules as you wish, in `.py` format, and store them in the `pipelines` folder.
  - Pipeline (Python) scripts location: `src/pipelines/`
- You can also write `DVC` stages directly in the `dvc.yaml` file without using the `CLI`.