WSI Data Pipeline

Directory Structure

├── wsi_data_pipeline
│   ├── README.md
│   ├── HistoGPT                  <- git clone https://github.com/marrlab/HistoGPT.git
│   ├── assets                    <- result image files (.png)
│   ├── histogpt_install_setup    <- files that overwrite some config files in the HistoGPT folder
│   ├── requirements.txt          <- external Python packages and the specific versions the project depends on
│   ├── .dvc                      <- (optional) data-version-control files, generated automatically
│   │                                by the cmd `dvc init` (more details in the following)
│   ├── config                    <- .yml configuration files related to MLOps
│   ├── notebooks                 <- Jupyter notebooks for demo experiments
│   ├── logs                      <- log files
│   ├── model_checkpoints         <- checkpoints (.pth format) and related config files
│   │   ├── biogpt
│   │   ├── ctranspath
│   │   └── histogpt
│   ├── data
│   │   ├── interim               <- data in an intermediate processing stage
│   │   │   └── patches_and_embeds <- generated patches and embeddings for WSIs
│   │   ├── processed             <- data after all preprocessing is done
│   │   │   ├── wsi_texts         <- generated clinical reports for WSIs
│   │   │   └── result.csv        <- all generated clinical reports aggregated into one .csv file
│   │   └── raw                   <- original, unmodified WSI images (.svs or .ndpi format)
│   ├── src
│   │   ├── data                  <- scripts for data preparation and/or preprocessing
│   │   ├── logger                <- scripts for logging configuration
│   │   ├── evaluate              <- scripts for model evaluation (#TODO)
│   │   ├── pipelines             <- scripts for pipelines (#TODO)
│   │   ├── report                <- scripts for visualization (often used in notebooks) (#TODO)
│   │   ├── train                 <- scripts for model training (#TODO)
│   │   └── utils.py              <- auxiliary functions and classes (#TODO)
│   └── dvc.yaml                  <- (optional) all configs for pipelines and stages
│                                    under DVC management (more details in the following)

Preparation

1. Create the conda environment wsi_data_pipeline

conda create -n wsi_data_pipeline python=3.10
conda activate wsi_data_pipeline

2. Clone the HistoGPT repository and install the histogpt API

cd wsi_data_pipeline

git clone https://github.com/marrlab/HistoGPT.git
cp -f ./histogpt_install_setup/setup.py ./HistoGPT/setup.py

## install histogpt
cd HistoGPT
pip install .

3. Install the remaining dependencies into the environment

cd wsi_data_pipeline
pip install -r requirements.txt

4. Download WSI images

  • Download the WSI image zip files and unzip them to obtain the WSI images in .svs format.
  • Rename the .svs suffix to .ndpi.
  • Store the renamed files in the wsi_data_pipeline/data/raw/ folder (see the sketch below).
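
A minimal shell sketch of these three steps (the archive name is a placeholder for whatever zip file you downloaded):

cd wsi_data_pipeline
unzip <wsi_archive>.zip -d data/raw/                          # placeholder archive name
for f in data/raw/*.svs; do mv "$f" "${f%.svs}.ndpi"; done    # rename .svs -> .ndpi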

5. Download models

Use the following links to download the required models and config files.

  • Link for ctranspath and histogpt
    1. Download ctranspath.pth and move it to the model_checkpoints/ctranspath/ folder.
    2. Download histogpt-1b-6k-pruned.pth and move it to the model_checkpoints/histogpt/ folder.
    3. Download other .json files related to the configs of histogpt and move them to the model_checkpoints/histogpt/ folder.
  • Link for biogpt
    1. Download config.json, merges.txt, vocab.json files related to the tokenizer for prompt encoding and move them to the model_checkpoints/biogpt/ folder.
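
A minimal shell sketch of where the downloaded files should end up (<download_dir> and the HistoGPT config file names are placeholders for your actual download location and files):

cd wsi_data_pipeline
mkdir -p model_checkpoints/ctranspath model_checkpoints/histogpt model_checkpoints/biogpt
mv <download_dir>/ctranspath.pth model_checkpoints/ctranspath/
mv <download_dir>/histogpt-1b-6k-pruned.pth model_checkpoints/histogpt/
mv <download_dir>/<histogpt_config>.json model_checkpoints/histogpt/      # HistoGPT config files
mv <download_dir>/config.json <download_dir>/merges.txt <download_dir>/vocab.json model_checkpoints/biogpt/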

6. Initialize DVC (Optional)

If you do not use DVC for data management, you can skip this step.

1) Install DVC

pip install dvc

Link for installation instructions

2) Initialize DVC

!! Before you initialize DVC, you must initialize the Git repository you want to use with DVC by running git init.

Initialize DVC

cd wsi_data_pipeline
dvc init

Commit the DVC initialization

git commit -m "Initialize DVC"

3) Add remote storage for DVC (any cloud or local folder)

Link for Cloud Storage set-up

# cd wsi_data_pipeline
dvc config cache.type copy
dvc remote add -d my_storage /tmp/dvc_storage # local-storage
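
If you prefer a cloud remote instead of a local folder, the same command accepts a cloud URL; a minimal sketch for S3 (the bucket and path are placeholders, and the dvc[s3] extra is required):

pip install "dvc[s3]"
dvc remote add -d my_storage s3://<my-bucket>/dvc_storage   # placeholder bucket/path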

4) Add data under DVC management

!! You can put any other data you want under DVC management as well, but before adding it you must remove those paths from the .gitignore in the root of the project (DVC adds its own ignore entries when you run dvc add).

## add raw data under DVC management
dvc add data/raw/

git add data/raw.dvc .gitignore
git commit -m "Add raw data under DVC management"
git push -u origin main

## push data to remote storage
dvc push
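
On another machine (or for a collaborator), the versioned data can be fetched back from remote storage; a minimal sketch, assuming the Git remote and DVC remote are already configured as above:

git clone <your_repo_url>
cd wsi_data_pipeline
dvc pull    # downloads data/raw/ from the configured DVC remote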

How to Run

Run without DVC (Quick Start)

1) Set configs in the config/wsi_data_pipe_config.yml file.

  • save_patch_image (bool) controls whether the pipeline saves the generated patches for each WSI file in the data/interim/patches_and_embeds/patches/ folder (see the excerpt below).
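
A minimal excerpt of how that flag would appear in config/wsi_data_pipe_config.yml (other keys in the actual file are omitted here):

save_patch_image: true   # set to false to skip writing patch images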

2) Two ways to run

  • Run with the following command:

  cd wsi_data_pipeline
  python src/pipelines/wsi_data_pipe.py --config config/wsi_data_pipe_config.yml

  • Run the Jupyter notebooks step by step:

  cd wsi_data_pipeline/notebooks

3) The final generated clinical reports are in the data/processed/ folder; the (on-demand) generated patches and patch embeddings are in the data/interim/patches_and_embeds/ folder. A sketch of the layout is shown below.
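
Expected layout after a successful run (file names are illustrative, based on the directory structure above):

data/
├── interim/patches_and_embeds/
│   ├── h5_files/<wsi_name>.h5     # patch embeddings per WSI
│   └── patches/                   # patch images, only if save_patch_image is true
└── processed/
    ├── wsi_texts/<wsi_name>.txt   # generated clinical report per WSI
    └── result.csv                 # aggregated reports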

Run with DVC (Optional)

1. Defining the Pipeline Stages with the CLI

1) Generate patches and embeddings

dvc stage add -n generate_wsi_patches_and_embed \
  -d src/data/generate_wsi_patches_and_embed.py \
  -d config/wsi_data_pipe_config.yml \
  -d data/raw \
  python src/data/generate_wsi_patches_and_embed.py --config config/wsi_data_pipe_config.yml

this stage:

  1. Generates embeddings and (optionally) patches for each WSI using the CTransPath model.
  2. Stores the embeddings for the text generation step in the data/interim/patches_and_embeds/h5_files/ folder and the patches (on demand) in the data/interim/patches_and_embeds/patches/ folder.

Reproduce stage: dvc repro --single-item generate_wsi_patches_and_embed
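
If you also want DVC to track and cache the stage outputs, the same stage can declare them with -o; a hedged sketch assuming the output folder named above (-f overwrites the previously added stage):

dvc stage add -n generate_wsi_patches_and_embed -f \
  -d src/data/generate_wsi_patches_and_embed.py \
  -d config/wsi_data_pipe_config.yml \
  -d data/raw \
  -o data/interim/patches_and_embeds \
  python src/data/generate_wsi_patches_and_embed.py --config config/wsi_data_pipe_config.yml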

2) Generate clinical report texts from the generated WSI embeddings

dvc stage add -n generate_wsi_text_with_patches_and_prompt_embeds \
  -d src/data/generate_wsi_texts.py \
  -d config/wsi_data_pipe_config.yml \
  python src/data/generate_wsi_texts.py --config config/wsi_data_pipe_config.yml

this stage:

  1. Generates a clinical report for each WSI image using prompt embeddings and patch embeddings.
  2. Stores each clinical report in the data/processed/ folder under the name $wsi_name.txt.

Reproduce stage: dvc repro --single-item generate_wsi_text_with_patches_and_prompt_embeds

3) Aggregate all generated clinical report texts

dvc stage add -n aggregate_all_wsi_texts \
  -d src/data/aggregate_all_wsi_texts.py \
  -d config/wsi_data_pipe_config.yml \
  -d data/processed/clinical_reports \
  python src/data/aggregate_all_wsi_texts.py --config config/wsi_data_pipe_config.yml

this stage: Aggregates all clinical report texts into a final summarized output. The result is persisted as the final processed data.

Reproduce stage: dvc repro --single-item aggregate_all_wsi_texts

2. Running the Pipeline

1) View the pipeline structure and dependencies

dvc dag

2) Running the entire pipeline

dvc repro

3. After Running: Data Sharing and Version Control

1) Commit pipeline metadata to GitHub

git add dvc.yaml dvc.lock
git commit -m "Add full WSI data pipeline with DVC"
git push -u origin main

2) Push data to remote storage

dvc push

Tutorial

1. All in Jupyter Notebooks

  • Run everything in the Jupyter notebooks in the wsi_data_pipeline/notebooks/ folder.

2. All related stages in src modules

  • You can customize different pipelines from the individual modules as you wish (as .py scripts) and store them in the pipelines folder.

  • Pipeline (python) scripts location: src/pipelines/

3. All stages in the dvc.yaml file

4. DVC Documentation

Results

Model: histogpt-1b-6k-pruned.pth

1) TCGA-GU-A42R-01A-01-TSA

2) TCGA-GU-A763-01A-01-TS1

3) TCGA-GU-AATQ-01A-01-TSA
