# local/iatlas/README.md

### Overview

#### maf.py
This script runs the iatlas mutations data through Genome Nexus so it can be ingested by the cbioportal team for visualization.

The script does the following:
5. [Creates the required meta_* data](https://github.com/cBioPortal/datahub-study-curation-tools/tree/master/generate-meta-files)
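For intuition, here is a minimal sketch of what a `concatenate_mafs()`-style step (see **Outputs** below) might look like; the paths, globbing, and column handling are assumptions, not the script's actual implementation:

```
# A hedged sketch, NOT the actual maf.py implementation: stack per-sample
# Genome Nexus-annotated MAFs into a single data_mutations_annotated.txt.
from pathlib import Path

import pandas as pd


def concatenate_mafs(maf_dir: str, output_path: str) -> pd.DataFrame:
    """Concatenate all annotated MAF files in maf_dir into one TSV."""
    mafs = [
        pd.read_csv(path, sep="\t", comment="#", low_memory=False)
        for path in sorted(Path(maf_dir).glob("*.maf"))
    ]
    combined = pd.concat(mafs, ignore_index=True)
    combined.to_csv(output_path, sep="\t", index=False)
    return combined


concatenate_mafs("annotated_mafs/", "data_mutations_annotated.txt")
```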


#### clinical.py
This script processes and transforms the iatlas clinical data into a cbioportal-friendly format so it can be ingested by the cbioportal team for visualization.

The script does the following:
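To make the transformation concrete, here is a hedged sketch of the column-mapping idea; the file names, mapping columns, and attribute names are illustrative assumptions, and the real logic lives in `clinical.py`:

```
# Illustrative only: rename iatlas clinical columns to cbioportal attribute
# names via the cli-to-cbio mapping, then attach OncoTree codes.
import pandas as pd

clinical = pd.read_csv("iatlas_clinical.tsv", sep="\t")

# e.g. a mapping table with columns like: iatlas_column -> cbio_column
mapping = pd.read_csv("cli_to_cbio_mapping.tsv", sep="\t")
rename_map = dict(zip(mapping["iatlas_column"], mapping["cbio_column"]))
clinical = clinical.rename(columns=rename_map)

# e.g. a tumor-type -> ONCOTREE_CODE lookup
oncotree = pd.read_csv("cli_to_oncotree_mapping.tsv", sep="\t")
code_map = dict(zip(oncotree["tumor_type"], oncotree["ONCOTREE_CODE"]))
clinical["ONCOTREE_CODE"] = clinical["TUMOR_TYPE"].map(code_map)

clinical.to_csv("data_clinical_patient.txt", sep="\t", index=False)
```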


### Setup
Prior to testing, developing, or running this locally, you will need to set up the Docker image.
Optional: you can also build your environment in a Python venv and install the dependencies pinned in `uv.lock`:

1. Create and activate your venv

```
python3 -m venv <your_env_name>
source <your_env_name>/bin/activate
```

2. Export dependencies from uv.lock

```
pip install uv
uv export > requirements.txt
```

3. Install into your venv

```
pip install -r requirements.txt
```
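To sanity-check the install, you can import the core dependencies (this assumes `synapseclient` and `pandas` are among the packages pinned in `uv.lock`):

```
python3 -c "import synapseclient, pandas; print(synapseclient.__version__, pandas.__version__)"
```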

However, it is highly recommended that you use the Docker image:

1. Build the Docker image

```
cd /orca-recipes/local/iatlas/cbioportal_export
docker build -f Dockerfile -t <some_docker_image_name> .
```

2. Run the Docker container

```
docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
```
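`SYNAPSE_AUTH_TOKEN` is the environment variable `synapseclient` reads for authentication; set your token in the shell before the `docker run` above, for example:

```
export YOUR_SYNAPSE_TOKEN=<your-synapse-personal-access-token>
```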

3. Follow the **How to Run** section below

### How to Run

Getting help
```
python3 clinical.py --help
```

```
python3 maf.py --help
```

```
python3 load.py --help
```

### Outputs
This pipeline generates the following key datasets that eventually get uploaded to Synapse.
All datasets will be saved to:
`<datahub_tools_path>/add-clinical-header/<dataset_name>/` unless otherwise stated

#### maf.py

- `data_mutations_annotated.txt` – Annotated MAF file from Genome Nexus
- Generated by: `concatenate_mafs()`
- Generated by: the `generate-meta-files` code in `datahub-study-curation-tools`
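For reference, the generated meta files are small key-value text files; a `meta_mutations_extended.txt` typically looks like this (values shown are illustrative):

```
cancer_study_identifier: <study_id>
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_name: Mutations
profile_description: Mutation data for <dataset_name>
data_filename: data_mutations_annotated.txt
```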


#### clinical.py

- `data_clinical_patient.txt` – Clinical patient data file
- Generated by: `add_clinical_header()`
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: the `generate-case-lists` code in `datahub-study-curation-tools`
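For reference, `add_clinical_header()` produces the standard cbioportal clinical file layout: four `#`-prefixed metadata rows (display names, descriptions, datatypes, priorities) above the tab-delimited attribute header. A small illustrative example (columns are tab-separated in the real file):

```
#Patient Identifier	Sex	Age
#Patient identifier	Sex	Age at diagnosis
#STRING	STRING	NUMBER
#1	1	1
PATIENT_ID	SEX	AGE
PATIENT_1	Male	64
```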


#### validate.py

- `iatlas_validation_log.txt` – Results from our own iatlas validation checks for all of the files
- Generated by: each validation function, which appends its results to this log

- `cbioportal_validator_output.txt` – Validator results from cbioportal for all of the files, not just clinical
- Generated by: the `cbioportal` validator code
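Under the hood, cbioportal's offline validator is typically invoked roughly like this; the script's location inside the cloned `cbioportal` repo varies by version, so treat the path as an assumption (`-s` points at the study directory, `-n` skips portal checks):

```
python3 /<some_path>/cbioportal/core/src/main/scripts/importer/validateData.py \
-s <datahub_tools_path>/add-clinical-header/<dataset_name>/ \
-n > cbioportal_validator_output.txt
```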

#### load.py

- `cases_all.txt` – case list file for all the clinical samples in the study
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: the `generate-case-lists` code in `datahub-study-curation-tools`
- `cases_sequenced.txt` – case list file for all the sequenced samples in the study
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: the `generate-case-lists` code in `datahub-study-curation-tools`

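For reference, a case list is a small key-value text file whose `case_list_ids` field is a tab-separated list of sample IDs; `cases_all.txt` typically looks like this (IDs shown are illustrative):

```
cancer_study_identifier: <study_id>
stable_id: <study_id>_all
case_list_name: All samples
case_list_description: All samples in the study
case_list_ids: SAMPLE_1	SAMPLE_2	SAMPLE_3
```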


Any additional files are intermediate processing files and can be ignored.


### General Workflow

1. Run processing on the maf datasets via `maf.py`
2. Run processing on the clinical datasets via `clinical.py`
3. Run `load.py` to create case lists
4. Run the general validation + cbioportal validator on your output files via `validate.py`
5. Check your `cbioportal_validator_output.txt`
6. Resolve any `ERROR`s
7. Repeat steps 4-6 until all `ERROR`s are gone
8. Run `load.py` again, now with the `upload` flag, to upload to Synapse

**Sample workflow:**

Run clinical processing
```
python3 clinical.py \
--input_df_synid syn66314245 \
--cli_to_cbio_mapping_synid syn66276162 \
--cli_to_oncotree_mapping_synid syn66313842 \
--output_folder_synid syn64136279 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--lens_id_mapping_synid syn68826836 \
--neoantigen-data-synid syn21841882
```

Run maf processing
```
python3 maf.py \
--dataset Riaz \
--input_folder_synid syn68785881 \
--output_folder_synid syn68633933 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--n_workers 3
```

Create the case lists
```
python3 load.py \
--dataset Riaz \
--output_folder_synid syn64136279 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--create_case_lists
```

Run the general iatlas validation + cbioportal validator on all files
```
python3 validate.py \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--neoantigen_data_synid syn69918168 \
--cbioportal_path /<some_path>/cbioportal/ \
--dataset Riaz
```

Save into Synapse with version comment `v1`

```
python3 load.py \
--dataset Riaz \
--output_folder_synid syn64136279 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--version_comment "v1" \
--upload
```

### Running tests

Tests are written via `pytest`.

In your Docker environment or local environment, install `pytest` via

```
pip install pytest
```

Then run all tests via
```
python3 -m pytest tests
```
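While iterating, you can also target a single test file or test with pytest's standard `-k` filter (the file and test names here are hypothetical):

```
python3 -m pytest tests/test_clinical.py -k "add_clinical_header" -v
```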
# local/iatlas/cbioportal_export/Dockerfile

```
# uv + Python 3.10 preinstalled
FROM ghcr.io/astral-sh/uv:python3.10-bookworm
WORKDIR /root/cbioportal_export/

RUN uv venv /opt/venv
# Use the virtual environment automatically
ENV VIRTUAL_ENV=/opt/venv
# Place entry points in the environment at the front of the path
ENV PATH="/opt/venv/bin:$PATH"

# Install dependencies
COPY pyproject.toml uv.lock* ./

# Install exactly what's locked (fails if lock is out of date)
RUN uv sync --frozen --no-dev

# copy code
COPY . .

WORKDIR /root/

# clone dep repos
RUN git clone https://github.com/rxu17/datahub-study-curation-tools.git -b upgrade-to-python3
RUN git clone https://github.com/cBioPortal/cbioportal.git -b v6.3.2

WORKDIR /root/cbioportal_export/
```

PR review comments on the `datahub-study-curation-tools` clone line:

> **Contributor Author:** This will need to be updated once the PR to this repo: https://github.com/cBioPortal/datahub-study-curation-tools gets merged

> **Contributor:** Maybe we can ask Ramya and Ritika to see if the code can be reviewed, but this should probably be a Dockerfile comment rather than a PR comment. I'm unsure when that will be reviewed and merged.