# local/iatlas/README.md

### Overview

#### maf.py
This script runs the iatlas mutations data through Genome Nexus so it can be ingested by the cbioportal team for visualization.

The script does the following:
5. [Creates the required meta_* data](https://github.com/cBioPortal/datahub-study-curation-tools/tree/master/generate-meta-files)
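For intuition, here is a minimal sketch of what a `concatenate_mafs()`-style step (see **Outputs** below) might look like; the paths, globbing, and column handling are assumptions, not the script's actual implementation:

```
# A hedged sketch, NOT the actual maf.py implementation: stack per-sample
# Genome Nexus-annotated MAFs into a single data_mutations_annotated.txt.
from pathlib import Path

import pandas as pd


def concatenate_mafs(maf_dir: str, output_path: str) -> pd.DataFrame:
    """Concatenate all annotated MAF files in maf_dir into one TSV."""
    mafs = [
        pd.read_csv(path, sep="\t", comment="#", low_memory=False)
        for path in sorted(Path(maf_dir).glob("*.maf"))
    ]
    combined = pd.concat(mafs, ignore_index=True)
    combined.to_csv(output_path, sep="\t", index=False)
    return combined


concatenate_mafs("annotated_mafs/", "data_mutations_annotated.txt")
```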


#### clinical.py
This script processes and transforms the iatlas clinical data into a cbioportal-friendly format so it can be ingested by the cbioportal team for visualization.

The script does the following:
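To make the transformation concrete, here is a hedged sketch of the column-mapping idea; the file names, mapping columns, and attribute names are illustrative assumptions, and the real logic lives in `clinical.py`:

```
# Illustrative only: rename iatlas clinical columns to cbioportal attribute
# names via the cli-to-cbio mapping, then attach OncoTree codes.
import pandas as pd

clinical = pd.read_csv("iatlas_clinical.tsv", sep="\t")

# e.g. a mapping table with columns like: iatlas_column -> cbio_column
mapping = pd.read_csv("cli_to_cbio_mapping.tsv", sep="\t")
rename_map = dict(zip(mapping["iatlas_column"], mapping["cbio_column"]))
clinical = clinical.rename(columns=rename_map)

# e.g. a tumor-type -> ONCOTREE_CODE lookup
oncotree = pd.read_csv("cli_to_oncotree_mapping.tsv", sep="\t")
code_map = dict(zip(oncotree["tumor_type"], oncotree["ONCOTREE_CODE"]))
clinical["ONCOTREE_CODE"] = clinical["TUMOR_TYPE"].map(code_map)

clinical.to_csv("data_clinical_patient.txt", sep="\t", index=False)
```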


### Setup
Prior to testing, developing, or running this locally, you will need to set up the Docker image.
Optional: you can also build your environment in a Python venv and install the dependencies pinned in `uv.lock`:

1. Create and activate your venv

```
python3 -m venv <your_env_name>
source <your_env_name>/bin/activate
```

2. Export dependencies from uv.lock

```
pip install uv
uv export > requirements.txt
```

3. Install into your venv

```
pip install -r requirements.txt
```
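To sanity-check the install, you can import the core dependencies (this assumes `synapseclient` and `pandas` are among the packages pinned in `uv.lock`):

```
python3 -c "import synapseclient, pandas; print(synapseclient.__version__, pandas.__version__)"
```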

However, it is highly recommended that you use the Docker image:

1. Build the Docker image

```
cd /orca-recipes/local/iatlas/cbioportal_export
docker build -f Dockerfile -t <some_docker_image_name> .
```

2. Run the Docker container

```
docker run --rm -it -e SYNAPSE_AUTH_TOKEN=$YOUR_SYNAPSE_TOKEN <some_docker_image_name>
```
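`SYNAPSE_AUTH_TOKEN` is the environment variable `synapseclient` reads for authentication; set your token in the shell before the `docker run` above, for example:

```
export YOUR_SYNAPSE_TOKEN=<your-synapse-personal-access-token>
```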

3. Follow the **How to Run** section below

### How to Run

Getting help
```
python3 clinical.py --help
```

```
python3 maf.py --help
```

```
python3 load.py --help
```

### Outputs
This pipeline generates the following key datasets that eventually get uploaded to Synapse.
All datasets will be saved to:
`<datahub_tools_path>/add-clinical-header/<dataset_name>/` unless otherwise stated

#### maf.py

- `data_mutations_annotated.txt` – Annotated MAF file from Genome Nexus
- Generated by: `concatenate_mafs()`
- Generated by: the `generate-meta-files` code in `datahub-study-curation-tools`
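For reference, the generated meta files are small key-value text files; a `meta_mutations_extended.txt` typically looks like this (values shown are illustrative):

```
cancer_study_identifier: <study_id>
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
stable_id: mutations
show_profile_in_analysis_tab: true
profile_name: Mutations
profile_description: Mutation data for <dataset_name>
data_filename: data_mutations_annotated.txt
```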


#### clinical.py

- `data_clinical_patient.txt` – Clinical patient data file
- Generated by: `add_clinical_header()`
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: the `generate-case-lists` code in `datahub-study-curation-tools`
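For reference, `add_clinical_header()` produces the standard cbioportal clinical file layout: four `#`-prefixed metadata rows (display names, descriptions, datatypes, priorities) above the tab-delimited attribute header. A small illustrative example (columns are tab-separated in the real file):

```
#Patient Identifier	Sex	Age
#Patient identifier	Sex	Age at diagnosis
#STRING	STRING	NUMBER
#1	1	1
PATIENT_ID	SEX	AGE
PATIENT_1	Male	64
```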


#### validate.py

- `iatlas_validation_log.txt` – Results from our own iatlas validation checks for all of the files
- Generated by: each validation function, which appends its results to this log

- `cbioportal_validator_output.txt` – Validator results from cbioportal for all of the files, not just clinical
- Generated by: the `cbioportal` validator code
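Under the hood, cbioportal's offline validator is typically invoked roughly like this; the script's location inside the cloned `cbioportal` repo varies by version, so treat the path as an assumption (`-s` points at the study directory, `-n` skips portal checks):

```
python3 /<some_path>/cbioportal/core/src/main/scripts/importer/validateData.py \
-s <datahub_tools_path>/add-clinical-header/<dataset_name>/ \
-n > cbioportal_validator_output.txt
```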

#### load.py

- `cases_all.txt` – case list file for all the clinical samples in the study
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: the `generate-case-lists` code in `datahub-study-curation-tools`
- `cases_sequenced.txt` – case list file for all the sequenced samples in the study
- `<datahub_tools_path>/add-clinical-header/<dataset_name>/case-lists/`
- Generated by: the `generate-case-lists` code in `datahub-study-curation-tools`

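For reference, a case list is a small key-value text file whose `case_list_ids` field is a tab-separated list of sample IDs; `cases_all.txt` typically looks like this (IDs shown are illustrative):

```
cancer_study_identifier: <study_id>
stable_id: <study_id>_all
case_list_name: All samples
case_list_description: All samples in the study
case_list_ids: SAMPLE_1	SAMPLE_2	SAMPLE_3
```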


Any additional files are intermediate processing files and can be ignored.


### General Workflow

1. Run processing on the maf datasets via `maf.py`
2. Run processing on the clinical datasets via `clinical.py`
3. Run `load.py` to create case lists
4. Run the general validation + cbioportal validator on your output files via `validate.py`
5. Check your `cbioportal_validator_output.txt`
6. Resolve any `ERROR`s
7. Repeat steps 4-6 until all `ERROR`s are gone
8. Run `load.py` again, now with the `upload` flag, to upload to Synapse

**Sample workflow:**

Run clinical processing
```
python3 clinical.py \
--input_df_synid syn66314245 \
--cli_to_cbio_mapping_synid syn66276162 \
--cli_to_oncotree_mapping_synid syn66313842 \
--output_folder_synid syn64136279 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--lens_id_mapping_synid syn68826836 \
--neoantigen-data-synid syn21841882
```

Run maf processing
```
python3 maf.py \
--dataset Riaz \
--input_folder_synid syn68785881 \
--output_folder_synid syn68633933 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--n_workers 3
```

Create the case lists
```
python3 load.py \
--dataset Riaz \
--output_folder_synid syn64136279 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--create_case_lists
```

Run the general iatlas validation + cbioportal validator on all files
```
python3 validate.py \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--neoantigen_data_synid syn69918168 \
--cbioportal_path /<some_path>/cbioportal/ \
--dataset Riaz
```

Save into Synapse with version comment `v1`

```
python3 load.py \
--dataset Riaz \
--output_folder_synid syn64136279 \
--datahub_tools_path /<some_path>/datahub-study-curation-tools \
--version_comment "v1" \
--upload
```

### Running tests

Tests are written via `pytest`.

In your Docker environment or local environment, install `pytest` via

```
pip install pytest
```

Then run all tests via
```
python3 -m pytest tests
```
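While iterating, you can also target a single test file or test with pytest's standard `-k` filter (the file and test names here are hypothetical):

```
python3 -m pytest tests/test_clinical.py -k "add_clinical_header" -v
```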
# local/iatlas/cbioportal_export/Dockerfile

```
# uv + Python 3.10 preinstalled
FROM ghcr.io/astral-sh/uv:python3.10-bookworm
WORKDIR /root/cbioportal_export/

RUN uv venv /opt/venv
# Use the virtual environment automatically
ENV VIRTUAL_ENV=/opt/venv
# Place entry points in the environment at the front of the path
ENV PATH="/opt/venv/bin:$PATH"

# Install dependencies
COPY pyproject.toml uv.lock* ./

# Install exactly what's locked (fails if lock is out of date)
RUN uv sync --frozen --no-dev

# copy code
COPY . .

WORKDIR /root/

# clone dep repos
RUN git clone https://github.com/rxu17/datahub-study-curation-tools.git -b upgrade-to-python3
RUN git clone https://github.com/cBioPortal/cbioportal.git -b v6.3.2

WORKDIR /root/cbioportal_export/
```

PR review comments on the `datahub-study-curation-tools` clone line:

> **Contributor Author:** This will need to be updated once the PR to this repo: https://github.com/cBioPortal/datahub-study-curation-tools gets merged

> **Contributor:** Maybe we can ask Ramya and Ritika to see if the code can be reviewed, but this should probably be a Dockerfile comment rather than a PR comment. I'm unsure when that will be reviewed and merged.