Contributor

@rxu17 rxu17 commented Aug 26, 2025

Problem:

Our current workflow is:

  • Intake clinical data from the Synapse folder and process it (all clinical data is available), then run the cBioPortal validator.
  • Intake MAFs from the Synapse folder and process them (availability is staggered, so not all datasets are present at once), then run the cBioPortal validator.
  • Then run the cBioPortal validator on the neoantigen, gene expression, and gene signature data (provided by a collaborator, so it isn't on Synapse).

However, the pipeline was designed so that MAFs and clinical data could be uploaded separately, and clinical data always has to go through processing before the validator can run on all files. This order of operations doesn't fit our current intake workflow.

Depends on: #122

Solution:

  • Have the pipeline process the MAF and clinical files and save them locally.
  • Have a designated script run the cBioPortal validator AND then upload all available files to Synapse.

This is because we always want to validate all of the data as a group, never individually, when using the cBioPortal validator (it's the final check that the data is ready to upload).
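The validate-as-a-group-then-upload flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script in this PR: `validate_then_upload`, `validator_cmd`, and `upload` are hypothetical names, the validator is assumed to be an external command that takes the study directory as its last argument and exits with 0 on success, and the upload step is passed in as a callable rather than calling any real Synapse API.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


def validate_then_upload(
    study_dir: Path,
    validator_cmd: Sequence[str],
    upload: Callable[[Path], None],
) -> List[Path]:
    """Validate the whole study directory once, then upload every file in it.

    Hypothetical helper: `validator_cmd` stands in for however the validator
    is invoked, and `upload` stands in for the real upload step.
    """
    # Run the validator a single time over the full study directory,
    # so the data is checked as a group rather than file by file.
    result = subprocess.run(
        [*validator_cmd, str(study_dir)],
        capture_output=True,
        text=True,
    )
    # Assumption: any non-zero exit code means the study is not ready.
    if result.returncode != 0:
        raise RuntimeError(
            f"validation failed for {study_dir} "
            f"(exit code {result.returncode}); nothing uploaded"
        )

    # Only reached when validation passed: upload all available files.
    files = sorted(p for p in study_dir.iterdir() if p.is_file())
    for path in files:
        upload(path)
    return files


if __name__ == "__main__":
    # Demo with placeholder files and a stand-in validator that always passes.
    with tempfile.TemporaryDirectory() as tmp:
        study = Path(tmp)
        (study / "clinical.txt").write_text("placeholder\n")
        (study / "mutations.maf").write_text("placeholder\n")
        always_pass = [sys.executable, "-c", "import sys; sys.exit(0)"]
        uploaded: List[Path] = []
        validate_then_upload(study, always_pass, uploaded.append)
        print([p.name for p in uploaded])
```

Keeping the upload behind a single validation gate means a failed check leaves Synapse untouched, which matches the intent that validation is the final check before upload.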

This edit also sets up the project environment with Docker, along with adjustments that allow us to use the datahub-study-curation-tools repo here. The PR to the curation tools is here: cBioPortal/datahub-study-curation-tools#67

Testing:

  • Tested regular processing for clinical and MAF data; the results match.

@rxu17 rxu17 marked this pull request as ready for review August 27, 2025 06:26
@rxu17 rxu17 requested a review from a team as a code owner August 27, 2025 06:26
WORKDIR /root/

# clone dep repos
RUN git clone https://github.com/rxu17/datahub-study-curation-tools.git -b upgrade-to-python3
Contributor Author

This will need to be updated once the PR to this repo: https://github.com/cBioPortal/datahub-study-curation-tools gets merged

Contributor

Maybe we can ask Ramya and Ritika whether the code can be reviewed, but this should probably be a Dockerfile comment rather than a PR comment. I'm unsure when that PR will be reviewed and merged.

Contributor Author

rxu17 commented Sep 2, 2025

@danlu1 Sorry, I would hold off on a review: I discovered something last week and still need to investigate and make updates. Thanks for the review so far!

Contributor Author

rxu17 commented Sep 18, 2025

@danlu1 Feel free to review; it looks like there wasn't any issue.

@rxu17 rxu17 requested a review from danlu1 September 18, 2025 02:22
* add anders dataset specific filtering, convert lens map to be string vals

* address PR comments
* initial commit for incorporating neoantigen data

* rearrange code to have a general validation script

* add tests

* remove unused code

* remove unused code

* add unit tests and docstring

* update docstring order of ops

* add indicator in logs for any error that study failed, address PR comments

@rxu17 rxu17 merged commit 7d91b10 into main Sep 29, 2025
3 checks passed
@rxu17 rxu17 deleted the refactor_pipeline branch September 29, 2025 01:44