Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post processing / ML Code - Aggregate per algorithm #160

Merged
merged 11 commits into from
Jun 29, 2024

Conversation

ntalluri
Copy link
Collaborator

No description provided.

@ntalluri
Copy link
Collaborator Author

ntalluri commented Jun 10, 2024

Failed due to ML Bug #143

changes made in ML Bug are subject to change with this pull request

Copy link
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I merged the changes from #143 and left some high level comments before reviewing the code in detail.

spras/config.py Outdated
self.analysis_include_summary = raw_config["analysis"]["summary"]["include"]
self.analysis_include_graphspace = raw_config["analysis"]["graphspace"]["include"]
self.analysis_include_cytoscape = raw_config["analysis"]["cytoscape"]["include"]
self.analysis_include_ml = raw_config["analysis"]["ml"]["include"]

if 'aggregate_per_algorithm' not in self.ml_params:
raise ValueError("The 'aggregate_per_algorithm' parameter must be set to either true or false in ml analysis parameters.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have it use false as a default so that old config files are supported?


with pytest.raises(ValueError): #raises error if empty dataframe is used for post processing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would change if we use a default value as I proposed above

include: true
# required; adds ml analysis per algorithm output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be optional and default to false?

I haven't looked through the code line-by-line yet. Is the behavior that if this is true, we will get the overall summaries as before and then additionally compute these algorithm-specified summaries? That is what I would expect.

@agitter
Copy link
Collaborator

agitter commented Jun 14, 2024

We may need to think more about the errors we added in #143 when there are not enough pathways for the ML analysis. Now that we can aggregate by algorithm, that will be a common occurrence. Some algorithms don't have parameters so there will always be a single pathway generated. Should we inspect the number of parameter combinations in the config file and only try running ML analysis for those algorithms with multiple parameter combinations?

@ntalluri ntalluri changed the title Post processing / ML Codde - Aggregate per algorithm Post processing / ML Code - Aggregate per algorithm Jun 14, 2024
@ntalluri
Copy link
Collaborator Author

ntalluri commented Jun 14, 2024

Would the snake make command --keep-going be sufficient enough and allow the error to be printed out to the command line?

I feel like it's supressing errors we want to see, but this could be an option

@agitter
Copy link
Collaborator

agitter commented Jun 14, 2024

Would the snake make command --keep-going be sufficient enough

I would rather not require users to add another argument to Snakemake at the command line in order to successfully execute fairly common configurations. It seems like it would lead to a lot of confusion if forgetting that option causes the workflow to crash, so if we can figure out a strategy for writing default outputs or avoiding calling the per-algorithm aggregation when it won't work, that would be better.

@ntalluri
Copy link
Collaborator Author

ntalluri commented Jun 17, 2024

I updated the Snakemake workflow to only run for algorithms that are using multiple parameters combinations when it comes to ml-aggregate

Copy link
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two types of end-to-end integration tests that would be nice to have, but we aren't set up well to make those kinds of tests currently. They would require running the entire workflow. Those tests:

  • given the config file config.yaml, confirm that only the expected algorithms produce ensemble files
  • confirm that one of the expected algorithm-specific ensembles (e.g. data1-ml/omicsintegrator1-ensemble-pathway.txt) is correct

Are these tests too hard in our current framework?

This design is great. I confirmed that if I add another parameter combination to the config file for MEO and then rerun Snakemake, then an ensemble pathway and ML outputs are created for MEO.

Snakefile Outdated Show resolved Hide resolved
Snakefile Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
Snakefile Show resolved Hide resolved
config/config.yaml Outdated Show resolved Hide resolved
@agitter agitter merged commit ad4da94 into Reed-CompBio:master Jun 29, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants