Include WorkflowMetadata in lineage records #6069

Draft: wants to merge 1 commit into base: master

@jorgee jorgee commented May 13, 2025

A first approach to including the information held in the WorkflowMetadata class in data lineage records.

  • Add revision, projectName and manifest to the Workflow class
  • Add a WorkflowRun.metadata field to store the remaining metadata that is defined (not null) at onFlowBegin and not already used in Workflow or WorkflowRun

TODO:

  • Create a new version? Manage old versions?

netlify bot commented May 13, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit da0a52b
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/68231c1dfa634600089584e8

@jorgee jorgee marked this pull request as draft May 13, 2025 10:21
@jorgee jorgee changed the title include workflow metadata in lineage records Include WorkflowMetadata in lineage records May 13, 2025
Comment on lines +78 to +82
private static List<String> workflowMetadataPropertiesToRemove = [
    "sessionId", "name", // Already in workflowRun
    "scriptFile", "scriptName", "scriptId", "repository", "commitId", "revision", "projectName", "manifest", // Already in workflow
    "stats", "success" // End
]
Member

I think it would be preferable to keep this aligned with the WorkflowMeta structure as much as possible, the same way it's done in the Tower client.

We could consider removing values that are repeated in the parent object.

Contributor Author

@jorgee jorgee May 14, 2025

Not sure if I get your comment.

If I am not mistaken, in TowerClient the workflow is a plain map containing everything in WorkflowMetadata, minus stats, plus other properties such as the resolved config and params.

If we do the same and include everything as a property of the WorkflowRun class, it will be very difficult to maintain, so I prefer to keep it inside the metadata property (or rename that property). If a new parameter is added to WorkflowMetadata, it will also appear in 'metadata' without any model update.
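
To illustrate the maintenance argument, here is a minimal sketch in Java (not the PR's actual Groovy code). It assumes WorkflowMetadata can be viewed as a plain map, which is a simplification; the class name `MetadataFilterSketch`, the method `extraMetadata`, and the sample keys are hypothetical. Only the excluded property names come from the list under review.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MetadataFilterSketch {
    // Property names copied from the exclusion list in this PR:
    // values already held by WorkflowRun or Workflow, or only known at the end.
    static final Set<String> EXCLUDED = Set.of(
        "sessionId", "name",
        "scriptFile", "scriptName", "scriptId", "repository", "commitId",
        "revision", "projectName", "manifest",
        "stats", "success");

    // Hypothetical helper: keeps every non-null entry that is not excluded.
    static Map<String, Object> extraMetadata(Map<String, Object> all) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : all.entrySet()) {
            if (e.getValue() != null && !EXCLUDED.contains(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        Map<String, Object> all = new LinkedHashMap<>();
        all.put("sessionId", "abc");                    // excluded: already in WorkflowRun
        all.put("commandLine", "nextflow run main.nf"); // kept
        all.put("container", null);                     // dropped: null at onFlowBegin
        all.put("someFutureProperty", 42);              // kept with no model change
        System.out.println(extraMetadata(all));
    }
}
```

In this sketch, `someFutureProperty` stands in for a property added to WorkflowMetadata later: it passes through to the result without touching the filter or any record class.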

A small difference is sessionId and name, which are properties of WorkflowRun and were also in the WorkflowMetadata map. I do not see a problem with keeping them in metadata and removing them from WorkflowRun. We can also add the resolved config if you think that is better.

The big difference is in the workflow and params. I think they are the parts that mainly describe the WorkflowRun, and in TowerClient they are spread across a set of properties. In workflow, we already included the information from scriptFile, scriptName, scriptId, repository and commitId. I have just added other info (revision, projectName and manifest) that I think describes the workflow more than the execution. In fact, I think the workflow description could have its own record and LID, with the WorkflowRun holding just a reference to it.
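
The idea of a separate workflow record could look roughly like the following Java sketch. Everything here is an assumption made for illustration: the class names, the field selection, the in-memory map standing in for the lineage store, and the sample LID and repository values. It is not the current lineage model.

```java
import java.util.HashMap;
import java.util.Map;

class WorkflowReferenceSketch {
    // Hypothetical standalone description of a workflow, stored once under its own LID.
    static final class Workflow {
        final String repository, commitId, revision, projectName;
        Workflow(String repository, String commitId, String revision, String projectName) {
            this.repository = repository;
            this.commitId = commitId;
            this.revision = revision;
            this.projectName = projectName;
        }
    }

    // Hypothetical run record: it references the workflow by LID instead of embedding it.
    static final class WorkflowRun {
        final String workflowLid, sessionId, name;
        WorkflowRun(String workflowLid, String sessionId, String name) {
            this.workflowLid = workflowLid;
            this.sessionId = sessionId;
            this.name = name;
        }
    }

    // Looks up the referenced workflow in a store keyed by LID.
    static Workflow resolveWorkflow(Map<String, Object> store, WorkflowRun run) {
        Object record = store.get(run.workflowLid);
        if (!(record instanceof Workflow)) {
            throw new IllegalStateException("No workflow record at " + run.workflowLid);
        }
        return (Workflow) record;
    }

    public static void main(String[] args) {
        Map<String, Object> store = new HashMap<>();
        // Made-up LID and repository values.
        store.put("lid://workflow-1",
            new Workflow("https://example.com/acme/pipeline", "da0a52b", "master", "acme/pipeline"));
        // Two runs share one workflow record.
        WorkflowRun first = new WorkflowRun("lid://workflow-1", "session-1", "run-one");
        WorkflowRun second = new WorkflowRun("lid://workflow-1", "session-2", "run-two");
        System.out.println(resolveWorkflow(store, first) == resolveWorkflow(store, second));
    }
}
```

The design trade-off in this sketch is one extra lookup per run in exchange for storing the workflow description only once, however many runs reference it.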
