-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEP-0077: Partial pipeline execute. #484
Conversation
/kind tep |
/assign |
f298988
to
559f6fe
Compare
@bobcatfish, @jerop and @Tomcli may we use this PR for brain storming ideas. |
/assign |
Exciting!! Thanks for getting this going @ScrapCodes !! :D /assign |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
afcbfa2
to
74caa44
Compare
Note I'm going to be out for 2 weeks so in the meantime I'm happy with whatever @jerop decides in my place 😇 |
8759599
to
a5d37d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for picking this up @ScrapCodes, excited to see this moving forward! 😁
it's great that there's already a proposal here, but it'd be even better if it came with it's alternatives so that we can weigh different options before committing to this proposal as the way forward
if it's too much to add alternatives as well in this PR, maybe this PR can focus on the problem statement only then the next one can have the proposal and its alternatives
|
||
### Notes/Caveats (optional) | ||
|
||
Q. Can we provide an option to disable a task but not all the that depend on it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would be the use case for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of the resume failed pipeline run,
- In the proposal we are saying, we will retry failed tasks and if they have dependence on another tasks's results - we will reference results from previous runs. A user can override the results from previous Run, by redeclaring in this section.
- Suppose a failed task is permanently failing, and one would like to not retry that task but its dependents, then the user can hard code the results and execute the dependents.
My preference is to not include this in the current scope of this TEP, but a future TEP can address this.
EDIT: added a ## Future Work
section.
and resume at later point. | ||
|
||
## Requirements | ||
- Create a new `PipelineRun` to resume or retry a completed `PipelineRun`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also have a requirement for the partial execution of a pipeline
, where there's no previous pipelinerun
(represented in use case 2 above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use case 2 and the requirement 2 are of less importance for us. But, we are happy(more than happy) to pursue them, incase you think we should drop them for this proposal, then I can move the disableTasks
field under pipelineRunRef
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we figure out from the dag what to run, how does the controller knows that what it runs from (the first failure) is going to work ? what if the first part of the run are not indempotent and thus won't produce the same result ?
I am wondering how much use cases / users this affects. If this is covering about 5% of uses cases from Pipeline, then we might complexify Pipeline (syntax, code, …) for very little gain. As commented, I'd rather explore a higher level construct such as Tekton Workflow than adding this support in Pipeline itself — a simple and robust core on top of which we can build powerful abstraction. And this use case is, for me, a perfect example of something I'd rather have on top of pipeline, not in pipeline.
- Optimal use of resources:
tektoncd
as a backend for ML.
A machine learning pipeline may consist of tasks moving large amount of
data and then training ml models, all of it can be very resource consuming
and inability to retry would require a user to start the entire pipeline
over. A manual retry, with the ability to specify what tasks should
be skipped, may be helpful.
This also goes down to tektoncd
uses as a backend for ML. Is it the target, or is it a side effect ? Should we bend pipeline's core to adapt to ML, or should we build on top of it to adapt. For example, nothing prevent us to create a specific tekton ML Pipeline that would be optimized for that type of stuff and would rely on Task/TaskRun only from the core pipeline.That would enable those kind of use case without complexifying the core 👼🏼
pipelines at Kubeflow. | ||
2. It is not enough to `retry` a `PipelineTask` n times, as the failures can | ||
be due to e.g. service outage. A manual resume/ retry may be helpful. | ||
3. Iterate quickly, by disabling tasks that take longer time. This can be done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This point is at "authoring" time, the other two are at the "runtime" time
3. Partial execution is also helpful for testing, i.e. skipping some tasks | ||
and developing and testing iteratively and quickly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we talk about testing here, it's "testing tektoncd/pipeline, testing its feature or at least testing a pipeline when we write it", right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Testing a pipeline while we develop it.
3. Partial execution is also helpful for testing, i.e. skipping some tasks | ||
and developing and testing iteratively and quickly. | ||
4. Pause and resume, i.e. one could manually cancel a running `PipelineRun` | ||
and resume at later point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely sure I see the use case 😅
- `pipelineRunRef` : pipelineRunRef references a previous pipelineRun and by default | ||
selects all the failed and unfinished tasks eligible for retrying/resuming. | ||
It references results of completed tasks from previous run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is more than referenced results to take into account. What about the data that was provided (workspace
, …) ? What about required "one-shot" task that would be required for the pipeline to run (like getting a one-time credential that would need to be re-issued in case of a new run, …) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(it's handled partly by the next -
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For workspaces we could take two approaches:
- Try to maximise re-run success: if a task mounts a volume in
write
mode, and a task that depends on it (directly or indirectly) needs to be executed, that task needs to be re-run too - Try to minimise resource utilisation: only re-run failed tasks and tasks that where not executed because of interruption of the pipeline. They might fail again because of missing data on the workspace
Other resources might be missing still, like a test k8s cluster or any external resource created by initial tasks.
Having init
tasks might be one way to solve this. I think a proper solution would be to let task specify input/output resources in some format (similar to the PipelineResources
we had), so that tasks may declare what they provision and what provisioned resources they rely on.
- `pipelineRunRef.enableTasks`: If a task was successful in previous run, but | ||
it is required by the current run, this section can be used to explicitly | ||
enable it. For example, a task may perform some initialization for the | ||
other tasks in `PipelineRun`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be on the user to set right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes ! It is not possible to auto-detect it, AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be in future if we add more metadata to our pipelines and workspaces.
We could let users specify that certain tasks are initialisation ones, and need to be executed on re-run.
should be disabled in the `disableTasks` section. | ||
|
||
|
||
## Alternatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is additional alternatives, and one would be that this is handled by a higher level construct/abstraction, such as Tekton workflows for example.
## References (optional) | ||
|
||
1. [Google Doc: Disabling a Task in a Pipeline](https://docs.google.com/document/d/1rleshixafJy4n1CwFlfbAJuZjBL1PQSm3b0Q9s1B_T8/edit#heading=h.jz9jia3av6h1) | ||
2. [TEP-0065](https://github.com/tektoncd/community/pull/422) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive by request: it'd be great to include a link to tektoncd/pipeline#50 as well - we're addressing our oldest open issue in pipelines!! :D
@vdemeester are you wondering this about both of the sets of functionality being described here? like jerop mentioned the tep currently seems to be covering 2 things:
I wanted to check if your concerns are primarily about (2) or if they cover both? |
My concern is about both and trying to think if the complexity it adds to the core ( Let's assume we have 1 million users, in those 100000 users, about 5 percent might be interested by this feature, so about 5000 users. And if one those 5000 users, about 99% of the case, have smaller pipeline and an opinionated event driven setup would fullfil there case. We would add quite some complexity to the code for a very small fraction of our users (~50 out of 100000 ?) — which means more potential bugs, more confusing case, more issues, … I would make a parallel with |
@ScrapCodes: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Somehow I still feel, lower level building blocks are needed. Otherwise, even from the higher abstraction, the only options are create a new PipelineRun with desired tasks through automation. I feel having disabled task here is simpler than doing complex stuff in higher abstractions. |
@ScrapCodes are you okay with the approach where this particular addresses just adding |
@bobcatfish I agee! |
Indeed, from a higher abstraction, the options are to create a new I think my main point here is : we should explore those higher level abstraction to solve this problem, and see what are the shortcomings. |
Thank you @jerop, @bobcatfish and @vdemeester for kindly taking a look. Sorry for the long disappearance, I was on a long vacation. I am in favour of keeping the pipeline run as the basic building block, however with the current state of the features. One needs to create a completely new pipeline definition and pipelineRun definition, with the tasks copied either through some automation (the case of building a higher abstraction) or manually (until we do not have a higher abstraction). Having a support for With when expression, a pipeline needs to be pre-designed with a view that its tasks may be disabled in the future. (downside : extra verbosity) |
Also provide an option for hard-conding the results. | ||
My preference is this can be work for future TEP. | ||
|
||
### Risks and Mitigations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality would have an impact on how we generate attestations.
If an artefact is produced by a pipeline that was partially executed, or resumed, how do we track that?
One simple initial answer could be that we don't generate attestations for partially executed pipelines.
Alternatively we could capture the spec of what is actually being executed and what it's inputs were, including partial results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the artifact that the attestation was created for was produced, it still makes sense to create the attestation even if the pipeline didn't complete or was partially executed. Capturing the spec (or some reference to the spec) sounds like a nice compromise to me.
(could this happen within a retried taskrun as well?)
`disableTasks` can be used to explicitly disable tasks that a user | ||
do not wish to run. On the other hand, in `pipelineRunRef` tekton controller | ||
automatically figures out the tasks failed and unfinished, because it knows the | ||
DAG. For the end user, it can be difficult to figure out the DAG and prepare | ||
the accurate execution plan for the next pipeline run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comments here show that the various use cases, running a partial pipeline, resuming a cancelled pipeline and retrying a failed pipeline come with different priorities and requirements.
The initial TEP started by @ScrapCodes was about retrying a failed taskrun/pipelinerun.
We saw common aspects with the partial execution which lead to this TEP, but perhaps we should first do some work to spell out what are the common aspects between the various use cases / features.
In my mind I see re-running a failed / cancelled pipeline - without redoing all the work - as the main use case. For that to be useful it should be up to the controller to decide what to run, not to the user.
The partial execution use case has different needs. The user wants to skip an expensive task in the pipeline for testing purposes or because a system it depends on is temporarily not available. In this case it's up to the user to say which tasks to skip, and it's up to the controller to say if the request can be served and what are the consequences - i.e. other tasks might be skipped because of the missing ones.
- `pipelineRunRef.enableTasks`: If a task was successful in previous run, but | ||
it is required by the current run, this section can be used to explicitly | ||
enable it. For example, a task may perform some initialization for the | ||
other tasks in `PipelineRun`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be in future if we add more metadata to our pipelines and workspaces.
We could let users specify that certain tasks are initialisation ones, and need to be executed on re-run.
- `pipelineRunRef` : pipelineRunRef references a previous pipelineRun and by default | ||
selects all the failed and unfinished tasks eligible for retrying/resuming. | ||
It references results of completed tasks from previous run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For workspaces we could take two approaches:
- Try to maximise re-run success: if a task mounts a volume in
write
mode, and a task that depends on it (directly or indirectly) needs to be executed, that task needs to be re-run too - Try to minimise resource utilisation: only re-run failed tasks and tasks that where not executed because of interruption of the pipeline. They might fail again because of missing data on the workspace
Other resources might be missing still, like a test k8s cluster or any external resource created by initial tasks.
Having init
tasks might be one way to solve this. I think a proper solution would be to let task specify input/output resources in some format (similar to the PipelineResources
we had), so that tasks may declare what they provision and what provisioned resources they rely on.
@afrittoli i see partial execution as the building blocks needed to implement the use cases that this discussion was initially started around. It's also quite possible that we decide that an API for pipeline retries might belong somewhere else, e.g. Workflows - by implementing partial execution, we have flexibility around whether we then add the PipelineRun retry capability at the PipelineRun level or at the Workflow level. I also suggest we discuss this a bit so that we can be sending @ScrapCodes in a consistent direction b/c I think the feedback he's gotten so far has been to address these problems separately, addressing partial execution first, and it sounds like you don't agree. I'll add this onto the API WG agenda for next week as well. |
Thanks @bobcatfish - I agree that having more discussion on this would be useful. I'm not saying I don't agree with solving the problems separately, but I'd like to understand what is the surface of overlap between the different features. From the API point of view I think they are quite independent, but there may be controller side features shared between the two. |
I can try to explain a bit around the overlap that I see. From the TEP:
From what I can tell, When using If we were to implement only Additionally, in order to implement Does that help at all @afrittoli ? |
I am still not sold on the value of |
quick update from yesterday's API WG: added @vdemeester 's concerns to the agenda but we didn't get to them so will be first on the agenda to discuss in the next meeting on nov 1 |
Some of the points, I would like to consider.
|
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing. |
Apologies @ScrapCodes it looks like this TEP got a bit stuck after the nov 8 API working group meeting iirc @vdemeester is pushing back on moving forward with this - where do you stand regarding this @ScrapCodes ? |
@bobcatfish I am not currently actively pursuing it. |
kk thanks for the update @ScrapCodes , sounds like we could close this TEP for now then and we can re-open and keep discussing later if the need comes up again. /close |
@bobcatfish: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Thank you @bobcatfish |
Is there any further discussion on this topic? This feature is very needed in our scenario: there are many serial tasks in a pipeline. When the last task fails, automatic retry cannot recover. So the failed task needs to be retried manually. |
Summary
Add an ability for
PipelineRun
to have disabled tasks i.e. aPipelineRun
can execute aPipeline
partially.Allow
PipelineRun
to be created from previousPipelineRun
.So, a
PipelineRun
can be partially run or cancelled at run time, and resumed at a later point with the help of work proposed in this TEP.Together these will bring in the ability to resume/retry a failed
PipelineRun
/cc. @jerop @bobcatfish @Tomcli