Collapse the INVALID_ARGUMENTS error rows and clarify #597
Conversation
For ControllerModifyVolume, clarify that INVALID_ARGUMENT is an infeasible error and that the SP has not made any changes, so the CO does not need to retry.
lgtm
This could be a breaking change for SPs, e.g. AWS does not comply with this today.
In CSI this feature is alpha, which is why it is important that we tighten the wording before this feature goes GA in CSI.
I'm afraid this may be too much to ask of SPs. In the previous AWS case, whether the parameter is invalid is determined by the server, and cannot be determined solely in the CSI driver. Combined with multi-stage modification, we may not be able to verify all parameters of all stages before making any changes. I don't have such a specific use case yet, but I expect we will have one in the future. So if this change is applied, we will not be able to return "invalid argument" for the second stage. We could replace it with an Internal error, but IMO this is not a good design, because an SP could easily implement this wrong.
SP can still return |
I don't think it's a large burden on the SP to force it to validate ALL of the parameters up front. Obviously this requires a round-trip to the actual server, but I can't think of a reason that the server couldn't validate all of the parameters as a unit. If anything is found to be invalid, it should be reported before any changes are made. Once any changes are made, the SP either needs to reverse them, or return some error other than INVALID_ARGUMENT to indicate that it got stuck and retries are needed to move forward. It's unreasonable to expect COs to deal with a situation where the user attempted to do something invalid, and the SP put the volume into an inconsistent state as a result, and the CO can't even detect whether it should retry or give up.
lgtm
Take AWS EBS as an example. It needs to modify both the disk and the tags, and obviously the parameters to both stages can contain invalid values. The current implementation modifies the tags first; after that succeeds, it modifies the disk. So if the CSI driver restarts after the tags have been modified but before the disk, and the new CSI process then finds an invalid argument while modifying the disk, how can it roll back the changes to the tags? It does not know the tag values before modification. Maybe the AWS folks can swap the order and modify the disk first, then the tags. In that case, if the tags contain an invalid argument, the CSI driver cannot roll back the changes at all, because we need to wait 6 hours to modify the disk again. Maybe the CSI driver can validate the tags locally before doing anything, but we cannot be 100% sure the local validation logic is in sync with the server.
Not one round-trip, but multiple: 3 for AWS, 4 for Alibaba Cloud currently. The server only receives some of the parameters in any single round-trip.
That is ideal. But do you expect a CSI driver to implement an undo log system for this, as databases usually do? It is way too complex, and a CSI driver usually does not have access to persistent storage itself. It is generally very hard to make a workflow in a complex distributed system transactional. The best we can promise is idempotency, so that the workflow can be retried.
I'm not saying that error handling needs to be perfect. It's clearly difficult to reliably implement a multi-stage change. What I'm saying is that the SP can only use the INVALID_ARGUMENT return code when it's safe for the CO to abandon the operation and no more retries are needed to get the volume into a good state. SPs should endeavor to do as much error checking up front as possible. Once the error checking has passed, and the parameters are deemed valid, then the SP should start mutating things. At that point, any failures should return some other error code (probably INTERNAL) if things go wrong. It's too late to inform the CO that a parameter is invalid once the SP starts making changes. In this case the CO will continue retrying so the SP can recover from various faults and continue to make forward progress. There is NOT an expectation that the SP can roll back from a failure. The general expectation is to roll forward using idempotent retries until success can be achieved. If the SP does have a way to roll backward and undo a change, that is also acceptable, but in practice very few implementations do that. If the SP cannot do sufficient error checking before starting to make changes, that should be viewed as a flaw in the underlying system, and the underlying system should offer a verification mechanism for proposed parameters. I would recommend against supporting any parameter in a VAC for which up-front verification can't be done.
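To make the shape being asked for concrete, here is a minimal Go sketch of an SP-side ControllerModifyVolume following this pattern. It uses a simplified request type rather than the generated CSI bindings, and validateAllStages, applyTags, and applyDiskChange are hypothetical stand-ins for whatever round-trips a real driver makes:

```go
package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ModifyRequest is a simplified stand-in for the CSI ControllerModifyVolumeRequest.
type ModifyRequest struct {
	VolumeID          string
	MutableParameters map[string]string
}

// The three helpers below are hypothetical; a real driver would round-trip to
// its storage backend (tags API, disk API, ...) inside them.
func validateAllStages(ctx context.Context, id string, p map[string]string) error { return nil }
func applyTags(ctx context.Context, id string, p map[string]string) error         { return nil }
func applyDiskChange(ctx context.Context, id string, p map[string]string) error   { return nil }

// ControllerModifyVolume validates every parameter of every stage before it
// mutates anything. INVALID_ARGUMENT is returned only while the volume is
// still untouched; once any change has been applied, failures become INTERNAL
// so the CO keeps retrying until the volume converges.
func ControllerModifyVolume(ctx context.Context, req *ModifyRequest) error {
	// Phase 1: validation only. Safe to return INVALID_ARGUMENT here.
	if err := validateAllStages(ctx, req.VolumeID, req.MutableParameters); err != nil {
		return status.Errorf(codes.InvalidArgument, "invalid mutable parameters: %v", err)
	}

	// Phase 2: mutation. From here on, never return INVALID_ARGUMENT.
	if err := applyTags(ctx, req.VolumeID, req.MutableParameters); err != nil {
		return status.Errorf(codes.Internal, "applying tags failed, retry needed: %v", err)
	}
	if err := applyDiskChange(ctx, req.VolumeID, req.MutableParameters); err != nil {
		return status.Errorf(codes.Internal, "applying disk change failed, retry needed: %v", err)
	}
	return nil
}

func main() {
	req := &ModifyRequest{VolumeID: "vol-123", MutableParameters: map[string]string{"iops": "3000"}}
	if err := ControllerModifyVolume(context.Background(), req); err != nil {
		fmt.Println("modify failed:", err)
		return
	}
	fmt.Println("modify succeeded")
}
```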
spec.md (Outdated)
@@ -1711,8 +1711,7 @@ message ControllerModifyVolumeResponse {

| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Parameters not supported | 3 INVALID_ARGUMENT | Indicates that the CO has specified mutable parameters not supported by the volume. | Caller MAY verify mutable parameters. |
| Exceeds capabilities | 3 INVALID_ARGUMENT | Indicates that the CO has specified capabilities not supported by the volume. | Caller MAY verify volume capabilities by calling ValidateVolumeCapabilities and retry with matching capabilities. |
| Parameters not supported | 3 INVALID_ARGUMENT | Indicates that the CO has specified invalid mutable parameter keys or values, or the specified volume cannot support the specified parameters. | Caller MAY verify volume capabilities by calling ValidateVolumeCapabilities and retry with matching capabilities. Indicates that no changes were made by the SP. |
Yes, I agree that if we accept this PR, we could replace an INVALID_ARGUMENT returned halfway through with INTERNAL. It is viable. But do note that INVALID_ARGUMENT also has the special meaning that external-resizer will retry at a lower frequency. We will lose that if we use INTERNAL.
| Parameters not supported | 3 INVALID_ARGUMENT | Indicates that the CO has specified invalid mutable parameter keys or values, or the specified volume cannot support the specified parameters. | Caller MAY verify volume capabilities by calling ValidateVolumeCapabilities and retry with matching capabilities. Indicates that no changes were made by the SP. |
| Parameters not supported | 3 INVALID_ARGUMENT | Indicates that the CO has specified invalid mutable parameter keys or values, or the specified volume cannot support the specified parameters. | Caller MAY verify volume capabilities by calling ValidateVolumeCapabilities and retry with matching capabilities. Indicates that no changes were made by the SP in this call and any previous calls with the same arguments. |
I think we should add this to make it safe for the CO to abandon the operation after many retries.
Even if we add the words suggested above, it is still not perfectly safe to abandon the operation. For example, as I mentioned in the Slack thread:
- Modify from A to B: partially applied, INTERNAL is returned by the SP
- Modify again to C: INVALID_ARGUMENT is returned by the SP
Now, although the SP is fully complying with the CSI spec as suggested here, we cannot safely abandon the operation because the volume is in a half-A, half-B state.
The "slow retry" behavior is not required by CSI, it's just what Kubernetes decides to do when the error is infeasible, because Kubernetes is a declarative system.
If you return INTERNAL, you will get a normal (faster) retry, which is presumably desirable if the operation was merely interrupted and can proceed with another attempt.
For situations where the SP is truly stuck and there was no way to detect that it would get stuck in advance, the only remedy is feedback to the user/operator and someone making a change to the request, or going to the underlying system and fixing some underlying problem that's invisible to the CO/SP.
In the case of VAC, the possible fixes are:
- The user changes the PVC's VAC back to the old one, or some other one and expects it to reconcile successfully
- An admin creates a new VAC with correct parameters (fixing whatever the problem was with the given VAC) and sets the PVC to that one, and expects it to reconcile successfully
- An operator goes straight to the underlying storage system and fixes whatever is preventing the original mutation
Should the CSI spec also add points suggesting to the SP the recommended paths to recover from these errors, like what Ben listed?
In the CSI spec we try to be very hands-off on how the SPs achieve their goals. All we care about is the contract between them and the CO. This is frustrating to SP authors, but it's genuinely challenging to write a spec that is both as flexible as CSI and as reliable as the Kubernetes-side implementation of the spec.
Consider that this feature was originally envisioned as a way to set QoS policies on volumes and change them at a later time. I think it's pretty obvious how one would validate QoS policies and then mutate them idempotently. Other features that follow the same pattern are a good fit for this feature.
If there are use cases where the simple QoS policy pattern doesn't fit, we need to discuss those concrete use cases, otherwise, we assume that SPs can do whatever they want and they own all the problems. The specific issue here is that the CO has a problem and the only way to resolve it is to be more prescriptive in the spec, which would be a breaking change if it were not alpha, so we're trying to do it now.
Also, how about we use some stronger wording:
Indicates that one or more arguments in the request is invalid. The SP MUST NOT
have applied any modification to the volume as part of this specific call. A caller
may verify volume capabilities by calling ValidateVolumeCapabilities and then retry
the request with valid arguments.
And about the case @huww98 raised:
- Modify from A to B: partially applied, INTERNAL is returned by SP
- Modify again to C: INVALID_ARGUMENT is returned by SP
The initial non-atomic failure has put the volume into an undefined and potentially unrecoverable state. When the modification from A to B failed and was only partially applied, the SP violated the spec's requirement for atomicity. The volume is now in a state (A') that was never intended to exist. This A' state is effectively "corrupt" from the perspective of the state machine. It doesn't match the last known successful configuration (A), nor does it match the target configuration (B).
For these cases I think the inescapable conclusion is "manual intervention is required". Do we want to write that into the spec, though? Because specs should focus on defining correct behavior and prevention, not cure.
Yes, I agree that if we accept this PR, we could replace an INVALID_ARGUMENT returned halfway through with INTERNAL. It is viable. But do note that INVALID_ARGUMENT also has the special meaning that external-resizer will retry at a lower frequency. We will lose that if we use INTERNAL.
I think we should add this to make it safe for the CO to abandon the operation after many retries.
Even if we add the words suggested above, it is still not perfectly safe to abandon the operation. For example, as I mentioned in the Slack thread:
1. Modify from A to B: partially applied, INTERNAL is returned by the SP
2. Modify again to C: INVALID_ARGUMENT is returned by the SP
Now, although the SP is fully complying with the CSI spec as suggested here, we cannot safely abandon the operation because the volume is in a half-A, half-B state.
If for some reason A -> B passes validation but still fails for unexpected reasons (i.e. an INTERNAL error), my understanding is that the CO will continue retrying the modification to B; but if the user changes the VAC to C, then the CO will accept that if C contains all the mutable parameter keys that B did. If C doesn't contain all the keys that B did, it will keep trying to modify to B (this is for safety within the quota system).
If the SP rejects the modification to C with INVALID_ARGUMENT, then I suspect it gives up and goes back to trying to modify to B, because the CO interprets that error as if the VAC itself is invalid.
The point is that INVALID_ARGUMENT is a signal that the VAC itself is not usable, at least for that volume, and INTERNAL is a signal that the VAC was fine, and the underlying system has some problem, in which the normal kubernetes approach is to keep trying.
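A minimal sketch of how a CO-side controller might act on these two signals; the classification helper and the retry intervals are illustrative assumptions, not what external-resizer actually implements:

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryDecision is a hypothetical classification a CO could apply to a failed
// ControllerModifyVolume call.
type retryDecision struct {
	infeasible bool          // the target VAC is unusable for this volume
	backoff    time.Duration // how soon to try again, if at all
}

func classifyModifyError(err error) retryDecision {
	switch status.Code(err) {
	case codes.InvalidArgument:
		// With this PR: no changes were made, so the CO may surface the error
		// to the user and retry only slowly, if ever.
		return retryDecision{infeasible: true, backoff: 10 * time.Minute}
	case codes.Internal, codes.Unavailable, codes.DeadlineExceeded:
		// The request itself was fine; keep reconciling at the normal rate.
		return retryDecision{infeasible: false, backoff: 10 * time.Second}
	default:
		return retryDecision{infeasible: false, backoff: 30 * time.Second}
	}
}

func main() {
	err := status.Error(codes.InvalidArgument, "unsupported performanceLevel")
	fmt.Printf("%+v\n", classifyModifyError(err)) // {infeasible:true backoff:10m0s}
}
```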
We are saying that if there are two parts, you have to validate both parts before applying part 1. We're discouraging SPs from doing a sequence of validate, apply, validate, apply, etc.
I understand this. I'm trying to say that the CSI spec needs to be precise; we need to formalize this into the spec. Just discouraging it is not enough IMO.
the CO will accept that if C contains all the mutable parameter keys that B did
This is a solution, but not a good one IMO. The CSI spec says "COs will treat these (mutable_parameters) as opaque", which I think conflicts with this rule. In practice, consider this case:
- User modifies to VAC A with {"type": "cloud_essd", "performanceLevel": "PL3"} and it fails with an INTERNAL error (out of stock, for example).
- User then tries to recover by modifying again to VAC B with {"type": "cloud_auto"}, where cloud_auto does not have a performanceLevel parameter.

With your plan, the CO should not accept B. But this is valid and safe from the SP's point of view.
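For illustration, the key-subset rule being debated could look like the hypothetical check below (not actual external-resizer code); with A = {"type": "cloud_essd", "performanceLevel": "PL3"} and B = {"type": "cloud_auto"} it rejects B exactly as described, even though B is safe from the SP's point of view:

```go
package main

import "fmt"

// coversAllKeys reports whether the recovery target contains every mutable
// parameter key of the failed target -- the CO-side guard discussed above
// (a hypothetical helper, not actual external-resizer code).
func coversAllKeys(failedTarget, recoveryTarget map[string]string) bool {
	for key := range failedTarget {
		if _, ok := recoveryTarget[key]; !ok {
			return false
		}
	}
	return true
}

func main() {
	a := map[string]string{"type": "cloud_essd", "performanceLevel": "PL3"}
	b := map[string]string{"type": "cloud_auto"}
	fmt.Println(coversAllKeys(a, b)) // false: B omits performanceLevel, so the CO would reject it
}
```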
If the SP rejects the modification to C with INVALID_ARGUMENT, then I suspect it gives up and goes back to trying to modify to B, because the CO interprets that error as if the VAC itself is invalid.
This can be viable. But I suspect it would be hard for the CO to implement. Consider this:
- User modifies to VAC A; it fails with INTERNAL, because the credential used to communicate with the storage backend is not correctly configured.
- User then modifies to VAC B, C, D, then E, in sequence; all fail with INTERNAL.
- Cluster admin corrects the credential.
- CO retries E; it fails with INVALID_ARGUMENT.
- CO gives up on E, then retries D; it also fails with INVALID_ARGUMENT.
- CO retries C, B, A in sequence; all fail with INVALID_ARGUMENT.
- Now the CO is safe to abandon this operation.

As you can see, in order to implement this, the CO would need to persist a potentially unbounded stack of VACs it has tried. We don't have a place to store this in the Kubernetes PVC.status today.
I understand this. I'm trying to say that the CSI spec needs to be precise; we need to formalize this into the spec. Just discouraging it is not enough IMO.
We are being as precise here as the CSI spec is able to be. We have limited control over what SPs and COs really do. The power of the spec is in specifying the contract, and I feel that saying the SP must not have modified the volume if it returns INVALID_ARGUMENT is as good as we can do.
This is a solution, but not a good one IMO. The CSI spec says "COs will treat these (mutable_parameters) as opaque", which I think conflicts with this rule. In practice, consider this case:
I agree with you that this CO behavior violates the spirit of opaqueness in the spec, but I'm expressing what the CO already does, and how it's trying to cope with a specific error recovery problem.
This can be viable. But I suspect it would be hard for the CO to implement. Consider this:
1. User modifies to VAC A; it fails with INTERNAL, because the credential used to communicate with the storage backend is not correctly configured.
2. User then modifies to VAC B, C, D, then E, in sequence; all fail with INTERNAL.
3. Cluster admin corrects the credential.
4. CO retries E; it fails with INVALID_ARGUMENT.
5. CO gives up on E, then retries D; it also fails with INVALID_ARGUMENT.
6. CO retries C, B, A in sequence; all fail with INVALID_ARGUMENT.
7. Now the CO is safe to abandon this operation.

As you can see, in order to implement this, the CO would need to persist a potentially unbounded stack of VACs it has tried. We don't have a place to store this in the Kubernetes PVC.status today.
@gnufied should respond to this hypothetical. My intuition is that it would be bad SP behavior to respond with INTERNAL first and then INVALID_ARGUMENT later for the same volume/mutable parameters pair, but I can imagine how it could happen without technically violating the spec, and it does seem to suggest that the CO's current recovery scheme is not foolproof. Assuming the spec wording doesn't change and we can't create better CO behavior, my advice to SP authors would be to never do the above -- make sure to return INVALID_ARGUMENT either immediately or never.
Currently the CO (k8s) doesn't look into mutable_parameters at all. I was proposing that as a theoretical option in case we were to guard against partial application of VACs (like a bulletproof option at the cost of flexibility).
Basically, at this point we concede that recovering to a mutable_parameters set that is a subset of the proposed transition is lossy, and it is possible that the volume will be left in a state where one or more of the parameters were applied. This is why we don't even call it recovery in the CO; we are proposing this merely as a way to cancel an ongoing ControllerModifyVolume OP. The only way the CO can solve this is by preventing recovery to a subset of mutable_parameters.
- User modifies to VAC A; it fails with INTERNAL, because the credential used to communicate with the storage backend is not correctly configured.
- User then modifies to VAC B, C, D, then E, in sequence; all fail with INTERNAL.
- Cluster admin corrects the credential.
- CO retries E; it fails with INVALID_ARGUMENT.
- CO gives up on E, then retries D; it also fails with INVALID_ARGUMENT.

The CO will not automatically retry recovery to any of the failed states unless the user takes some action to recover. I am not sure where this is coming from.
My intuition is that it would be bad SP behavior to respond with INTERNAL first and then INVALID_ARGUMENT later for the same volume/mutable parameters pair
In general not great, but the CO does tolerate this.
Basically, at this point we concede that recovering to a mutable_parameters set that is a subset of the proposed transition is lossy, and it is possible that the volume will be left in a state where one or more of the parameters were applied. The only way the CO can solve this is by preventing recovery to a subset of mutable_parameters
We can also ask the cluster admin to take responsibility for this. We can write this in the docs: if you want your quota to work as expected, you should fully specify the volume state in each VAC. This is required even if every modification succeeds. e.g. if VAC A does not specify IOPS, but VAC B specifies a very high IOPS, a user can just switch to B and then back to A to get free IOPS.
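A toy sketch of why partially specified VACs make the effective volume state history-dependent; the merge semantics here (omitted keys are left unchanged) are an assumption about typical SP behavior, not something the spec mandates:

```go
package main

import "fmt"

// apply merges the VAC's mutable parameters into the volume's current state,
// leaving any omitted key untouched (an assumed, but common, SP behavior).
func apply(state, params map[string]string) map[string]string {
	next := map[string]string{}
	for k, v := range state {
		next[k] = v
	}
	for k, v := range params {
		next[k] = v
	}
	return next
}

func main() {
	vacA := map[string]string{"type": "cloud_essd"}                 // does not specify IOPS
	vacB := map[string]string{"type": "cloud_essd", "iops": "2000"} // high IOPS

	state := apply(map[string]string{}, vacA)
	state = apply(state, vacB)
	state = apply(state, vacA) // switch back to A
	fmt.Println(state["iops"]) // "2000": the volume keeps B's IOPS while only A counts against quota
}
```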
The CO will not automatically retry recovery to any of the failed states unless the user takes some action to recover. I am not sure where this is coming from.
@bswartz proposed "I suspect it gives up and goes back to trying to modify to B", and I'm trying to generalize that "goes back" to "recursively goes back" when there are even more VACs involved.
My intuition is that it would be bad SP behavior to respond with INTERNAL first and then INVALID_ARGUMENT later for the same volume/mutable parameters pair
@bswartz for example, if the SP got a "network connection timeout" while trying to verify the arguments, I think the best it can do is return INTERNAL.
But why do we need to abandon the operation in the first place? If modifying from A to B failed, and the user requests A from the CO again, the CO then requests A again from the SP; it is simple and easy. And we wouldn't need this PR, or to ask AWS to make a change. In the "recover from expansion failure" feature, we also never abandon an expansion operation. We only request a smaller size, and only assume the volume is in a good state when we get a successful ControllerExpandVolume. Can we also do the same for ControllerModifyVolume?
This is a decision for the SP to make. The purpose of the CSI spec is to tell the SP and CO how to behave in response to errors. We are trying to make it clear that if the SP wants the CO to give up, it should return INVALID_ARGUMENT, because it means that there is no more work to do. If the SP wants the CO to retry, it needs to return a non-terminal error like INTERNAL so another retry will be made. We can't have a situation where INVALID_ARGUMENT is ambiguous to the CO, because the CO is going to implement exactly one response to that error code. What we prefer is to treat that error as an infeasible error, and in order for that to be safe, we have to specify that it's not okay to return it after changes have already been made that could not be rolled back. We expect SPs to comply with this behavior, or else suffer from bad behavior, such as slow retries or possibly no retries at all.
It's better not to think of a VAC change as a single operation. Think about the actual state and desired state, and one or more reconciliation operations to transform from actual to desired. Each call to
For context: kubernetes-csi/external-resizer#514
Okay, @gnufied explained to me the point of the target VAC and the need for rollback support, because of quotas and concerns about exploitation of the quota system in case of failures. Based on that understanding I still think this CSI spec change makes sense, and non-conforming drivers will need to update.
Yes, it is totally valid. I want the CO to never abandon an operation without notifying the SP. I want the CO to treat the actual state as unknown when it starts the first
But if we really want to proceed with this PR, we should be careful about the cases mentioned in #597 (comment) and #597 (comment)
Lgtm
The EBS CSI Driver returning infeasible despite modifying tags is a bug that we put on our backlog to fix.
Thank you for clarifying this, spec change lgtm.
/lgtm
I think the spirit of this change is good. But even if it is merged, it will not resolve any issue we are facing. If I understand correctly, this PR is intended to make it safe for the CO to abandon a modify operation without notifying the SP. Consider the following circumstances when modifying from state A, where the user wants to return to state A:
With this change, we are only able to safely abandon the operation in case 1, which is the simplest. But if we do implement this feature in the CO, we really want to also support cases 2 and 3, because we don't want a random component restart or network error to break this feature. Cases 4 and 5 are nice-to-have, though; the CO may just not support canceling in those cases. So the CO really cannot take much advantage of this new promise from the SP. IMO we should either make a stronger requirement of the SP, or not make this change and leave more freedom to the SP. While I've asked this before, it is still unanswered: why would the CO want to abandon the operation in the first place? Previously, we called ControllerUnpublishVolume to cancel a failed ControllerPublishVolume; we called ControllerExpandVolume again with a different size to cancel a failed ControllerExpandVolume. All worked well. Why can't we just do the same and call ControllerModifyVolume again with different parameters to cancel a failed ControllerModifyVolume? Do you want to avoid an extra gRPC call and save 100ms?
@bswartz this is your previous reply. I'm not sure, but are you supporting my opinion that we never need to abandon the operation? If the CO failed to reconcile to the desired state B and the user sets the new desired state back to A, can we just reconcile to state A again, instead of assuming the volume was not modified from state A and abandoning the operation?
I think I have answered this on Slack, but I will try to answer again. If we were to design this feature with the requirement that you can only recover to another VAC, and if we want this to work like "true" recovery with no partial state left on the volume from which it is being "recovered", then we would have to design this whole feature from the ground up.
Luckily, volume resizing does one thing and only one thing, so it doesn't have this set of challenges.
I think we can just leave this to documentation. This is about VAC and quota, not the CSI spec. Currently, even without considering the recovery scenario, we already rely on the VAC fully specifying the volume state to make quota useful. e.g.
A user can switch from A to B and then back to A to get 2000 IOPS without using their quota on VAC B. So this issue is not specific to recovering.
I admit "recover to nil" case is special, and it is better to have cancelling mechanism. But I've thought about this for several hours these days, considered multiple alternatives of wording, but cannot get a satisfying solution. I found this line in CSI spec:
Although this is under Timeout subtitle, but this makes me think: cancelling is just not the correct way to go in CSI So how about just not rollback to nil VAC? When user set PVC.spec.VACName to nil, we just stop reconciling for this PVC, leave it at InProgress or Infeasible state. Then user will have several options:
Yes. ControllerExpandVolume only have one size parameter, so it is always fully specified. And it also returns the size after expansion, so that even if CO request smaller size when the volume is actually larger, CO will know the actual size. |
As per our discussion in the CSI implementation meeting, error handling for each of the cases is now documented in kubernetes/enhancements#5462. This change to the CSI spec should IMO be merged, because while it is not perfect, it is better than any other alternative.
What this PR does / why we need it:
This clarifies that the INVALID_ARGUMENT error code for ControllerModifyVolume is safe to not retry (i.e. "infeasible")
Does this PR introduce an API-breaking change?: