feat: Add ValidationPolicy & ValidationEvent #488

karlkfi · 2021-11-30T00:23:35Z

Add ValidationPolicy:
ExitEarly (default) - exit before apply/prune if any objects are invalid
SkipInvalid - apply/prune valid objects & skip invalid ones
Add ValidationEvent to be sent once for every invalid object or set of
objects, when the SkipInvalid policy is selected. For ExitEarly,
return an error for reverse compatibility.
Add validation.Collector to simplify aggregating validation errors
from multiple sources and extracting invalid object IDs.
Add invalid objects to the TestContext so they can be retained in
the inventory (only if already present). This primarily applies to
invalid annotations and dependencies. Objects without name or kind
should never be added to the inventory.
Update Solver to use validation.Collector and filter invalid objects.
Add e2e test for invalid objects.
Update Printers to handle ValidationEvent
Add ExternalDependencyError & InvalidAnnotationError to make it easier
to handle and introspect validation errors.

fixes: #484

pkg/object/validator/error.go

pkg/object/validator/validate.go

pkg/apply/applier.go

mortent · 2021-11-30T04:05:12Z

pkg/object/validator/policy.go

+package validator
+
+//go:generate stringer -type=ValidationPolicy
+type ValidationPolicy int


So it seems like SkipInvalid is the approach that best fits our overall approach to errors here. It essentially follows the "continue-on-error" direction, but also avoids applying resources that we know there are problems with. I think there will be some issues that will prevent us from continuing (name/namespace errors comes to mind. Also, circular dependencies isn't currently caught by the validation and it is not clear how we would skip resources in that situation), but I think moving towards making it possible to continue on as many problems as possible makes sense.
Could we simplify by making this the behavior until we have evidence that something else is needed?

I'm not convinced there is much need for the ApplyAll policy here, as it seems to suggest we'll be applying resources that we know have problems and effectively ignoring any annotations that can't be parsed.

I think ExitEarly do have some good use-cases, but I'm uncertain if this is the right way to do it. For most of the validation, we want to let users shift it left and detect these issues in their CI/CD pipelines rather than when they are ready to apply. So it seems like we want this to be available outside of a pre-step to actually applying it. Also, we already provide a dry-run flag, which should detect these issues as well as many additional validation errors.

So it seems like SkipInvalid is the approach that best fits our overall approach to errors here.

For now, yes. Tho in the future I'd like to add change groups so we can continue on error by default, but skip whole change groups if any of their objects are invalid. With that behavior, if you don't specify any change groups, all resources are considered part of the same change group, meaning that the default behavior would be equivalent to ExitEarly. Then if you want SkipInvalid behavior you just have to define multiple independent change groups. Or you could opt into SkipInvalid for all objects by default, equivalent to giving every object its own change group.

I think there will be some issues that will prevent us from continuing

Yes. It will likely error in multiple places. I'll need to add tests for invalid input and make sure it handles them as desired.

Also, circular dependencies isn't currently caught by the validation and it is not clear how we would skip resources in that situation.

Agreed. It's not obvious what the user's desired behavior would be in the case of circular dependencies either. I think for now we can just throw an error and exit and figure out how to skip it later, if needed.

Could we simplify by making [SkipInvalid] the behavior until we have evidence that something else is needed?

@seans3 suggested using SkipInvalid as the default behavior. I could be convinced to do it that way, but I explained above why I think ExitEarly is a more forwards compatible default.

IME building continuous deployment pipelines, I have always in the past preferred exiting early and stopping deployment if there's any problem that needs human input. Validation errors are almost always user mistakes, and partial deployment can easily cause production outage. So if given the option as a user, I would choose ExitEarly behavior.

That said, I know some users have asked for continue-on-error, and especially after you've already started applying, it's not obvious if continuing or stopping would cause more brokenness. But I suspect that if we presented these same users with the change group option I suggested, they would prefer that over just continue-on-error behavior.

I think ExitEarly do have some good use-cases, but I'm uncertain if this is the right way to do it.

This is the current behavior. Changing it would violate the principle of least surprise for users. Plus, Config Sync already does it's own validation earlier and would exit early anyway. It just doesn't validate all the same things, like the dependency annotations.

For most of the validation, we want to let users shift it left and detect these issues in their CI/CD pipelines rather than when they are ready to apply. So it seems like we want this to be available outside of a pre-step to actually applying it. Also, we already provide a dry-run flag, which should detect these issues as well as many additional validation errors.

IME, dry-run is only really useful in a few cases:

When applying manually and you want to be see a diff of the changes to make sure hydration didn't have any unexpected side effects.

When there are multiple clients modifying the same files and you want to make sure you're not reverting someone else's changes accidentally.

When applying with automation and you want a record of the diff to be stored for debugging later or the diff needs manual approval.

There's not really anything to run between dry-run and apply that makes it useful to do a dry-run in CI/CD otherwise, is there? And in all cases, you'd still want to validate again in the apply, in case they didn't run dry-run, right?

In any case, this validation code is the same code being used for both the apply and the dry-run, and I'm pretty sure we want the options and defaults to be the same for both, right?

I've removed ApplyAll for now. It's not actually possible to apply resources without a kind or name. So the feature wouldn't make much sense.

I've left ExitEarly as the default, because that's the current behavior. We can change this later, if we want.

This PR primarily adds SkipInvalid and makes it opt-in.

pkg/object/graph/depends.go

pkg/apply/applier.go

seans3

I believe this approach to validation is misguided for a couple reasons:

We should emphasize --dry-run as the primary validation mechanism. Instead of attempting to figure out problems ahead of time, --dry-run runs as much of the entire apply algorithm as possible, surfacing errors throughout the algorithm. This type of validation will almost always be better than pre-run client-side validation. It looks like the validations added here for depends-on and apply-time mutation are already implemented as errors in --dry-run. I do not see much utility for these validations other than better error messages.
Exiting early will be especially problematic, since this makes significant assumptions about what the library client wants and differs dramatically from the underlying kubectl apply which continues applying when encountering an apply error. Exiting early will probably require manually calling the final inventory set task to ensure the prune/inventory algorithm continues to work.
If a library clients want stronger validation (or domain-specific validation), they can always implement it themselves before calling the library apply. This library has to work for all clients, and exiting early makes strong assumptions.

I believe the goal of client-side validation should be much more modest, such as generating a more informative error messages. If the history of Kubernetes CLI's is any guide, over-ambitious client-side validation will be jettisoned eventually for dry-run and server-side validation. But as long as there is a validation option other than Exit Early, I will not object to this PR.

karlkfi · 2021-12-01T20:32:57Z

@seans3 The current existing behavior already exits early if the Validator errors. I don't understand why you think that's misguided. The code is littered with hundreds of places that assume that minimal validation has been done. It's not going to be a small change to fix that to skip invalid objects (which seems to be what you want).

What this PR primarily does is adds the ability to skip those invalid objects.

I also added more explicit depends-on & apply-time-mutation errors, but that's because these are client-side annotations and the server won't validate them. So if they're invalid and the applier continues without skipping them (the current behavior), then that's behaviorally incorrect. We should never completely ignore invalid depends-on & apply-time-mutation errors. We should either skip those invalid objects or exit early. Either way, the user needs to be notified. That's why I added the ValidationError, to make sure the user always gets notified, regardless of whether they chose SkipInvalid or ExitEarly.

Now that I've said that tho, I think the ApplyAll policy is probably only useful when not using depends-on or apply-time-mutation. Otherwise the behavior when they're invalid would be obviously incorrect. ApplyAll would basically be opting out of client-side validation and ignoring errors in client-side features. That's basically what you're championing when you're saying the applier should be more like kubectl and do less validation. The big difference is that kubectl doesn't have any client-side-only features like depends-on or apply-time-mutation.

karlkfi · 2021-12-02T21:15:12Z

I've made significant changes to the handling of dependency validation. Please re-review.

Move dependency validation from Validator to SortObjs.
Move SortObjs call from solver to applier/destroyer.
This allows the resulting errors to be treated as validation errors.
Modify SortObjs to return a MultiError so that all validation errors
can be sent and invalid objects can be skipped or cause early exit.
Add invalid objects to the TestContext so they can be retained in
the inventory (only if already present). This primarily applies to
invalid annotations and dependencies. Objects without name or kind
should never be added to the inventory.
Remove invalid objects from the applySet/pruneSet, rather than
filtering and skipping them in the apply/prune/wait tasks. This
reduces the number of events that are redundant now with validation
events.
Handle CyclicDependencyError as a validation error. It applies to
multiple objects, instead of just one, like ValidationError.
Handle validation of dependency errors in destroyer as well as
the applier.
Replace many -OrDie object commands that were panicing when an
invalid object was found. Add ValidateObjMetadata to do the
validation that the -OrDie functions were doing.
Replace MultiValidationError with a more generic MultiError.
Simplify ValidationError to optionally wrap MultiError.
Modify ValidationError output to be easier to read.
Add e2e test for invalid object handling.

I still have more tests to add.

pkg/apply/event/event.go

pkg/apply/applier.go

pkg/object/objmetadata_set.go

pkg/object/unstructured.go

karlkfi · 2022-01-24T20:29:53Z

/hold

karlkfi · 2022-01-25T02:06:55Z

/unhold

All TODOs finished. Ready for another review pass.

pkg/apply/solver/solver.go

mortent

I think this looks good. I see that we have had some discussions on this PR about when validation should happen, but the primary feature here is that we can continue applying resources in many more situations, which I think is a direction we all agree on. Allowing early exit seems reasonable to support, not least because that is the current behavior.

/approve

pkg/apply/applier.go

mortent · 2022-01-25T19:39:45Z

I'll remove the approval just to make sure @seans3 also has a chance to look at it before it merges.
/approve cancel

- Add ValidationPolicy: ExitEarly (default) - exit before apply/prune if any objects are invalid SkipInvalid - apply/prune valid objects & skip invalid ones - Add ValidationEvent to be sent once for every invalid object or set of objects, when the SkipInvalid policy is selected. For ExitEarly, return an error for reverse compatibility. - Add validation.Collector to simplify aggregating validation errors from multiple sources and extracting invalid object IDs. - Add invalid objects to the TestContext so they can be retained in the inventory (only if already present). This primarily applies to invalid annotations and dependencies. Objects without name or kind should never be added to the inventory. - Update Solver to use validation.Collector and filter invalid objects. - Add e2e test for invalid objects. - Update Printers to handle ValidationEvent - Add ExternalDependencyError & InvalidAnnotationError to make it easier to handle and introspect validation errors.

seans3

I do not believe exiting early should be the default behavior. Even if this behavior exists now, it is incorrect and should be fixed to continue on error. Exiting early is also not consistent with the kubectl apply behavior which continues on error.

karlkfi · 2022-01-25T20:46:36Z

I do not believe exiting early should be the default behavior. Even if this behavior exists now, it is incorrect and should be fixed to continue on error. Exiting early is also not consistent with the kubectl apply behavior which continues on error.

It's likely that ConfigSync and kpt will both initially use ExitEarly when adopting this new code. They may later switch to either hard coding SkipInvalid OR piping the option through to the user. Does that affect your default suggestion @seans3 ?

karlkfi · 2022-01-26T00:51:20Z

FWIW, I do not consider ExitEarly to a be a bug. Its a feature which arises from the desire to group related changes. Of course, Kubernetes doesn't have atomic applies or groups, but you can at least avoid applying a Deployment if its Secret, ServiceAccount, or ConfigMap is invalid. This behavior is highly desired when you take the time to granularly group your objects.

On the other hand, SkipInvalid is attractive when you haven't or don't intend to group your objects. This comes up regularly when multiple teams share a single config repo or delegate config management to a central SRE/DevOps/Ops team.

I'm ok with either being default. I don't think it really matters. There are pros and cons on both sides.

ExitEarly is reverse compatible, which is desirable for existing kpt users because they're releasing beta versions and breaking changes aren't being released as major versions yet.
SkipInvalid might be more desirable to new kpt users who are more familiar with kubectl.
Any wrapping client can select their own default... So as long as we communicate breaking changes, I'm not against them.

I'm biasing slightly towards reverse compatibility here, but if anyone else wants to chime in with a 3rd vote I don't mind making the change.

mortent · 2022-01-26T20:53:50Z

I would probably prefer that we keep the current behavior here (so ExitEarly is the default), but I don't feel strongly about it.

haiyanmeng · 2022-01-26T21:05:57Z

For Config Sync users, the default should be ExitEarly, which is closer to the existing validation behavior in Config Sync.

Supporting the SkipInvalid validation policy would require lots of work from Config Sync. Today, Config Sync validates the configs declared in a Git repo, and only sends the configs to cli-utils to apply after the configs pass all the validations (there are about several dozens of validations in Config Sync).

seans3

/lgtm

k8s-ci-robot · 2022-01-27T23:53:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: karlkfi, seans3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [seans3]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 30, 2021

karlkfi requested a review from mortent November 30, 2021 00:23

k8s-ci-robot requested review from haiyanmeng and seans3 November 30, 2021 00:23

k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 30, 2021

haiyanmeng reviewed Nov 30, 2021

View reviewed changes

mortent reviewed Nov 30, 2021

View reviewed changes

seans3 reviewed Dec 1, 2021

View reviewed changes

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 2, 2021

karlkfi requested review from seans3, haiyanmeng and mortent December 2, 2021 21:16

karlkfi mentioned this pull request Dec 2, 2021

Feature Request: allow apply to ignore some resources #393

Closed

karlkfi force-pushed the karl-validation-event branch from 63c2cfc to da3f7de Compare December 3, 2021 01:33