Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ValidationPolicy & ValidationEvent #488

Merged
merged 1 commit into from
Jan 28, 2022

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Nov 30, 2021

  • Add ValidationPolicy:
    ExitEarly (default) - exit before apply/prune if any objects are invalid
    SkipInvalid - apply/prune valid objects & skip invalid ones
  • Add ValidationEvent to be sent once for every invalid object or set of
    objects, when the SkipInvalid policy is selected. For ExitEarly,
    return an error for reverse compatibility.
  • Add validation.Collector to simplify aggregating validation errors
    from multiple sources and extracting invalid object IDs.
  • Add invalid objects to the TestContext so they can be retained in
    the inventory (only if already present). This primarily applies to
    invalid annotations and dependencies. Objects without name or kind
    should never be added to the inventory.
  • Update Solver to use validation.Collector and filter invalid objects.
  • Add e2e test for invalid objects.
  • Update Printers to handle ValidationEvent
  • Add ExternalDependencyError & InvalidAnnotationError to make it easier
    to handle and introspect validation errors.

fixes: #484

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 30, 2021
@karlkfi karlkfi requested a review from mortent November 30, 2021 00:23
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 30, 2021
pkg/object/validator/error.go Outdated Show resolved Hide resolved
pkg/object/validator/validate.go Outdated Show resolved Hide resolved
pkg/object/validator/validate.go Outdated Show resolved Hide resolved
pkg/apply/applier.go Outdated Show resolved Hide resolved
pkg/apply/applier.go Outdated Show resolved Hide resolved
package validator

//go:generate stringer -type=ValidationPolicy
type ValidationPolicy int
Copy link
Member

@mortent mortent Nov 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems like SkipInvalid is the approach that best fits our overall approach to errors here. It essentially follows the "continue-on-error" direction, but also avoids applying resources that we know there are problems with. I think there will be some issues that will prevent us from continuing (name/namespace errors comes to mind. Also, circular dependencies isn't currently caught by the validation and it is not clear how we would skip resources in that situation), but I think moving towards making it possible to continue on as many problems as possible makes sense.
Could we simplify by making this the behavior until we have evidence that something else is needed?

I'm not convinced there is much need for the ApplyAll policy here, as it seems to suggest we'll be applying resources that we know have problems and effectively ignoring any annotations that can't be parsed.

I think ExitEarly do have some good use-cases, but I'm uncertain if this is the right way to do it. For most of the validation, we want to let users shift it left and detect these issues in their CI/CD pipelines rather than when they are ready to apply. So it seems like we want this to be available outside of a pre-step to actually applying it. Also, we already provide a dry-run flag, which should detect these issues as well as many additional validation errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it seems like SkipInvalid is the approach that best fits our overall approach to errors here.

For now, yes. Tho in the future I'd like to add change groups so we can continue on error by default, but skip whole change groups if any of their objects are invalid. With that behavior, if you don't specify any change groups, all resources are considered part of the same change group, meaning that the default behavior would be equivalent to ExitEarly. Then if you want SkipInvalid behavior you just have to define multiple independent change groups. Or you could opt into SkipInvalid for all objects by default, equivalent to giving every object its own change group.

I think there will be some issues that will prevent us from continuing

Yes. It will likely error in multiple places. I'll need to add tests for invalid input and make sure it handles them as desired.

Also, circular dependencies isn't currently caught by the validation and it is not clear how we would skip resources in that situation.

Agreed. It's not obvious what the user's desired behavior would be in the case of circular dependencies either. I think for now we can just throw an error and exit and figure out how to skip it later, if needed.

Could we simplify by making [SkipInvalid] the behavior until we have evidence that something else is needed?

@seans3 suggested using SkipInvalid as the default behavior. I could be convinced to do it that way, but I explained above why I think ExitEarly is a more forwards compatible default.

IME building continuous deployment pipelines, I have always in the past preferred exiting early and stopping deployment if there's any problem that needs human input. Validation errors are almost always user mistakes, and partial deployment can easily cause production outage. So if given the option as a user, I would choose ExitEarly behavior.

That said, I know some users have asked for continue-on-error, and especially after you've already started applying, it's not obvious if continuing or stopping would cause more brokenness. But I suspect that if we presented these same users with the change group option I suggested, they would prefer that over just continue-on-error behavior.

I think ExitEarly do have some good use-cases, but I'm uncertain if this is the right way to do it.

This is the current behavior. Changing it would violate the principle of least surprise for users. Plus, Config Sync already does it's own validation earlier and would exit early anyway. It just doesn't validate all the same things, like the dependency annotations.

For most of the validation, we want to let users shift it left and detect these issues in their CI/CD pipelines rather than when they are ready to apply. So it seems like we want this to be available outside of a pre-step to actually applying it. Also, we already provide a dry-run flag, which should detect these issues as well as many additional validation errors.

IME, dry-run is only really useful in a few cases:

  1. When applying manually and you want to be see a diff of the changes to make sure hydration didn't have any unexpected side effects.
  2. When there are multiple clients modifying the same files and you want to make sure you're not reverting someone else's changes accidentally.
  3. When applying with automation and you want a record of the diff to be stored for debugging later or the diff needs manual approval.

There's not really anything to run between dry-run and apply that makes it useful to do a dry-run in CI/CD otherwise, is there? And in all cases, you'd still want to validate again in the apply, in case they didn't run dry-run, right?

In any case, this validation code is the same code being used for both the apply and the dry-run, and I'm pretty sure we want the options and defaults to be the same for both, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed ApplyAll for now. It's not actually possible to apply resources without a kind or name. So the feature wouldn't make much sense.

I've left ExitEarly as the default, because that's the current behavior. We can change this later, if we want.

This PR primarily adds SkipInvalid and makes it opt-in.

pkg/object/graph/depends.go Outdated Show resolved Hide resolved
pkg/apply/applier.go Outdated Show resolved Hide resolved
Copy link
Contributor

@seans3 seans3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this approach to validation is misguided for a couple reasons:

  • We should emphasize --dry-run as the primary validation mechanism. Instead of attempting to figure out problems ahead of time, --dry-run runs as much of the entire apply algorithm as possible, surfacing errors throughout the algorithm. This type of validation will almost always be better than pre-run client-side validation. It looks like the validations added here for depends-on and apply-time mutation are already implemented as errors in --dry-run. I do not see much utility for these validations other than better error messages.
  • Exiting early will be especially problematic, since this makes significant assumptions about what the library client wants and differs dramatically from the underlying kubectl apply which continues applying when encountering an apply error. Exiting early will probably require manually calling the final inventory set task to ensure the prune/inventory algorithm continues to work.
  • If a library clients want stronger validation (or domain-specific validation), they can always implement it themselves before calling the library apply. This library has to work for all clients, and exiting early makes strong assumptions.

I believe the goal of client-side validation should be much more modest, such as generating a more informative error messages. If the history of Kubernetes CLI's is any guide, over-ambitious client-side validation will be jettisoned eventually for dry-run and server-side validation. But as long as there is a validation option other than Exit Early, I will not object to this PR.

@karlkfi
Copy link
Contributor Author

karlkfi commented Dec 1, 2021

@seans3 The current existing behavior already exits early if the Validator errors. I don't understand why you think that's misguided. The code is littered with hundreds of places that assume that minimal validation has been done. It's not going to be a small change to fix that to skip invalid objects (which seems to be what you want).

What this PR primarily does is adds the ability to skip those invalid objects.

I also added more explicit depends-on & apply-time-mutation errors, but that's because these are client-side annotations and the server won't validate them. So if they're invalid and the applier continues without skipping them (the current behavior), then that's behaviorally incorrect. We should never completely ignore invalid depends-on & apply-time-mutation errors. We should either skip those invalid objects or exit early. Either way, the user needs to be notified. That's why I added the ValidationError, to make sure the user always gets notified, regardless of whether they chose SkipInvalid or ExitEarly.

Now that I've said that tho, I think the ApplyAll policy is probably only useful when not using depends-on or apply-time-mutation. Otherwise the behavior when they're invalid would be obviously incorrect. ApplyAll would basically be opting out of client-side validation and ignoring errors in client-side features. That's basically what you're championing when you're saying the applier should be more like kubectl and do less validation. The big difference is that kubectl doesn't have any client-side-only features like depends-on or apply-time-mutation.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 2, 2021
@karlkfi
Copy link
Contributor Author

karlkfi commented Dec 2, 2021

I've made significant changes to the handling of dependency validation. Please re-review.

  • Move dependency validation from Validator to SortObjs.
  • Move SortObjs call from solver to applier/destroyer.
    This allows the resulting errors to be treated as validation errors.
  • Modify SortObjs to return a MultiError so that all validation errors
    can be sent and invalid objects can be skipped or cause early exit.
  • Add invalid objects to the TestContext so they can be retained in
    the inventory (only if already present). This primarily applies to
    invalid annotations and dependencies. Objects without name or kind
    should never be added to the inventory.
  • Remove invalid objects from the applySet/pruneSet, rather than
    filtering and skipping them in the apply/prune/wait tasks. This
    reduces the number of events that are redundant now with validation
    events.
  • Handle CyclicDependencyError as a validation error. It applies to
    multiple objects, instead of just one, like ValidationError.
  • Handle validation of dependency errors in destroyer as well as
    the applier.
  • Replace many -OrDie object commands that were panicing when an
    invalid object was found. Add ValidateObjMetadata to do the
    validation that the -OrDie functions were doing.
  • Replace MultiValidationError with a more generic MultiError.
  • Simplify ValidationError to optionally wrap MultiError.
  • Modify ValidationError output to be easier to read.
  • Add e2e test for invalid object handling.

I still have more tests to add.

pkg/apply/event/event.go Outdated Show resolved Hide resolved
pkg/apply/applier.go Outdated Show resolved Hide resolved
pkg/apply/applier.go Outdated Show resolved Hide resolved
pkg/object/unstructured.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 8, 2021
@karlkfi karlkfi force-pushed the karl-validation-event branch 2 times, most recently from e80d073 to 11a4451 Compare January 6, 2022 00:47
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2022
@karlkfi karlkfi force-pushed the karl-validation-event branch 2 times, most recently from 0536309 to b713a41 Compare January 6, 2022 01:05
@karlkfi karlkfi force-pushed the karl-validation-event branch from 6bd43f3 to bd28757 Compare January 24, 2022 20:07
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2022
@karlkfi
Copy link
Contributor Author

karlkfi commented Jan 24, 2022

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2022
@karlkfi karlkfi force-pushed the karl-validation-event branch from bd28757 to e053421 Compare January 25, 2022 02:05
@karlkfi
Copy link
Contributor Author

karlkfi commented Jan 25, 2022

/unhold

All TODOs finished. Ready for another review pass.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 25, 2022
@karlkfi karlkfi requested a review from mortent January 25, 2022 02:07
pkg/apply/solver/solver.go Outdated Show resolved Hide resolved
@karlkfi karlkfi force-pushed the karl-validation-event branch from e053421 to 430b203 Compare January 25, 2022 18:22
Copy link
Member

@mortent mortent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. I see that we have had some discussions on this PR about when validation should happen, but the primary feature here is that we can continue applying resources in many more situations, which I think is a direction we all agree on. Allowing early exit seems reasonable to support, not least because that is the current behavior.

/approve

pkg/apply/applier.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2022
@mortent
Copy link
Member

mortent commented Jan 25, 2022

I'll remove the approval just to make sure @seans3 also has a chance to look at it before it merges.
/approve cancel

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2022
@karlkfi karlkfi force-pushed the karl-validation-event branch 2 times, most recently from e4857ed to fae20ab Compare January 25, 2022 20:09
- Add ValidationPolicy:
  ExitEarly (default) - exit before apply/prune if any objects are invalid
  SkipInvalid - apply/prune valid objects & skip invalid ones
- Add ValidationEvent to be sent once for every invalid object or set of
  objects, when the SkipInvalid policy is selected. For ExitEarly,
  return an error for reverse compatibility.
- Add validation.Collector to simplify aggregating validation errors
  from multiple sources and extracting invalid object IDs.
- Add invalid objects to the TestContext so they can be retained in
  the inventory (only if already present). This primarily applies to
  invalid annotations and dependencies. Objects without name or kind
  should never be added to the inventory.
- Update Solver to use validation.Collector and filter invalid objects.
- Add e2e test for invalid objects.
- Update Printers to handle ValidationEvent
- Add ExternalDependencyError & InvalidAnnotationError to make it easier
  to handle and introspect validation errors.
@karlkfi karlkfi force-pushed the karl-validation-event branch from fae20ab to 9acdbce Compare January 25, 2022 20:36
Copy link
Contributor

@seans3 seans3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not believe exiting early should be the default behavior. Even if this behavior exists now, it is incorrect and should be fixed to continue on error. Exiting early is also not consistent with the kubectl apply behavior which continues on error.

@karlkfi
Copy link
Contributor Author

karlkfi commented Jan 25, 2022

I do not believe exiting early should be the default behavior. Even if this behavior exists now, it is incorrect and should be fixed to continue on error. Exiting early is also not consistent with the kubectl apply behavior which continues on error.

It's likely that ConfigSync and kpt will both initially use ExitEarly when adopting this new code. They may later switch to either hard coding SkipInvalid OR piping the option through to the user. Does that affect your default suggestion @seans3 ?

@karlkfi
Copy link
Contributor Author

karlkfi commented Jan 26, 2022

FWIW, I do not consider ExitEarly to a be a bug. Its a feature which arises from the desire to group related changes. Of course, Kubernetes doesn't have atomic applies or groups, but you can at least avoid applying a Deployment if its Secret, ServiceAccount, or ConfigMap is invalid. This behavior is highly desired when you take the time to granularly group your objects.

On the other hand, SkipInvalid is attractive when you haven't or don't intend to group your objects. This comes up regularly when multiple teams share a single config repo or delegate config management to a central SRE/DevOps/Ops team.

I'm ok with either being default. I don't think it really matters. There are pros and cons on both sides.

  • ExitEarly is reverse compatible, which is desirable for existing kpt users because they're releasing beta versions and breaking changes aren't being released as major versions yet.
  • SkipInvalid might be more desirable to new kpt users who are more familiar with kubectl.
  • Any wrapping client can select their own default... So as long as we communicate breaking changes, I'm not against them.

I'm biasing slightly towards reverse compatibility here, but if anyone else wants to chime in with a 3rd vote I don't mind making the change.

@mortent
Copy link
Member

mortent commented Jan 26, 2022

I would probably prefer that we keep the current behavior here (so ExitEarly is the default), but I don't feel strongly about it.

@haiyanmeng
Copy link
Contributor

haiyanmeng commented Jan 26, 2022

For Config Sync users, the default should be ExitEarly, which is closer to the existing validation behavior in Config Sync.

Supporting the SkipInvalid validation policy would require lots of work from Config Sync. Today, Config Sync validates the configs declared in a Git repo, and only sends the configs to cli-utils to apply after the configs pass all the validations (there are about several dozens of validations in Config Sync).

Copy link
Contributor

@seans3 seans3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: karlkfi, seans3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2022
@k8s-ci-robot k8s-ci-robot merged commit 5fb19a1 into kubernetes-sigs:master Jan 28, 2022
@karlkfi karlkfi deleted the karl-validation-event branch January 28, 2022 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: Validation Error Handling
5 participants