
Bug: implicit namespace resolution in apply-time-mutation causes dependency sorting failure #481

Open
karlkfi opened this issue Nov 12, 2021 · 17 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@karlkfi
Contributor

karlkfi commented Nov 12, 2021

Resources:

  • namespace test
  • pod-a
    • implicit dep: namespace test
  • pod-b
    • implicit dep: namespace test
    • explicit dep: apply-time mutation with pod-a as source
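
For reference, a minimal sketch of what the manifests might look like. The pod specs, container names, and target path are assumptions; only the annotation shape and the missing sourceRef namespace matter here:

apiVersion: v1
kind: Namespace
metadata:
  name: test
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-a
  namespace: test   # implicit dep: namespace test
spec:
  containers:
  - name: nginx
    image: nginx
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
  namespace: test   # implicit dep: namespace test
  annotations:
    # explicit dep: substitute pod-a's status.podIP into pod-b at apply time;
    # note that sourceRef omits the namespace (implicit resolution)
    config.kubernetes.io/apply-time-mutation: |
      - sourceRef:
          kind: Pod
          name: pod-a
        sourcePath: $.status.podIP
        targetPath: $.spec.containers[?(@.name=="nginx")].env[?(@.name=="SERVICE_IP")].value
spec:
  containers:
  - name: nginx
    image: nginx
    env:
    - name: SERVICE_IP
      value: placeholder   # replaced by the mutation at apply time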

Output (from kpt using cli-utils master):

$ kpt live apply
namespace/test unchanged
1 resource(s) applied. 0 created, 1 unchanged, 0 configured, 0 failed
pod/pod-a created
pod/pod-b apply failed: failed to mutate "test_pod-b__Pod" with "ApplyTimeMutator": failed to read field ($.status.podIP) from source resource (/namespaces/test/Pod/pod-a): expected 1 match, but found 0)
2 resource(s) applied. 1 created, 0 unchanged, 0 configured, 1 failed
error: 1 resources failed

Problem:

  • pod-b didn't get sorted into a third ApplyTask, so when it was applied, pod-a hadn't been reconciled yet.
@karlkfi
Contributor Author

karlkfi commented Nov 12, 2021

Explicit dep with depends-on works, so it's just apply-time-mutation that doesn't (weird!).

$ kpt live apply
namespace/test unchanged
1 resource(s) applied. 0 created, 1 unchanged, 0 configured, 0 failed
pod/pod-a created
1 resource(s) applied. 1 created, 0 unchanged, 0 configured, 0 failed
pod/pod-b created
1 resource(s) applied. 1 created, 0 unchanged, 0 configured, 0 failed
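
The working variant presumably declares the dependency with a depends-on annotation on pod-b, along these lines (the reference string uses the same format as the source reference printed in the error above):

metadata:
  annotations:
    config.kubernetes.io/depends-on: /namespaces/test/Pod/pod-a

Note that the depends-on reference format spells out the namespace explicitly, which may be exactly why sorting works in this case.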

@karlkfi
Contributor Author

karlkfi commented Nov 12, 2021

apply-time-mutation works if the source namespace is specified. So it's just the implicit namespace resolution that breaks graph sorting.
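
In other words, the following shape (same assumed paths as the sketch in the first comment) sorts and applies correctly:

config.kubernetes.io/apply-time-mutation: |
  - sourceRef:
      kind: Pod
      name: pod-a
      namespace: test   # explicit namespace: graph sorting now sees the real dependency
    sourcePath: $.status.podIP
    targetPath: $.spec.containers[?(@.name=="nginx")].env[?(@.name=="SERVICE_IP")].value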

@karlkfi
Contributor Author

karlkfi commented Nov 12, 2021

Root cause:
graph.Sort only uses mutation.ReadAnnotation, which returns an empty Namespace if the user didn't specify one. The implicit namespace resolution is performed later, in ApplyTimeMutator.Mutate.
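
Concretely, the sorter and the mutator end up looking at two different source references (sketched here as fields; not actual cli-utils output):

# what graph.Sort gets from mutation.ReadAnnotation:
sourceRef: {kind: Pod, name: pod-a, namespace: ""}
# the object actually in the apply set, which doesn't match, so pod-b is not ordered after pod-a:
objMeta:   {kind: Pod, name: pod-a, namespace: "test"}
# what ApplyTimeMutator.Mutate resolves at apply time, borrowing pod-b's namespace:
resolved:  {kind: Pod, name: pod-a, namespace: "test"}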

Unfortunately, the easy solution of moving implicit namespace resolution into mutation.ReadAnnotation doesn't work, because it needs to know whether the Source and Target are both namespaced. This would require using the mapper to resolve the resource schemas. If we try to use the mapper in graph.Sort it would fail for resources that haven't had their CRD applied yet.

So this is another issue with mapping lookups happening up front instead of lazily as-needed. Changing that, to resolve sort order after every apply, would be a significant change...

Also unfortunate: we can't reject SourceRefs that lack an explicit namespace, because doing so requires knowing whether the resource is namespaced or not...

@karlkfi karlkfi changed the title Bug: dependency chain with namespace and apply-time-mutation fails Bug: implicit namespace resolution in apply-time-mutation causes dependency sorting failure Nov 12, 2021
@karlkfi
Contributor Author

karlkfi commented Nov 12, 2021

Possible workarounds:

  1. Resolve the mapping up front, if possible, and error at apply time if not
    • Add a ResourceReference field to specify whether the resource is cluster-scoped or namespace-scoped (a hypothetical shape is sketched after this list).
    • Perform implicit namespace resolution before sorting (to avoid needing to inject the mapper into graph.SortObjs).
    • If there's no schema and the namespace is empty and the resource is namespace scoped, it will still error at apply time, but all other cases will work as expected.
  2. Disable implicit namespace resolution
    • Error at apply time if namespace is empty and resource is namespace scoped (makes the apply error more actionable, but requires namespace to always be specified)
  3. Change sorting to only sort one stage at a time (dynamic task scheduling).
    • Add a SortTask after every Apply+Wait and Delete+Wait. This new task would determine the next tasks to execute and add them to the task queue.
    • Replace graph.SortObjs with a graph.Next (or similar) that just returns the next UnstructuredSet to apply/delete.
  4. Resolve the implicit namespace by checking against the UnstructuredSet being sorted
    • Look up the resource both without the namespace and with the namespace, to see if there is a match.
    • If there's no match, skip adding the graph edge and log a warning.
    • This only works for dependencies in the resource set (not external deps: [WIP] Add support for external dependencies #412)
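
As an illustration of option 1's first bullet, the sourceRef could carry an explicit scope hint. The scope field below is hypothetical and does not exist in cli-utils today:

config.kubernetes.io/apply-time-mutation: |
  - sourceRef:
      kind: Pod
      name: pod-a
      # hypothetical field: tells the sorter the source is namespaced, so an empty
      # namespace can be defaulted from the target object without consulting the RESTMapper
      scope: Namespaced
    sourcePath: $.status.podIP
    targetPath: $.spec.containers[?(@.name=="nginx")].env[?(@.name=="SERVICE_IP")].value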

@haiyanmeng
Contributor

  2. Disable implicit namespace resolution

Today, all the namespaced objects sent from Config Sync to cli-utils have the metadata.namespace field explicitly set.
Is it reasonable to expect this from other users of cli-utils?

@karlkfi
Contributor Author

karlkfi commented Nov 13, 2021

Today, all the namespaced objects sent from Config Sync to cli-utils have the metadata.namespace field explicitly set.
Is it reasonable to expect this from other users of cli-utils?

cli-utils already requires this.

The problem here is with the apply-time-mutation annotation, which was initially designed to infer the namespace of a dependency from the namespace of the dependent object. However, disabling implicit namespace resolution for apply-time-mutation does look like the easier path forward, and it would align with the requirement on the objects themselves.

@haiyanmeng
Contributor

cli-utils already requires this.

I don't think this is true. I tested with kapply built from master, which doesn't require the metadata.namespace field of a Deployment object to be explicitly set.

@karlkfi
Contributor Author

karlkfi commented Nov 16, 2021

The Applier runs the Validator, which runs the validateNamespace method, passing in the list of CRDs in the set and the mapper.

The validator should error if an object is namespace-scoped with no namespace, or cluster-scoped with a namespace.

https://github.com/kubernetes-sigs/cli-utils/blob/master/pkg/object/validate.go#L124

@haiyanmeng
Contributor

To be precise, Applier.Run requires the metadata.namespace field of a namespaced object to be explicitly set.

kapply does not require this, since it sets the metadata.namespace field of a namespaced object if it is missing.

@karlkfi
Contributor Author

karlkfi commented Nov 16, 2021

Thanks for clarifying. I don't know if having kapply do that really helps us test, but it might be trying to duplicate kpt and kubectl behavior.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 16, 2022
@karlkfi
Contributor Author

karlkfi commented Mar 25, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 25, 2022
@karlkfi
Contributor Author

karlkfi commented Mar 25, 2022

The short-term fix is to disable implicit namespace resolution, since it doesn't work as-is: #482

The long-term fix is unknown.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2022
@karlkfi
Contributor Author

karlkfi commented Jul 24, 2022

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 24, 2022