Try out if kro could be feasible as deployment-tool #164


Closed
9 of 10 tasks
frewilhelm opened this issue Mar 26, 2025 · 11 comments

@frewilhelm
Contributor

frewilhelm commented Mar 26, 2025

Description

As described in #149, the ocm-controllers require something to deploy OCM resources. The v1 controllers used a custom operator, FluxDeployer, to achieve this.

We want to try out whether kro could be our deployer for the v2 controllers, as kro is a resource orchestration tool that offers several ways to alter and configure values in deployments. In combination with FluxCD, kro could cover our deployment use-cases.

We need the following use-cases to work:

  • simple helm: Use FluxCD to deploy a Helm chart contained in an OCM resource

  • simple kustomize: Use FluxCD to deploy a kustomization contained in an OCM resource

  • simple configuration helm: Use FluxCD to deploy a Helm chart and configure values inside the Helm chart

  • simple configuration kustomization: Use FluxCD to deploy a kustomization and configure values inside the manifests

  • A complicated one containing

    • configuration
    • localisation
    • instance-spec parameters
    • passing a secret
    • (Also include the replication)
  • RGD in CV

    • Requires new operator, e.g. OCMDeployer that deploys the RGD

How would we recommend using the kro resources?

  • ???

How can resources be upgraded/updated?

Timebox: 3 day(s)

Done Criteria

  • Estimation of impact on existing code incl. tests
  • Estimation of impact on Enduser Documentation updated (if applicable)
  • Estimation of impact on Internal technical Documentation created/updated (if applicable)
  • Created refinable tasks for the actual implementation
@frewilhelm frewilhelm moved this from 🆕 ToDo to 🏗 In Progress in OCM Backlog Board Mar 26, 2025
@frewilhelm frewilhelm added needs/validation Validate the issue and assign a priority needs/refinement Discuss with the team and gain a shared understanding labels Mar 26, 2025
@ikhandamirov ikhandamirov removed needs/validation Validate the issue and assign a priority needs/refinement Discuss with the team and gain a shared understanding labels Mar 27, 2025
@frewilhelm
Contributor Author

Estimations

Estimation of impact on existing code incl. tests

  • Configuration and localisation could be replaced by kro ResourceGraphDefinition CEL statements
    • Configuration + localisation happen while deploying (no intermediate results are stored anymore)
      • Values for configuration and localisation (e.g. image locations, ...) can be hardcoded, passed through kro's instance, or resolved dynamically by referencing other k8s resources (as long as they are known in the cluster and accessible inside the graph).
    • To localise, we deploy an OCM resource CRD and store the access information of its source OCI reference in its status. Accordingly, we do not need an intermediate layer to provide the resources, as the original source is used.
      • works only for OCI artifacts
      • kro does not allow referencing JSON apiextension.RAW fields (problem: dynamic fields that are not known in advance). This raises an error in the ResourceGraphDefinition, as its dry run (which checks whether fields exist) fails accordingly.
        • API changes are required
    • All CRDs for configuration and localisation, e.g. localisation-rules, could be omitted.
  • The OCI storage backend implementation + zot-registry could be removed, assuming we don't need to store any resources from localisation or configuration (compatibility layer)
    • OCM component descriptors (lists) cannot be stored anymore; adjustments are needed
    • Omitting the storage backend means that localBlob resources cannot be deployed, as every resource needs a source OCI registry from which it can be fetched.
  • e2e-tests must be adjusted
    • Deployment of kro
    • Replace current resources with ResourceGraphDefinitions
  • "unit" controller tests must be adjusted to work without the storage
  • Config inheritance could be omitted, as it can be specified in the ResourceGraphDefinition directly. This would reduce complexity around the propagation policy of each config.
  • Probably requires the possibility to pack a ResourceGraphDefinition into an OCM component version and use it for deployment
    • Requires a new operator
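
As a minimal illustration of the first point above: values in an RGD template can be hardcoded, passed through the kro instance, or referenced from other resources in the graph. The chart values and resource id below are examples only; the `status.ociArtifact` path follows the Resource example further down in this thread.

```yaml
# Illustrative snippet from a hypothetical RGD HelmRelease template:
values:
  replicaCount: 2                              # hardcoded in the RGD itself
  ui:
    message: ${schema.spec.podinfo.message}    # passed in via the kro instance spec
  image:
    # referenced dynamically from another resource in the graph
    repository: ${resourceImage.status.ociArtifact.sourceReference.registry}
```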

Estimation of impact on Enduser Documentation updated (if applicable)

Estimation of impact on Internal technical Documentation created/updated (if applicable)

  • Technical documentation is already outdated and requires some work either way

Created refinable tasks for the actual implementation

  • As this spike is part of an ADR (Create a Deployment ADR #136), the refinable tasks will be part of the decision in this ADR. A quick overview is presented in the "Estimation of impact on existing code incl. tests" section above.

@frewilhelm
Contributor Author

Current progress is saved in https://github.com/frewilhelm/ocm-k8s-toolkit/tree/spike_kro (based on #98)

@frewilhelm
Contributor Author

A potential blocker could be that instances are not reconciled when their graph is updated. Thus, changes will not be propagated to the resources.

However, there are at least two issues, one of them a feature request from the maintainers themselves, that address this problem and aim to fix it.

@ikhandamirov
Contributor

Wow, sounds like a huge simplification!

@frewilhelm
Contributor Author

In a scenario in which the ResourceGraphDefinition is part of the CV, there is a potential race condition:

  • CV contains
    • OCM Resource (e.g. HelmChart to-be-deployed)
    • ResourceGraphDefinition that contains
      • k8s OCM resource (refers to OCM Resource in CV)
      • FluxCD OCI Repository (refers to location of CV stored in OCI registry)
      • FluxCD HelmRelease (points to FluxCD OCI Repository)

To deploy the resource, the user has to deploy the k8s resources OCMRepository (points to OCI registry in which the CV is stored), component (points to the component name and OCMRepository), resource (points to ResourceGraphDefinition in the CV and component), and a new CRD OCMDeployer (or the like) that references the resource for the ResourceGraphDefinition. The OCMDeployer takes the manifest of the ResourceGraphDefinition and deploys it. After creating an instance for the new kind from the ResourceGraphDefinition, kro will deploy the OCM resource (the HelmChart) using the resources for OCM and Flux from the ResourceGraphDefinition.
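
The chain described above might look roughly as follows. This is a sketch, not an existing API: the OCMDeployer kind does not exist yet, and the field names for OCMRepository and Component are assumptions; only the Resource fields follow the RGD example later in this thread.

```yaml
# Illustrative only: OCMDeployer is hypothetical, and the field names
# of OCMRepository/Component are assumptions for this sketch.
apiVersion: delivery.ocm.software/v1alpha1
kind: OCMRepository
metadata:
  name: my-repository
spec:
  baseUrl: ghcr.io/my-org          # OCI registry in which the CV is stored
---
apiVersion: delivery.ocm.software/v1alpha1
kind: Component
metadata:
  name: my-component
spec:
  repositoryRef:
    name: my-repository
  component: ocm.software/my-component
---
apiVersion: delivery.ocm.software/v1alpha1
kind: Resource
metadata:
  name: rgd-resource
spec:
  componentRef:
    name: my-component
  resource:
    byReference:
      resource:
        name: rgd                  # the ResourceGraphDefinition stored in the CV
  interval: 10m
---
apiVersion: delivery.ocm.software/v1alpha1
kind: OCMDeployer                  # hypothetical new CRD
metadata:
  name: rgd-deployer
spec:
  resourceRef:
    name: rgd-resource             # the OCMDeployer applies the RGD manifest
```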

The problem arises when the CV is updated to a new version. This triggers an update of the k8s component resource. However, the component resource is watched by (a) the RGD resource, which in turn is watched by the OCMDeployer, and (b) the k8s resource resource that contains the HelmChart. As an example, assume that the k8s resource resource for the HelmChart is removed from the updated CV. The trigger/watch for the RGD resource is fine, as this will just deploy the new graph. But what happens with the original k8s resource resource for the HelmChart? If it reconciles before the RGD is updated, it will fail, as the resource can no longer be found in the component.

This is not necessarily a problem in the context of eventual consistency, as we expect the k8s resource resource for the HelmChart to fail at first and then be removed once the RGD is updated (we assume the RGD is updated as well, since one resource was deleted). But we should keep such scenarios in mind.

@Skarlso
Contributor

Skarlso commented Mar 31, 2025

Also, let's consider the considerable setup complication (needing to install and configure kro), plus the overhead of people learning kro and maintaining the kro version (unless we farm this out to the infra maintainers, which is an additional burden on them in that case).

OCI storage backend implementation + zot-registry could be removed, assuming we don't need to store any resources from localisation or configuration (compatibility layer)

I don't understand this one. :) The registry is a cache and a sync point. It's not just there to share results, but also so that we don't have to re-download a 6 gigabyte image over and over when dealing with the same component or resource. Also, it's not explained how you would work with Flux then. Like, how do you present it with the created artifact that it needs to deploy?

@frewilhelm
Contributor Author

The registry is a cache and a sync point. It's not just there to share results, but it's also there so that we don't have to re-download a 6 gigabyte image when it's being fetched from somewhere over and over dealing with the same component or resource

If we omit the configuration and localisation (or rather move it to kro's ResourceGraphDefinition), then we do not have to download any image at all.

If the image must be available in a specific environment, one could use the replication controller to move the component version to that environment.

Also, it's not explained how you would work with Flux then? Like, how do you present it with the created artifact that it needs to deploy?

Assuming we omit the OCI registry, we would need to publish the original source (= OCI registry, HelmRepository, Git repository, or the like) in the status of the resource. This information can then be consumed via CEL to pass the location, for example, to a FluxCD OCIRepository. Consider the following example of a ResourceGraphDefinition:

apiVersion: kro.run/v1alpha1
kind: ResourceGraphDefinition
metadata:
  name: complicated-deployment
spec:
  schema:
    apiVersion: v1alpha1
    kind: ComplicatedDeployment
    spec:
      podinfo:
        releaseName: string
        message: string | default="hello, world"
  resources:
    - id: resourceChart
      template:
        apiVersion: delivery.ocm.software/v1alpha1
        kind: Resource
        metadata:
          name: static-resource-chart-name
        spec:
          componentRef:
            name: static-component-name # should be referenced/passed
          resource:
            byReference:
              resource:
                name: helm-resource
          interval: 10m
    - id: resourceImage
      template:
        apiVersion: delivery.ocm.software/v1alpha1
        kind: Resource
        metadata:
          name: static-resource-image-name
        spec:
          componentRef:
            name: static-component-name # should be referenced/passed
          resource:
            byReference:
              resource:
                name: image
          interval: 10m
    - id: ocirepository
      template:
        apiVersion: source.toolkit.fluxcd.io/v1beta2
        kind: OCIRepository
        metadata:
          name: helm-podinfo-config
        spec:
          interval: 1m0s
          layerSelector:
            mediaType: "application/vnd.cncf.helm.chart.content.v1.tar+gzip"
            operation: copy
          url: ${resourceChart.status.ociArtifact.sourceReference.registry}/${resourceChart.status.ociArtifact.sourceReference.repository} 
          ref:
            digest: ${resourceChart.status.ociArtifact.digest}
    - id: helmrelease
      template:
        apiVersion: helm.toolkit.fluxcd.io/v2
        kind: HelmRelease
        metadata:
          name: ${schema.spec.podinfo.releaseName}
        spec:
          releaseName: ${schema.spec.podinfo.releaseName}
          interval: 1m
          timeout: 5m
          chartRef:
            kind: OCIRepository
            name: ${ocirepository.metadata.name}
            namespace: default
          values:
            # Localisation
            image:
              repository: ${resourceImage.status.ociArtifact.sourceReference.registry}/${resourceImage.status.ociArtifact.sourceReference.repository}
              tag: ${resourceImage.status.ociArtifact.sourceReference.reference}
            # Configuration
            ui:
              message: ${schema.spec.podinfo.message}
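
From the schema above, kro generates a ComplicatedDeployment CRD; an instance of it would then look roughly like this (the name and values are examples):

```yaml
# Instance of the CRD that kro generates from the RGD schema above.
apiVersion: kro.run/v1alpha1
kind: ComplicatedDeployment
metadata:
  name: podinfo
  namespace: default
spec:
  podinfo:
    releaseName: podinfo
    # optional: defaults to "hello, world" per the schema
    message: "hello from kro"
```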

@Skarlso
Contributor

Skarlso commented Mar 31, 2025

If the image must be available in a specific environment, one could use the replication controller to move the component version to that environment.

The replication controller was all but archived.

This information can then be taken as CEL to pass the location for example to a FluxCD OCI Repository. Consider the following example of an ResourceGraphDefinition:

How would that work with resources modified by the OCM client during transfer? Which step would that be? So you declare a component version, you declare a target, and then deploy all that with kro, and by the end there would be a target registry, with a status updated to hold the location of the end result in a registry, I assume?

Keep in mind that this all needs to work offline.

@frewilhelm
Contributor Author

Note: A replication cannot be part of an RGD inside the same CV that is supposed to be replicated by that replication.

@Skarlso
Contributor

Skarlso commented Mar 31, 2025

We talked about this on slack. Outcome:

  • since loc/conf won't be part of the main flow anymore, the rationale for the registry becomes much weaker
  • shared some historical reasons behind the local registry, including DMZs where the registry was the target (this can, of course, be mitigated if the infrastructure maintainers run their own registry)
  • shared two concerns:
    • increased burden for the infra maintainers (kro + a local registry in the DMZ case)
    • kro is quite new in the game (about a year old), which might raise concerns with some clients who have strict environment policies

@frewilhelm
Contributor Author

The spike is closed and the implementation will be tracked in #172

@github-project-automation github-project-automation bot moved this from 🏗 In Progress to 🍺 Done in OCM Backlog Board Apr 1, 2025
@ocmbot ocmbot bot added this to the 2025-Q2 milestone Apr 1, 2025