-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster-state manifest drift correction #186
Comments
Tagging a few folks as per the PRs to the reconcile helmchart part of code @hiddeco @seaneagan . |
The helm-controller does not support this at present, nor did the Helm Operator. I have been playing around with proof of concepts to determine changes based on the manifest blob that is stored in the Helm storage based on 2 and 3-way merges, but ran into some issues around mutating cluster webhooks that may modify the object when it is applied on the cluster by Helm. That being said (and done), I do want to eventually support this, but do not have a time frame on when. |
@hiddeco any updates, just realised that a deployment was deleted but helm was yay happy. |
we are looking forward to have this feature!! Any updates? |
Facing the same issue. Reconciliation does not work with resources deployed by Helm. |
i'm also looking forward to have this feature! what's the current status? |
You could try to do a cronjob with manual reconciliation
|
The only way currently to restore the helmrelease to the defined state is to reinstall or upgrade it, or am I missing something? |
I deleted a crd (and all its crs) and recreated it, however helm controller does not recreate the objects from the helm release. Now my cluster is in an inconsistent state and it does not recover. Even if I call |
After consultation with the Helm maintainers, the blue print for this is in: https://github.com/bacongobbler/helm-patchdiff/blob/8cbbc9b4716330fd4e69b5fea93320fa5da1c352/main.go#L289-L294. This could be implemented after the planned work in https://github.com/fluxcd/helm-controller/milestone/1 has landed (and our testing suite has been strengthened). |
|
we are looking forward to have this feature, Any updates about when it could be available? |
Should be there in the docs: the moment you start using HelmRelease = you lose all the GitOps flow |
Any news about this issue? @hiddeco is there any plan for supporting reconciliation in case of changes manually applied by user ? |
This topic comes up a lot, I wanted to reiterate these thoughts I placed on #267 originally, but while they don't exactly belong on that discussion, they can perhaps go here: Helm uses a 3-way merge to incorporate: (1) the old manifest, (2) its live state on the cluster, and (3) the new manifest. Helm also provides various opportunities for the installing agent to manipulate the state before, around and after install, both within the chart (lifecycle hooks) that are permitted to be non-deterministic (and so may have outputs that are not declaratively formed), and also outside of the chart, with Post Render behavior, that outside of Flux, would also be capable of non-deterministic behavior. Simplified: The installer can create secrets on the first run ( Though Flux makes the postrender behavior declarative, it's still running the Helm code under the hood, which does not assume declarativeness, and that means we can't assume anything that Helm itself would not have assumed or guaranteed. Every upgrade produces a new So when you have just installed for the first time and you have a handful of Helm charts, your cluster-wide secrets may look like this:
Three-way diffs are complicated, even more complicated than three-way merges. Now imagine instead of comparing the inputs to the outputs, Helm controller just runs an upgrade each time, producing a new revision and a new versioned secret. The two secrets shown here for I'm not sure if philosophically this can mesh with the "no imperative magic" philosophy of GitOps, that expects all changes to be declarative, but if there was a safe place to put such an annotation in the spec that was guaranteed not to have any other side-effect on anything down-stream other than to force a single reconciliation that would definitely result in an Upgrade as long as it succeeds, it would certainly make sense for the CLI to include a way to trigger that behavior. Part of the problem is that you can't guarantee nobody is watching. We usually recommend people who are struggling with this issue to find the https://fluxcd.io/docs/components/helm/helmreleases/#configuring-failure-remediation The least wrong way to approach this now, is to include some values from a configMap that will not be read by the chart: https://fluxcd.io/docs/guides/helmreleases/#refer-to-values-in-configmaps-generated-with-kustomize and update the configmap with a new nonce when you want to do the force-upgrade. Some users have reported they can trigger an upgrade from the failed state when running Understanding that none of this directly addresses the original topic "why doesn't Helm Controller resolve drift in my cluster" it seems that many people who are asking for that behavior, really are willing to use a "Closed Loop" and don't expect drift to come into their cluster from random places, but they are just looking for infinite retries and failure remediation, and those are all options which can be configured... though it is not the default behavior to retry infinitely, and for some very good reasons. |
My philosophical answer so far: rewrite all the charts (using |
if fluxcd works like this, then how does flagger handle the blue/green deployment? |
Flagger uses Git to facilitate rollback, rather than Helm. https://github.com/fluxcd/flagger/blob/main/docs/gitbook/tutorials/canary-helm-gitops.md Although you can use Flagger and Helm together, the canary in this example is inside of the Helm chart, not around or outside of the Helm Release. So when the canary fails, the traffic continues routing to the old primary which is not affected or upgraded (it's the prior state of the Helm release's deployments, associated refs from the canary spec, HPAs, etc.). You move on to the next state the same as with any Flagger canary, by either git-reverting and allowing a new canary to run again with the old release, or perhaps more likely and more beneficial, by letting the Canary keep the stable release going, and pushing another release after this with the fix to whatever issue you observed in the canary. I don't think you really would likely want to configure automatic rollback or other failure remediations together with Flagger. Actually, I'm pretty sure you'd want to disable remediation in that case. |
This is on my to-do list as part of an overhaul of the general logic of the reconciler. Given the changes that at times occur to my day-to-day priorities, I can not provide an ETA at this time other than "next (next) minor release with more bigger changes". |
If k8s kustomize supported a new option in configMapGenerator to instruct "kustomize build" to set a new .metadata.name for the generated ConfigMap object each time it is invoked, even when the ConfigMap's .data remains the same (e.g. adding some timestamp suffix to the ConfigMap object name)) that would really help here: getting a HelmRelease reconciliation each time the Flux2 Kustomization manifest wrapping the HelmRelease objct is reconciled (as set by its .spec.interval)
This would require low effort from Flux2, i.e. compiling it against the new kustomize pkgs (provided from k8s guys) supporting that new configMapGenerator option. Could this be a possible solution to try here? |
If it's a new feature from Kustomize upstream, I'd imagine it will be supported by Flux if possible, but if it acts as you describe I think it's not going to have the intended effect, instead something more like a new deployment rev or new instance of the job every single I guess that you wanted this in order to achieve drift correction with Helm only with the cadence of Kustomize controller's reconciliation, since that is the topic here. Hypothetically it would be possible, with some of the memory and performance advancements I think Helm controller today is a lot better equipped for this than Helm controller at any time from 3-12 months ago thanks to caching and other advances. Not sure this proposed approach is the way to go about it though. I'm not sure if the result will likely be good for Helm secret storage as the retention will run any meaningful changes off the history in short order with an endless stream of near-noop releases, having one being promoted every interval. The design of the controller is intentional about not running up the release counter when nothing really has changed, so in the event it's needed, Helm rollbacks from Helm secret will still be meaningful and useful more than 10 minutes after the fact of a release. Since there are changes, and revisions are immutable, the new configmap in the definition of each container will likely also trigger a rolling of the deployment or a retrigger of the job if any, etc. for various reasons I think this solution as you propose is not ideal, a better solution would have ideally found a way to achieve drift detection without this side-effect of spamming the release history and potentially also with a better awareness of the Helm lifecycle hooks (maybe even user-facing control of whether re-executions should retrigger the hooks or not, I don't know how granularly this is possible or even if at all). I'm not sure how much of that's planned now, but it is in the milestone for Flux 2.0 GA so I think broadly we can count on this hopefully being resolved in a way that is a bit more ergonomic and no-surprises than adding a purely-random-suffix feature from Kustomize upstream. |
Yes, I am looking for an automated way (i.e. without the need for any manual "git commit" to be applied) to make the If I run a Note the ConfgMap I proposes to get a new name for each time flux2 Kustomization is reconciled (as instructed by its P.S: Just to clarify: I do not know about any intention from k8s Kustomize guys to add something like this in the configMapGenerator (it is nothing more than a "possible request" to them). |
Actually I can simulate all this adding, to the k8s kustomization.yaml's configMapGenerator two files:
As the HelmRelease .spec.valuesFrom points to a different ConfigMap each time I change (and git commit) the content of "discard,yaml" (once the flux2 kustomization reconciliation runs) => a new reconciliation is run by the helm-controller on that updated HelmRelease object and so kubectl changes applied on the API objects "owned" by that Helm release are "repaired" even when keeping untouched the referenced Helm chart version ( The k8s Kustomize configMapGenerator feature commented above would be only a "trick" to be able to get the "desired" change in the HelmRelease object |
@kingdonb you wrote above:
I do not see how this proposed behavior could spoil the current installRemediation / upgradeRemediation behavior .... (???) |
Additionally, as part of the tests I am running, I just modified the .spec.interval of a HelmRelease object and it triggered a reconciliation from the helm-controller (I understand triggered from the mismatch between the HelmRelease object's .metadata.genetion and the .status.observedGeneration) what makes sense to me => a But then all the arguments against running a HelmRelease reconciliation when chart and values are kept the same could apply here ... but I think we all agree the reconciliation applied by the helm-controller because of whatever change in the .spec stanza is right in this case. Why then it would be wrong to do the same periodically (if so set/requested in an enhanced HelmRelease object) to correct possible drifts and so getting a perfect also at Helm level GitOps handling from Flux2? |
This is not the only thing we have to worry about nor the only functional use of the helm history. Strict compatibility with the Helm client is a highly important goal for Helm Controller and we have been able to preserve that strong compatibility until now. We have this guarantee for users that they can come and go from Helm Controller as they please, and no other GitOps tool offers this today; the Helm history is maintained exactly as if you created a release using the Users can execute a manual rollback after the release is marked
Drift detection should not break the function of If Helm Controller is going to perform drift detection and preserve the state of each release according to GitOps, it must not break this invariant in the process. That's all I'm imputing here, I am not the expert on this and I'm simply commenting as a passenger, not as a decision maker. I think the decision has already been made to implement this. I do not know the details of those decisions. However, I know what will go wrong if we simply run upgrade on every interval; please take a look at We had a bit of discussion about this at the most recent Flux Dev Meeting in public, it was recorded, and should appear on https://www.youtube.com/channel/UCoZxt-YMhGHb20ZkvcCc5KA soon (but I see we are a bit behind our backlog for uploading these recordings.) The short version is, the feature is planned to be implemented, and there are these hurdles. |
Thanks @kingdonb for the time you are putting on this (really appreciated ;-) Yes, I know about the per-Helm_release_revision k8s Secret objects and the But I am more than happy if a coming flux2 (helm-controller) release is solving it in a different way that does not add those additional drawbacks ;-) Thanks again for sharing your view and future flux2/helm-controller plans ! |
I am also looking for this feature implemented, as automatic reconciliation was the main feature due to which we started using Flux for our K8s deployments. I came to know about this recently when some one manually scaled down the replica count of a deployment to 0 & flux didn't reconcile it back to the previous state, until we suspend/resume the HelmRelease (would be good if we can clearly call this out in Flux Docs). However if I am ok,
Do you see any other challenges/unforeseen issues in suspending the HelmReleases & resuming it periodically using a script. Since this is already getting implemented in Flux 2.0, once the feature is out, I can just discard this script & start using Flux native feature. Since we are having lot of deployments which are managed via Flux HelmRelease, manually adding/reverting the configMap option shared above might be difficult (or making sure that everyone remember to add it). |
That sounds like a solid understanding of the issues at play 👍
Generally the main issue is that someone is making changes outside of the gitops delivery, which is more like what I'd consider a root cause here. Does this happen all the time? I tell people when they're using GitOps, they should treat You should have monitoring in place, such that when this person has scaled down the important deployment, an SRE/op gets notified and someone has to acknowledge the issue. If you have monitoring but it doesn't let you know when this important service has no backends/endpoints then it's an issue that should be resolved on the monitoring side. When you follow the Flux prometheus guide, you get a lot of these building blocks – I'm not sure how one configures Prometheus to let you know when a service has no backends, but I'm sure there must be a way. (Wish I could be more helpful there...) The script approach doesn't have any drawbacks I can think of other than the ones you've mentioned. There is some performance impact, because each HelmRelease upgrade has to load the chart and will consume a lot of resources while the upgrade is in progress. I definitely would expect the script to run every (N) hours rather than every (N) minutes for this reason. But if you are on the latest version of Flux, many of the performance issues will have been mitigated. It really depends a lot on how your sources are arranged and whether you've adopted the new The monitoring that I mentioned will give you an idea of how great is the performance impact, if you've configured it according to the Flux monitoring guide. This change supporting drift correction in Helm Controller is expected Real Soon Now, so please be prepared to throw the script away very soon. 🥸👩💻 |
@kingdonb do you know/have any estimation on when this feature could be in Flux2's helm-controller? |
In flux v1, if a deployment spec was known to have been manually edited in the cluster, we would Is this still the way flux2 works? |
@benjimin yes it works too + there's a sequence of |
And there is a third way, for when you really don't want to incur downtime, if you have a failed release with |
This feature is planned to be experimentally supported in the next minor release, after which it will be enabled by default once we have landed things like the long due #532 (and follow ups on this). The reason it will be first experimentally supported is that:
On a second note, I want to let you know that for the foreseeable future (or unless things of utter priority come up) the helm-controller now has my full attention. Which took some time after we reshaped our roadmap to focus on the Flux Source APIs first (and stop running around like crazy), but means that I am at present "ahead of schedule". |
Closing this in favor of #643. Please provide (short) feedback there, or create specific issues for problems you're experiencing. Thank you all! |
It seems that above solution doesn't work anymore, but using Thank you! |
Suppose
HelmRelease
object deploys a helm chart that contains a deployment. The pod resource can now be updated by the user. Until there is a chart version update inHelmRelease
object, the pod will continue to have the manually updated setting. I'm looking for a way to force reconcile the resources deployed by theHelmRelease
after every x minutes. Does helm-controller support this?There are a few other intervals like one used by HelmRepository and HelmRelease but they don't seem to reconcile a previous succeeded release unless there is a new version in the HelmRelease.
requeueDependency
interval is only responsible for reconcile if the dependencies for that helmrelease failed. These seem to be unrelated to my scenario.The text was updated successfully, but these errors were encountered: