[WIP]Fix bug with CDIDataImportCronOutdated alert #3867
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks, please see inline comments.
{
Alert: "CDIDataImportCronOutdated",
- Expr: intstr.FromString(`sum by(ns,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false"}) > 0`),
+ Expr: intstr.FromString(`sum by(namespace,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false", namespace="openshift-virtualization-os-images"}) > 0`),
SSP controls the name of that namespace, it isn't limited specifically to OpenShift.
Not sure what you mean. The namespace label should report the cron's namespace. The current metric has a bug.
Prometheus will overwrite the default namespace with the one we supply.
},
+ {
+ Alert: "CDIUserDefinedDataImportCronOutdated",
+ Expr: intstr.FromString(`sum by(namespace,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false", namespace!="openshift-virtualization-os-images"}) > 0`),
Same here.
Updated CDIDataImportCronOutdated to fire only if the issue is related to the pre-defined golden images. Added a CDIUserDefinedDataImportCronOutdated alert for user-defined DICs that will not impact the operator health. Renamed the namespace label from ns to namespace, since each alert should report a namespace. Signed-off-by: Shirly Radco <[email protected]>
02f25b4 to 954f4ce
@sradco: The following test failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/cc @akalenyu
Great catch, my concern is about making CDI aware of those namespaces
{
Alert: "CDIDataImportCronOutdated",
- Expr: intstr.FromString(`sum by(ns,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false"}) > 0`),
+ Expr: intstr.FromString(`sum by(namespace,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false", namespace=~"openshift-virtualization-os-images|kubevirt-os-images"}) > 0`),
So I am struggling a bit with CDI having to "know" about "golden image namespaces" or "special dataimportcrons".
I think we should move this definition to other operators that encapsulate this knowledge.
Alternatively, we could also label the metric with this information though I don't know if anything gives it away easily.
From what I was told the images in openshift-virtualization-os-images|kubevirt-os-images are the golden images.
@nunnatsa am I correct?
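For reference, PromQL regex label matchers such as `namespace=~"..."` are fully anchored, so the pattern in the hunk selects exactly those two namespace names and nothing that merely contains them. A small Go sketch that approximates this behavior is below; `matchesLikePromQL` is a hypothetical helper written for illustration, and the extra namespace names in `main` are made up. It only shows which names the matcher would select, not whether a given DataImportCron is actually pre-defined.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesLikePromQL approximates a PromQL `=~` label matcher, which is
// fully anchored, by wrapping the pattern in ^(?:...)$.
func matchesLikePromQL(pattern, value string) (bool, error) {
	re, err := regexp.Compile("^(?:" + pattern + ")$")
	if err != nil {
		return false, err
	}
	return re.MatchString(value), nil
}

func main() {
	const pattern = "openshift-virtualization-os-images|kubevirt-os-images"
	// "my-kubevirt-os-images" and "default" are illustrative names only.
	for _, ns := range []string{"kubevirt-os-images", "my-kubevirt-os-images", "default"} {
		ok, err := matchesLikePromQL(pattern, ns)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-24s selected=%v\n", ns, ok)
	}
}
```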
The release note has a typo; otherwise looks good to me.
We had to split the alert, but it is still valuable for all namespaces.
Hi @Acedus, I would appreciate a review of this PR.
I had some more time to think about it and I'm a bit split on whether the alert in its current form should even exist in CDI to begin with. As @akalenyu correctly stated, golden images are an SSP concept and are not directly part of CDI. With that said, the other alert, which targets ANY namespace besides the golden images one, could belong to CDI. I'm leaning more towards moving these alerts to SSP, seeing as they have the relevant context.
@arnongilboa @nunnatsa please review. What do you think about the comment that @Acedus made about moving these alerts to SSP?
@sradco - the whole concept is very hard to implement, because the relevant information is spread across three different components. First, the namespace is not the right way to distinguish between pre-prepared and user-defined images, for two reasons:
The only component that "knows" if an image is pre-defined or user-defined is the HCO. HCO adds the images (the DataImportCronTemplate objects) to SSP, with no indication of whether they are pre-defined, user-defined, or a modified pre-defined. BTW, this information is reflected in the HyperConverged status. However, HCO has no information about the DataImportCron CRs, the DataSources or the VolumeSnapshots. HCO does not know about them, and does not watch or read them.
I think this may be solvable by adding a label to golden image DICs and adding that label to the timeseries used to make the alerts. That way the alerts can move into SSP, and users can still distinguish between default images and user-created ones.
@Acedus @arnongilboa @machadovilaca Since this has a high impact on the operator health indicator, I created a simpler PR #3885 to address minor changes. I will keep this one open and create an RFE for the suggested changes.
+1
Agree. Since the source of all the pre-defined (common) DICs is a static file in the HCO image, we can do it w/o code change in HCO/SSP, IIUC. We can use either a label or an annotation.
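To make the label idea concrete, here is a minimal Go sketch of how a label on the DataImportCron object might be carried over to a metric label that the alerts filter on. Everything named here is an assumption for illustration: the thread does not specify a label key, so `example.com/golden-image` and the `golden_image` metric label are hypothetical, as are the helper names.

```go
package main

import "fmt"

// goldenImageLabelKey is a hypothetical object label key; the thread does
// not name one.
const goldenImageLabelKey = "example.com/golden-image"

// metricLabelValue maps a DataImportCron's object labels to the value of a
// hypothetical `golden_image` metric label. A missing label counts as "false".
func metricLabelValue(objLabels map[string]string) string {
	if objLabels[goldenImageLabelKey] == "true" {
		return "true"
	}
	return "false"
}

// alertExpr builds an alert expression filtered by the hypothetical
// `golden_image` metric label instead of by namespace.
func alertExpr(golden bool) string {
	return fmt.Sprintf(`sum by(namespace,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false", golden_image="%t"}) > 0`, golden)
}

func main() {
	labeled := map[string]string{goldenImageLabelKey: "true"}
	fmt.Println(metricLabelValue(labeled)) // labeled, pre-defined DIC
	fmt.Println(metricLabelValue(nil))     // unlabeled, user-defined DIC
	fmt.Println(alertExpr(true))
	fmt.Println(alertExpr(false))
}
```

With this shape the alert rules no longer need to know any namespace names; whichever component owns the pre-defined templates would be responsible for setting the label.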
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What this PR does / why we need it:
Updated CDIDataImportCronOutdated to fire only if the issue is related to the pre-defined golden images.
Added a CDIUserDefinedDataImportCronOutdated alert for user-defined DICs that will not impact the operator health.
Renamed the namespace label from ns to namespace, since each alert should report a namespace.
Signed-off-by: Shirly Radco [email protected]
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #https://issues.redhat.com/browse/CNV-67060
Special notes for your reviewer:
Release note: