Skip to content

Conversation

sradco
Copy link
Contributor

@sradco sradco commented Aug 17, 2025

What this PR does / why we need it:
Updated CDIDataImportCronOutdated to fire
only if the issue is related to the
Pre-defined golden images.

Add a CDIUserDefinedDataImportCronOutdated alert
for user defined DIC that will not impact the
operator health.

Updated the namspabe label name fro ns to
namespace, since each alert should report a
namespace.

Signed-off-by: Shirly Radco [email protected]

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #https://issues.redhat.com/browse/CNV-67060

Special notes for your reviewer:

Release note:

Added CDIUserDefinedDataImportCronOutdated alert and updated CDIDataImportCronOutdated condition so it will fire only for pre-defined golden images.

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Aug 17, 2025
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign akalenyu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls
Copy link

coveralls commented Aug 17, 2025

Coverage Status

coverage: 59.319%. remained the same
when pulling 954f4ce on sradco:fix_CDIDataVolumeUnusualRestartCount_alert
into 42df94a on kubevirt:main.

Copy link
Contributor

@Acedus Acedus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, please see inline comments.

{
Alert: "CDIDataImportCronOutdated",
Expr: intstr.FromString(`sum by(ns,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false"}) > 0`),
Expr: intstr.FromString(`sum by(namespace,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false", namespace="openshift-virtualization-os-images"}) > 0`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSP controls the name of that namespace, it isn't limited specifically to OpenShift.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean. namespace should reprt the cron namespace. The current metric has a bug.

Copy link
Contributor Author

@sradco sradco Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prometheus will overwrite the default namespace with the one we supply.

},
{
Alert: "CDIUserDefinedDataImportCronOutdated",
Expr: intstr.FromString(`sum by(namespace,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false", namespace!="openshift-virtualization-os-images}) > 0`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Updated CDIDataImportCronOutdated to fire
only if the issue is related to the
Pre-defined golden images.

Add a CDIUserDefinedDataImportCronOutdated alert
for user defined DIC that will not impact the
operator health.

Updated the namspabe label name fro ns to
namespace, since each alert should report a
namespace.

Signed-off-by: Shirly Radco <[email protected]>
@sradco sradco force-pushed the fix_CDIDataVolumeUnusualRestartCount_alert branch from 02f25b4 to 954f4ce Compare August 18, 2025 15:27
@kubevirt-bot
Copy link
Contributor

kubevirt-bot commented Aug 18, 2025

@sradco: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-containerized-data-importer-e2e-destructive 954f4ce link true /test pull-containerized-data-importer-e2e-destructive

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Contributor

@Acedus Acedus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @akalenyu

@kubevirt-bot kubevirt-bot requested a review from akalenyu August 20, 2025 06:27
@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 20, 2025
Copy link
Collaborator

@akalenyu akalenyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch, my concern is about making CDI aware of those namespaces

{
Alert: "CDIDataImportCronOutdated",
Expr: intstr.FromString(`sum by(ns,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false"}) > 0`),
Expr: intstr.FromString(`sum by(namespace,cron_name) (kubevirt_cdi_dataimportcron_outdated{pending="false", namespace=~"openshift-virtualization-os-images|kubevirt-os-images"}) > 0`),
Copy link
Collaborator

@akalenyu akalenyu Aug 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am struggling a bit with CDI having to "know" about "golden image namespaces" or "special dataimportcrons".
I think we should move this definition to other operators that encapsulate this knowledge.

Alternatively, we could also label the metric with this information though I don't know if anything gives it away easily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I was told the images in openshift-virtualization-os-images|kubevirt-os-images are the golden images.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nunnatsa am I correct?

@machadovilaca
Copy link
Member

machadovilaca commented Aug 21, 2025

release note has a typo

looks good to me
but i don't think the PR title is appropriate, what was the bug? the namespace label? but that doesn't justify the addition of a new alert in this PR

@sradco
Copy link
Contributor Author

sradco commented Aug 21, 2025

release note has a typo
Fixed
looks good to me
but i don't think the PR title is appropriate, what was the bug? the namespace label? but that doesn't justify the addition of a new alert in this PR

We had to split the alert, but it is still valuable for all namespaces.
The difference is the audience for the alerts.
One alert impact the operator health and the other doesnt

@sradco
Copy link
Contributor Author

sradco commented Aug 24, 2025

Hi @Acedus, I will appreciate a review of this PR.

@Acedus
Copy link
Contributor

Acedus commented Aug 24, 2025

I had some more time to think about it and I'm a bit split on whether the alert in its current form should even exist in CDI to begin with. As @akalenyu correctly stated, golden images is an SSP concept, it isn't directly part of CDI.

With that said, the other alert which targets ANY namespace besides the golden images one could belong to CDI.

I'm leaning more towards moving these alerts to SSP, seeing as they have the relevant context.

@sradco
Copy link
Contributor Author

sradco commented Aug 26, 2025

@arnongilboa @nunnatsa please review. What do you think the comment that @Acedus made about moving these alerts to SSP?

@nunnatsa
Copy link
Contributor

nunnatsa commented Aug 27, 2025

@sradco - the whole concept is very hard to implement, because the relevant information is spread across three different components.

First, the namespace is not the right way to distinguish between pre-prepared and user-defined images, for two reasons:

  1. By default, user-defined images will be created on the *-os-images namespace as well.
  2. The images namespace is configurable (in the HyperConverged CR) and may be changed by the user.

The only component that "knows" if an image is pre-defined or user-defined is the HCO. HCO adds the images (the DataImportCronTemplate objects) to SSP, with no indication if they are pre-defined or user defined, or a modified pre-defined. BTW, this information is reflected in the HyperConverged status.

However, HCO has no information about the DataImportCron CRs, the DataSources or the VolumeSnapshots. HCO does not know them, nor watch or read them.

@Acedus
Copy link
Contributor

Acedus commented Aug 27, 2025

I think this may be solvable by adding a label to golden image DICs and adding that label to the timeseries used to make the alerts. That way we allow to shift the alerts into SSP while also allowing users to further discern between default images and user-created ones.

@sradco
Copy link
Contributor Author

sradco commented Aug 27, 2025

@Acedus @arnongilboa @machadovilaca Since this has a high impact on the operator health indicator, I created a simpler PR #3885 to address minor changes.

I will keep this one open and create an RFE for the suggested changes.

@akalenyu
Copy link
Collaborator

akalenyu commented Aug 27, 2025

I think this may be solvable by adding a label to golden image DICs and adding that label to the timeseries used to make the alerts. That way we allow to shift the alerts into SSP while also allowing users to further discern between default images and user-created ones.

+1
Maybe we'd have to iron out some details like whether the label is excluding them from the series or the other way around but overall this is a solid approach imo

@sradco sradco changed the title Fix bug with CDIDataImportCronOutdated alert [WIP]Fix bug with CDIDataImportCronOutdated alert Aug 27, 2025
@kubevirt-bot kubevirt-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2025
@nunnatsa
Copy link
Contributor

I think this may be solvable by adding a label to golden image DICs and adding that label to the timeseries used to make the alerts. That way we allow to shift the alerts into SSP while also allowing users to further discern between default images and user-created ones.

+1 Maybe we'd have to iron out some details like whether the label is excluding them from the series or the other way around but overall this is a solid approach imo

Agree. Since the source of all the pre-defined (common) DICs is a static file in the HCO image, we can do it w/o code change in HCO/SSP, IIUC.

We can use either label or an annotation.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2025
@kubevirt-bot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants