Skip to content

Conversation

@KA-Takeuchi
Copy link
Contributor

This is an enhancement proposal for Dynamic Scoring Framework addon.
Please review this PR when you have a chance.

@openshift-ci openshift-ci bot requested review from deads2k and qiujian16 November 17, 2025 06:01
@KA-Takeuchi KA-Takeuchi changed the title Proposal for Dynamic Scoring Framework Adon Proposal for Dynamic Scoring Framework Addon Nov 17, 2025
@qiujian16
Copy link
Member

please run

git commit --amend -s
git push -f

to signoff the commit.

- Perform health checks on registered APIs and disable those that are not functioning properly.
- Manage API configurations (e.g., query range, frequency).
- In some cases, the API provider may want to enforce specific configurations; the framework should support applying these settings.
- In other cases, the PF operator may wish to set configurations according to their own requirements.
Copy link
Member

@qiujian16 qiujian16 Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does PF operator stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification.


### Non-Goals

Implementation of the evaluation logic itself.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some examples or library would make it easier to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification.


##### Scoring API Schema Propagation

An example PromQL query to configure in the Scoring API source is shown below.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a mechanism to surface the misconfiguration of query? That is useful since such query could fail and the user needs to know the reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added error notification in Dynamic Scoring Framework Usage Flow Diagram.


1. it must have ```/healthz``` endpoint
2. it must have ```/scoring``` endpoint (schema follows above section)
3. it must have ```/config``` endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it implies user can either config through CR or use this endpoint? But this will leave some questions:

  1. how to handle conflict between CR based configuration and this endpoint?
  2. how this endpoint be authenticated and authzed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification. I think it’s fine for the /config endpoint to be publicly accessible, so I don’t see a need for token protection. What’s your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah it makes sense.


### Alternative 2: No use prometheus-compatible component

- Agents collect metrics directly from Kubernetes API or other sources (e.g. OpenTelemetry).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this as an valid option, since not every cluster has prometheus deployed.

Being able to support different metric sources seems like good item for beta phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your opinion. On the other hand, if we try to use OpenTelemetry directly, the Agent would need to include a mechanism to store the data, right?
Do you have any suggestions on what to use for data store?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends. If only transient metrics are used, persistence is not necessary. I think we can also configure persistence for otel collector. There are certainly some limitation of using otel, but it is much simpler and more lightweight than prometheus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see. For now, I'm using Prometheus as the source, but in some cases I may want to change the source (none/another CR/OTel, etc.). So I'm considering adding DynamicScorer.spec.source.type to my implementation. Is this acceptable to you?

```yaml
spec:
mask:
- clusterName: "cluster1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this to define "Do not generate this score in this cluster"? With that shouldn't the cluster have a very low score so scheduler will not pick it? Why this need to be specifically configured? I think agent can find that this metrics does not exist and just generate a low default score?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added clarification. The mask is meant to reduce unnecessary access to the Scoring API. Make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm, it is not clear to me how would user decide that a certain score should not be reported. It seems like the user needs to at first find out that there is no certain metrics on a cluster then decide whether the score needs to be masked? But I agree this is useful in some cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. From my perspective, I was assuming a scenario where the user already owns on-premise servers and has a prior understanding of the cluster’s hardware specifications, so that’s why I was considering this kind of feature.
In a cloud environment, the needs might be a bit different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, I think it is like a way to optimize. I think we will need to handle score if there is no certain hardware and mask is not set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when no mask is set, agents in each cluster will evaluate using all available scoring APIs.

@qiujian16
Copy link
Member

finish the 1st pass, I think the the content is pretty good.


6. **Visualize and utilize scores with Prometheus/Grafana/ResourceArrangement, etc.**

```plantuml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub does not offer native, built-in rendering of PlantUML diagrams. Would it be possible to replace this with the Mermaid format? It will be easier for others to read if they are interested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows:

```plantuml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@enduml
```

ScorerSummary is a flattened summary of DynamicScorer CR information and is distributed to managed clusters as follows:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ScorerSummary be distributed directly to managed clusters or wrapped inside a configmap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wrapped inside a configmap.

KA-Takeuchi and others added 2 commits November 20, 2025 15:39
Removed superseded-by section from metadata.

Signed-off-by: Kazuma Takeuchi <[email protected]>

Implementation of the evaluation logic itself.
The evaluation logic is expected to be implemented outside the Dynamic Scoring Framework.
(But hte interfaces for Scoring APIs are defined in the framework.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hte -> the

spec:
description: A simple prediction scorer for time series data
scoreDestination: AddOnPlacementScore
scoreDimensionFormat: "${node}-${namespace}-${pod}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be a slice? since you can put multiple scores in addonPlacementScore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some notes about the behavior when there are multiple dimensions. Does this match your intention?


1. it must have ```/healthz``` endpoint
2. it must have ```/scoring``` endpoint (schema follows above section)
3. it must have ```/config``` endpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah it makes sense.

```yaml
spec:
mask:
- clusterName: "cluster1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm, it is not clear to me how would user decide that a certain score should not be reported. It seems like the user needs to at first find out that there is no certain metrics on a cluster then decide whether the score needs to be masked? But I agree this is useful in some cases


### Alternative 2: No use prometheus-compatible component

- Agents collect metrics directly from Kubernetes API or other sources (e.g. OpenTelemetry).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends. If only transient metrics are used, persistence is not necessary. I think we can also configure persistence for otel collector. There are certainly some limitation of using otel, but it is much simpler and more lightweight than prometheus.


#### DynamicScoringConfig Definition Details

DynamicScoringConfig is a CR that aggregates the current information of registered DynamicScorers and distributes it to managed clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be more explanation here regarding DynamicScoringConfig's structure and example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve added it. In fact, the DynamicScoringConfig itself only contains fields like mask, so the important part is generating the list of ScorerSummary entries in the ConfigMap. Please check the flow diagram, and I’ll add more if needed.


A CR (Custom Resource) for registering scoring APIs. The fields are defined as follows:

```go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to specify the required and optional fields here for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some notes after the code block. How does it look?

@@ -0,0 +1,14 @@
title: dynamic-scoring-framework-addon
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pls rename the 83 to 166 as 166 as the related issue number.

Signed-off-by: Kazuma Takeuchi <[email protected]>
@qiujian16
Copy link
Member

I think it is written well. I'd like this #165 (comment) to be clarified before merged

/approve

leave to @haoqing0110 for final review.

@openshift-ci
Copy link

openshift-ci bot commented Dec 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: KA-Takeuchi, qiujian16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Dec 1, 2025
@haoqing0110
Copy link
Member

@KA-Takeuchi Thank you for contributing this enhancement. Merging now!
/lgtm

@haoqing0110
Copy link
Member

/hold

@openshift-merge-bot openshift-merge-bot bot merged commit 7fd6710 into open-cluster-management-io:main Dec 2, 2025
2 checks passed
@haoqing0110
Copy link
Member

/unhold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants