Skip to content

Aggregate CPU and Memory accross all nodes instead of per node to fix AWS jobs#3844

Open
ronaldngounou wants to merge 1 commit intokubernetes:masterfrom
ronaldngounou:fix-nodename-aggregation
Open

Aggregate CPU and Memory accross all nodes instead of per node to fix AWS jobs#3844
ronaldngounou wants to merge 1 commit intokubernetes:masterfrom
ronaldngounou:fix-nodename-aggregation

Conversation

@ronaldngounou
Copy link
Member

@ronaldngounou ronaldngounou commented Feb 24, 2026

Overview

Contributes to issue #3843

Contributes to naming inconsistency issue with the AWS jobs, preventing LoadResources to be meaningful as it is not taking the average across all the nodes

2 possibilities were considered to fix:
Option 1: Arrange for the AWS nodes to have well-defined names
Option 2. Change the metrics gathering/display to just average over all the nodes without regard to their names

GCE tests don't have the same issue because they have a more consistent naming for the nodes whereas AWS logs the instance-id in the nodename. It makes it insconsisent to see the resource usage. Therefore, this PR averages CPU/Memory across all the nodes.

In this PR, option 2 was selected.

Updated parseResourceUsageData to take the average accross the nodes

Testing

> go test parser.go parser_test.go
ok      command-line-arguments  0.370s

Deployed perfdash on localhost:8080 after following https://github.com/kubernetes/perf-tests/blob/master/perfdash/README.md

make push
make run

[done] Confirming that CL2 generates the desired result files:

The artifacts are not containing the new resourceUsageAggregate struct added because this change is not modifying the type resourceUsagePercentiles map[string][]resourceUsages rather, it adds type usageAggregateAtPercentiles map[string]resourceUsageAggregate which is the new struct computed. I am not modifying the struct below, which contains the resources.

usage[name][percentile] = resourceUsage{cpu, memory}

However, I after running the perfdash locally, I'm able to see the aggregation across all the nodes.

Result after

URL = http://localhost:8080/#/?jobname=gce-5000Nodes&metriccategoryname=E2E&metricname=LoadResources&Resource=CPU_Average&PodName=all-nodes after the PR
image

Before

The all-nodes is more useful for AWS jobs as the namings differ.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 24, 2026
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 24, 2026
@ronaldngounou
Copy link
Member Author

/test pull-perf-tests-ec2-master-scale-performance-5000

@ronaldngounou ronaldngounou marked this pull request as draft February 24, 2026 08:49
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2026
@ronaldngounou ronaldngounou marked this pull request as ready for review February 24, 2026 08:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2026
@ronaldngounou ronaldngounou marked this pull request as draft February 24, 2026 08:50
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2026
@ronaldngounou ronaldngounou changed the title Aggregate CPU and Memory accross all nodes instead of per node Aggregate CPU and Memory accross all nodes instead of per node to fix AWS jobs Feb 24, 2026
@ronaldngounou
Copy link
Member Author

/test pull-perf-tests-ec2-master-scale-performance-100

@ronaldngounou
Copy link
Member Author

Running only ec2-master-scale-performance-100 is sufficient. We only run 5000 nodes test to debug performance issues.

@ronaldngounou
Copy link
Member Author

/assign @mengqiy

@ronaldngounou ronaldngounou marked this pull request as ready for review February 24, 2026 09:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2026
@aojea
Copy link
Member

aojea commented Feb 24, 2026

Example on what @ronaldngounou is mentioning

GCE https://perf-dash.k8s.io/#/?jobname=gce-100Nodes-master&metriccategoryname=E2E&metricname=LoadResources&PodName=kube-proxy-nodes-us-east1-c%2Fkube-proxy&Resource=CPU

AWS https://perf-dash.k8s.io/#/?jobname=aws-100Nodes&metriccategoryname=E2E&metricname=LoadResources&PodName=i-056392ead8e6be597%2Fkubelet&Resource=CPU

the result is that we can not see in aws a trend, since each execution will create different nodes.

@ronaldngounou can we make this change so it is backward compatible and instead of replacing the existing resources, we add a new "fake" resource that we call "aggregated" or "summary" or something like that?
I can not understand from the code if we can break other reporting that depend on the node names

/assign @wojtek-t @mborsz

@ronaldngounou
Copy link
Member Author

Sounds good @aojea. I will address that note and update my PR.

@ronaldngounou ronaldngounou force-pushed the fix-nodename-aggregation branch from db52e1a to bcd2372 Compare March 3, 2026 03:16
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 3, 2026
@ronaldngounou ronaldngounou reopened this Mar 3, 2026
@ronaldngounou ronaldngounou marked this pull request as draft March 3, 2026 03:19
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@ronaldngounou ronaldngounou force-pushed the fix-nodename-aggregation branch from bc81329 to a0c7cac Compare March 3, 2026 03:30
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 3, 2026
@ronaldngounou ronaldngounou marked this pull request as ready for review March 3, 2026 03:30
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@ronaldngounou
Copy link
Member Author

ronaldngounou commented Mar 3, 2026

@wojtek-t @mborsz PR is ready for review please. Thank you

@ronaldngounou ronaldngounou force-pushed the fix-nodename-aggregation branch from a0c7cac to 4d72c6d Compare March 3, 2026 04:03
@ronaldngounou
Copy link
Member Author

/test pull-perf-tests-ec2-master-scale-performance-100

@aojea
Copy link
Member

aojea commented Mar 3, 2026

/lgtm

Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2026
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea, ronaldngounou
Once this PR has been reviewed and has the lgtm label, please ask for approval from mborsz. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ronaldngounou
Copy link
Member Author

/test pull-perf-tests-ec2-master-scale-performance-100

…WS jobs

Signed-off-by: Ronald Ngounou <ronald.ngounou@yahoo.com>
@ronaldngounou ronaldngounou force-pushed the fix-nodename-aggregation branch from 4d72c6d to 2e0371a Compare March 6, 2026 05:59
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2026
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants