Enable kubernetes_node_scale benchmark (up to 5k nodes) on AWS EKS with Karpenter #6512
kiryl-filatau wants to merge 17 commits into GoogleCloudPlatform:master
Conversation
# Output can be quite large, so we'll conditionally suppress it.
['get', resource_type, '-o', 'json'],
timeout=60 * 5,  # 5 minutes for large clusters (e.g. 1000 pods)
suppress_logging=NUM_PODS.value > 20,
def _PostCreate(self):
  """Performs post-creation steps for the cluster."""
  super()._PostCreate()
  # Karpenter controller resources: default 1/1Gi; scale up when node_scale target is set.
Can we just not specify anything & let Karpenter decide? Or is this indeed necessary? It seems clever but a little annoying / bad user experience by Karpenter.
These are the resources for the Karpenter controller pod (the node where Karpenter itself runs). Karpenter doesn't manage that node, so it can't "decide" these values; we have to set them ourselves. For runs with ~10 nodes, 1/1Gi is sufficient; we only increase them when node_scale is 500+ or 1000+.
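The sizing logic described above could be sketched roughly like this. The thresholds match the comment (500+, 1000+), but the scaled-up CPU/memory values are illustrative assumptions, not necessarily the exact ones in the PR:

```python
# Sketch: pick Karpenter controller pod resources from the node-scale target.
# The 500/1000 thresholds follow the discussion; the scaled-up values
# (2/4Gi, 4/8Gi) are assumed for illustration.
def controller_resources(node_target: int) -> dict:
  """Returns CPU/memory requests and limits for the Karpenter controller pod."""
  if node_target >= 1000:
    cpu, memory = '4', '8Gi'
  elif node_target >= 500:
    cpu, memory = '2', '4Gi'
  else:
    cpu, memory = '1', '1Gi'  # default, sufficient for ~10-node runs
  return {
      'requests': {'cpu': cpu, 'memory': memory},
      'limits': {'cpu': cpu, 'memory': memory},
  }
```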
'v'
+ full_version.strip().strip('"').split(f'{self.cluster_version}-v')[1]
)
# NodePool CPU limit: scale with benchmark target (nodes * 2 + 5%), min 1000.
Does the machine type matter here as well? If I am using a larger machine type, do I need to also set a larger CPU limit? This again seems a little annoying to have to set manually (but maybe makes sense given Karpenter can be machine-type agnostic).
Makes sense to include machine type adjustment, I’ll think about how to cover it.
Thanks.
I added the eks_karpenter_limits_vcpu_per_node flag so the Karpenter NodePool CPU limit can be tuned when nodes use more than 2 vCPUs. The default remains 2 (same behavior as before).
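The resulting limit computation can be sketched as follows. `nodepool_cpu_limit` is a hypothetical helper name, with the flag value passed in as `vcpus_per_node`; the formula (5% headroom, floor of 1000) comes from the diff comment above:

```python
import math

# Sketch of the NodePool CPU limit: nodes * vCPUs-per-node with 5% headroom,
# never below 1000. vcpus_per_node defaults to 2, matching the flag default.
def nodepool_cpu_limit(num_nodes: int, vcpus_per_node: int = 2) -> int:
  return max(1000, math.ceil(num_nodes * vcpus_per_node * 1.05))
```

With the default, 10 nodes yields the 1000 floor and 5000 nodes yields 10500, matching the examples in the PR summary; setting the flag to 4 doubles the large-run limit.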
suppress_failure=lambda stdout, stderr, retcode: (
    'no matching resources found' in stderr.lower()
    or 'timed out' in stderr.lower()
    or 'context deadline exceeded' in stderr.lower()
These look very similar to the RETRYABLE_KUBECTL_ERRORS list:
Just use kubectl.RunRetryableKubectlCommand instead and get these for free. If that code is missing some of these (like 'timed out'), then consider adding them. It looks like suppress_failure is supported too, so you can mix both - which would probably be good for 'no matching resources found', as that sounds like a wait/this-command-specific error message to ignore.
@hubatish
Updated: EKS cleanup now uses RunRetryableKubectlCommand with suppress_failure only for "no resources found" style messages; the retryable list is extended and matching is case-insensitive. Please check.
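A minimal sketch of such a predicate, assuming the `(stdout, stderr, retcode)` signature shown in the diff above. Retryable errors like timeouts are left to the retry wrapper; only "missing resource" messages are suppressed, matched case-insensitively:

```python
# Sketch: suppress only "no resources found" style failures; leave retryable
# errors (timeouts, context deadline exceeded) to the kubectl retry wrapper.
_IGNORABLE_MESSAGES = ('no matching resources found', 'no resources found')

def suppress_missing_resources(stdout: str, stderr: str, retcode: int) -> bool:
  if retcode == 0:
    return False  # command succeeded; nothing to suppress
  err = stderr.lower()  # case-insensitive matching
  return any(msg in err for msg in _IGNORABLE_MESSAGES)
```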
),
)
max_retries = 5
backoff_seconds = 10
While this backoff logic looks pretty reasonable, prefer reusing the backoff logic in vm_util.Retry, which means moving this code to a subfunction and adding said decorator.
I have updated the code, please take a look.
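For reference, the suggested shape could look like the sketch below: a minimal exponential-backoff decorator in the spirit of vm_util.Retry (not PKB's actual implementation), applied to a subfunction that holds the previously inline retry loop:

```python
import functools
import time

# Sketch: a minimal exponential-backoff retry decorator, standing in for
# vm_util.Retry. The decorated subfunction replaces an inline retry loop.
def retry(max_retries=5, backoff_seconds=10, exceptions=(Exception,)):
  def decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
      delay = backoff_seconds
      for attempt in range(max_retries):
        try:
          return func(*args, **kwargs)
        except exceptions:
          if attempt == max_retries - 1:
            raise  # out of retries; propagate the last error
          time.sleep(delay)
          delay *= 2  # exponential backoff
    return wrapper
  return decorator
```

The cleanup step (e.g. deleting an orphan ENI) then becomes a small decorated subfunction instead of carrying its own max_retries/backoff_seconds loop.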
| """Stop watching the cluster for node add/remove events.""" | ||
| polled_events = self._cluster.GetEvents() | ||
|
|
||
| # Resolve machine type only for current nodes; use "unknown" for the rest. |
Oh, this makes sense. Was this causing the cluster to take a long time querying everything?
Yep, it was the main reason.
if name in _current_node_names:
  machine_type = _GetMachineTypeFromNodeName(self._cluster, name)
else:
  machine_type = "unknown"
Something around here is probably what is causing the TypeError.
So here you use "unknown". I wonder if a randomly chosen different machine's type would be better instead; likely in a big scaling scenario they'll all use the same one.
It will be "unknown" anyway: by the time the info is gathered, the nodes from scaleUP1 were already removed.
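The single-pass lookup discussed above could be sketched like this, assuming `current_nodes` is a name-to-machine-type mapping built from one `kubectl get nodes` call (function names here are illustrative, not the PR's):

```python
# Sketch: resolve machine types with one list-nodes pass. Nodes that were
# already removed (e.g. by the first scale-down) fall back to "unknown"
# instead of triggering a per-node kubectl call.
def machine_types_for_events(event_node_names, current_nodes):
  """current_nodes: {node_name: machine_type} from a single kubectl call."""
  return {
      name: current_nodes.get(name, 'unknown')
      for name in event_node_names
  }
```

On a 5k-node run this replaces thousands of per-node lookups with a single cluster query plus dictionary reads.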
)
if k8s_cluster is None:
  if not isinstance(
      benchmark_spec.container_cluster, kubernetes_cluster.KubernetesCluster
I was gonna say swap to raise instead of return, but the return if None seems quite reasonable.
'Default value - do not install unless explicitly requested',
)
flags.DEFINE_integer(
    'eks_karpenter_limits_vcpu_per_node',
Use flagholder: https://absl.readthedocs.io/en/latest/absl.flags.html#absl.flags.FlagHolder
Also not sure if this is generally the right spot for these. Ideally both should probably go in config_overrides, with this one maybe setting CPU size from vm_spec and the other coming in a follow-up CL.
Summary
Enables running the kubernetes_node_scale benchmark (0→5k→0→5k nodes) on AWS EKS with Karpenter. The benchmark scales a deployment with pod anti-affinity, measures scale-up, scale-down, and a second scale-up, then tears down the cluster.
Main changes
- kubernetes_node_scale benchmark — template and scaling logic (scale up, scale down, phases), metrics collection, and timeouts tuned for large runs.
- EKS + Karpenter — NodePool template (instance types including t, CPU limit derived from the scale target), EKS/Karpenter cluster lifecycle and cleanup.
- Karpenter scaling by node count — NodePool CPU limit is computed from kubernetes_scale_num_nodes: max(1000, ceil(nodes × 2 × 1.05)) (e.g. 10 nodes → 1000, 5k → 10500). Controller pod resources scale with the same flag. One configuration works for both small and 5k-node runs.
- Teardown robustness — orphan ENI deletion in _CleanupKarpenter: retry with backoff on AWS throttling (RequestLimitExceeded), treat "ENI not found" as success; uses suppress_failure for these cases.
- Tracker — single get nodes pass in _StopWatchingForNodeChanges; resolve machine type only for current nodes and use "unknown" for the rest, to avoid thousands of kubectl calls on 5k-node runs.
- Tests — kubernetes_scale_benchmark_test mocks updated to return valid kubectl -o json output ({"items": [...]}) so tests pass after GetStatusConditionsForResourceType was switched from jsonpath to full JSON.
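The mock shape described in the Tests bullet can be sketched with unittest.mock. The `(stdout, stderr, retcode)` return shape and the mocked call name are assumptions for illustration; the point is that the mocked stdout must now be a full JSON document with an "items" list, not a jsonpath string:

```python
import json
from unittest import mock

# Sketch: a kubectl mock whose stdout is full `-o json` output, as the
# switched GetStatusConditionsForResourceType implementation expects.
fake_stdout = json.dumps({'items': [
    {'status': {'conditions': [{'type': 'Ready', 'status': 'True'}]}},
]})
run_kubectl = mock.Mock(return_value=(fake_stdout, '', 0))

# The code under test parses the whole document rather than a jsonpath result.
stdout, _, _ = run_kubectl(['get', 'nodes', '-o', 'json'])
parsed = json.loads(stdout)
```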