Collaborator

@vofish vofish commented Dec 15, 2025

Add support for running kubernetes_scale with 1 GPU on AKS Standard:

  • Update container/kubernetes_scale/kubernetes_scale.yaml.j2 to install the NVIDIA device plugin
  • Add cloud yaml_docs manifest

Command to run:

./pkb.py --cloud=Azure --benchmarks=kubernetes_scale \
--config_override='kubernetes_scale.container_cluster.vm_spec.Azure.zone="westus2"' \
--config_override='kubernetes_scale.container_cluster.type="Kubernetes"' \
--config_override=kubernetes_scale.container_cluster.max_vm_count=4 \
--config_override='kubernetes_scale.container_cluster.vm_spec.Azure.machine_type="Standard_NC8as_T4_v3"' \
--gpu_count=1 --gpu_type=t4 \
--kubernetes_scale_num_replicas=2 --kubernetes_scale_pod_cpus=4 --kubernetes_scale_pod_memory=4G \
--kubernetes_scale_report_individual_latencies=True --kubernetes_scale_report_latency_percentiles=False \
--metadata=cloud:Azure  --timeout_minutes=236
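
The overrides in the command above depend on shell quoting to pass the embedded double quotes through to pkb.py. As a minimal sketch, assuming one wanted to launch the same run from Python rather than a shell, the argument list could be assembled as below. The helper name `build_pkb_command` and its parameters are hypothetical and not part of PKB; only the flags and values are taken from the command above.

```python
import shlex


def build_pkb_command(zone="westus2", machine_type="Standard_NC8as_T4_v3",
                      max_vm_count=4, num_replicas=2):
    """Hypothetical helper: return the argv list for the run shown above."""
    prefix = "kubernetes_scale.container_cluster"
    # String-valued overrides keep their inner double quotes, as in the
    # original command; the numeric override is passed bare.
    overrides = [
        f'{prefix}.vm_spec.Azure.zone="{zone}"',
        f'{prefix}.type="Kubernetes"',
        f"{prefix}.max_vm_count={max_vm_count}",
        f'{prefix}.vm_spec.Azure.machine_type="{machine_type}"',
    ]
    cmd = ["./pkb.py", "--cloud=Azure", "--benchmarks=kubernetes_scale"]
    cmd += [f"--config_override={o}" for o in overrides]
    cmd += [
        "--gpu_count=1",
        "--gpu_type=t4",
        f"--kubernetes_scale_num_replicas={num_replicas}",
        "--kubernetes_scale_pod_cpus=4",
        "--kubernetes_scale_pod_memory=4G",
        "--kubernetes_scale_report_individual_latencies=True",
        "--kubernetes_scale_report_latency_percentiles=False",
        "--metadata=cloud:Azure",
        "--timeout_minutes=236",
    ]
    return cmd


if __name__ == "__main__":
    # shlex.join adds back the quoting a POSIX shell would need.
    print(shlex.join(build_pkb_command()))
```

Passing the list directly to `subprocess.run` would avoid the shell entirely, so the inner double quotes reach pkb.py unchanged.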


@rsgowman rsgowman left a comment


lgtm, though let's get Zach to review too.

@bvliu bvliu requested a review from hubatish January 5, 2026 15:37
Collaborator

@hubatish hubatish left a comment


Looks like the tests are failing. In the Cloud Build logs (which you probably don't have access to?) I see:
INFO 2025-12-24T16:11:52.906172833Z Step #1: ERROR: testSetStorageSizeGCP (tests.disk_iops_to_capacity_test.DiskIOPSToCapacityTest.testSetStorageSizeGCP)
....
INFO 2025-12-24T16:11:52.906179565Z Step #1: self._cpu_count = int(
INFO 2025-12-24T16:11:52.906180155Z Step #1: ^^^^
INFO 2025-12-24T16:11:52.906180893Z Step #1: TypeError: only 0-dimensional arrays can be converted to Python scalars

This doesn't look like your code. Possibly just syncing forward will fix it, since this is 2 weeks old.
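
The wording of the TypeError in the log matches what NumPy reports when `int()` is applied to an array that is not 0-dimensional. The log does not show which value was passed, so the following is only a sketch under that assumption. `to_python_int` is a hypothetical helper, not code from PKB or from this PR, and it relies on the `.item()` method that NumPy arrays and scalars provide.

```python
def to_python_int(value):
    """Hypothetical helper: convert a scalar or one-element array to int."""
    item = getattr(value, "item", None)
    if callable(item):
        # NumPy arrays and scalars expose .item(), which returns a plain
        # Python scalar when the array holds exactly one element and
        # raises otherwise.
        value = item()
    return int(value)
```

Plain Python numbers and numeric strings have no `.item()` method, so they fall through to `int()` unchanged.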

vofish and others added 22 commits January 7, 2026 16:05
This is no longer used and causes the test to fail as of
GoogleCloudPlatform#6291

PiperOrigin-RevId: 844830523
model-garden CLI service is properly live now.

PiperOrigin-RevId: 844840901
See GoogleCloudPlatform#6272. While
refactoring the Run() method, the KubernetesMetricCollector was omitted,
causing HPA's KMC to be run during VPA's test. This caused the primary metrics
of the test to not be collected.

PiperOrigin-RevId: 844844745
PiperOrigin-RevId: 845429012
…the existing `RemoteCommand`.

PiperOrigin-RevId: 845429895
PiperOrigin-RevId: 845833015
p3rf Team and others added 24 commits January 7, 2026 16:30
Preparation for refactoring this file.

PiperOrigin-RevId: 845900151
Also fix resulting pytype errors that this exposed.

PiperOrigin-RevId: 846252648
1. Always show the kernel command line
2. Show if RT kernel was enabled

PiperOrigin-RevId: 846435753
PiperOrigin-RevId: 846453023
PiperOrigin-RevId: 846466524
…re SQL databases.

PiperOrigin-RevId: 846591475
PiperOrigin-RevId: 846606398
PiperOrigin-RevId: 846749098
PiperOrigin-RevId: 847010068
…se for Redis Memtier Benchmark

PiperOrigin-RevId: 851458383
…on to include trace flag 9944.

PiperOrigin-RevId: 852305770
PiperOrigin-RevId: 852383875
…tial source issue from the startup script service

PiperOrigin-RevId: 852450925