Changes from 2 commits
52 commits
0402b02
Add Azure 1 GPU kubernetes_scale
vofish Dec 15, 2025
a5ca221
Add newline
vofish Dec 15, 2025
0914882
Address comments
vofish Dec 24, 2025
9fc7703
Remove unnecessary line
vofish Dec 24, 2025
113874f
Remove unnecessary line
vofish Dec 24, 2025
6173672
Put NVIDIA plugin yaml to data/container/azure folder
vofish Jan 7, 2026
dbe0c2b
Add precommit pyink formatter
vofish Jul 25, 2025
7532b2c
Add newline
vofish Jul 25, 2025
d84a68d
Update rev., add args
vofish Jul 25, 2025
6baa2f7
Update pre-commit hook args
vofish Jul 29, 2025
9ae39a3
Update the Postgres sysbench configuration logic.
Dec 15, 2025
b957e2b
Remove the node selector for (new) vpa test
Dec 15, 2025
b2eb291
Fix Vertex AI DNS endpoint & remove beta
hubatish Dec 15, 2025
be9f67a
Fix to rampup test to ensure correct metric collector runs
Dec 15, 2025
6909b40
Add Azure support to ai-inference
vofish Nov 27, 2025
843eb01
Address comments
vofish Dec 2, 2025
21868f1
Fix tests
vofish Dec 4, 2025
c1ea5bb
Fix tests
vofish Dec 4, 2025
ecf6162
Add _ProvisionGPUNodePool method
vofish Dec 5, 2025
bc2f146
Revert blob-public-access
vofish Dec 10, 2025
f151363
Fix linting issues
vofish Dec 12, 2025
85b9239
Fix linting issues
vofish Dec 12, 2025
5868fe4
Support triggering migration multiple times.
raymond13513 Dec 16, 2025
01373cf
Implement backup restore for DSQL
ScottLinnn Dec 16, 2025
16df725
Update base Windows mixin to offer `RemoteCommandWithReturnCode` and …
jacklacey11 Dec 16, 2025
e719647
Run llama4 16e-instruct rather than 16e (without).
hubatish Dec 17, 2025
2d49929
Add '.j2' to Azure blobfuse2 config file.
andyz422 Dec 17, 2025
589836c
Move container_service into it's own directory/module
Dec 17, 2025
6489719
Refactor kubernetes items out of container_service/__init__.py
Dec 17, 2025
85464b9
Refactor base items out of container_service/__init__.py
Dec 17, 2025
b4f887a
Default capture_live_migration_timestamps to true.
raymond13513 Dec 17, 2025
37d046e
Refactor BUILD file to create container_service library
Dec 18, 2025
305f713
Move yaml code from container_service -> vm_util
hubatish Dec 18, 2025
5239ccb
Update Linux VM metadata to:
pmkc Dec 18, 2025
a90ede8
Allow configuring worker count in Trino
hubatish Dec 19, 2025
b22e16d
Support gs:// URLs in --ycsb_tar_url
Dec 19, 2025
eeb5acf
Add support for specifying storage type, IOPS, and throughput for Azu…
bvliu Dec 19, 2025
c5bbbd9
Refactor relational DB metrics collection.
bvliu Dec 19, 2025
84285f1
Add Azure Flexible Server metrics implementation.
bvliu Dec 19, 2025
86f7c04
Fix sysbench sleep duration.
bvliu Dec 19, 2025
0f93c5c
Added GCE SQL Server 2025 images to PKB
Dec 20, 2025
a0748fc
Remove unused disk_iops_to_capacity module.
jellyfishcake Jan 2, 2026
536e4ca
Add the option to enable `--redis_aof_verify` at the end of `Run` pha…
Jan 2, 2026
f5a8ccc
Add support for MSSQL 2025 on Linux and update SQL Server configurati…
Jan 5, 2026
429e3b9
Pass region to aws command
ScottLinnn Jan 5, 2026
002e015
Support exporting metrics for multiple triggers for disruption.
raymond13513 Jan 5, 2026
75a3d7c
Reclassify timeouts on startup script retrieval to indicate the poten…
jacklacey11 Jan 5, 2026
a970e51
Allow configurable compaction strategy for cassandra
Arushi-07 Jan 6, 2026
ee7dfd7
Fix Create test
vofish Jan 7, 2026
5ba0a44
Merge branch 'master' into azure-matrix-1gpu
vofish Jan 8, 2026
38ec118
Remove pre-commit hook
vofish Jan 8, 2026
6ced164
Apply linting
vofish Jan 9, 2026
@@ -53,3 +53,47 @@ spec:
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: {{ PodTimeout }}
---
{%- if NvidiaGpuRequest and cloud == "Azure" %}
# According to the official Microsoft documentation, the NVIDIA device plugin
# must be deployed as a DaemonSet to enable GPU support in the Kubernetes cluster.
# Reference: https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu?tabs=add-ubuntu-gpu-node-pool#nvidia-device-plugin-installation
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: 'nvidia.com/gpu'
        operator: Exists
        effect: NoSchedule
      priorityClassName: 'system-node-critical'
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.18.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: 'false'
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ['ALL']
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
{%- endif %}
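
Note on the guard above: the DaemonSet is emitted only when `NvidiaGpuRequest` is truthy and `cloud` equals "Azure", so the template needs a `cloud` value in its render context. The sketch below mirrors that condition in plain Python to make the cases explicit. It is an illustration only, not code from this PR: the function name `should_deploy_nvidia_plugin` and the dict-based context are hypothetical, and it does not reproduce how Jinja2 treats an undefined variable (depending on the environment's configuration, a missing `cloud` could either make the condition false or raise an error).

```python
from typing import Any, Mapping


def should_deploy_nvidia_plugin(context: Mapping[str, Any]) -> bool:
  """Plain-Python mirror of: NvidiaGpuRequest and cloud == "Azure".

  Hypothetical helper for illustration; a missing key is treated as falsy
  here, which is an assumption and not necessarily what the template does.
  """
  gpu_requested = bool(context.get('NvidiaGpuRequest'))
  return gpu_requested and context.get('cloud') == 'Azure'


if __name__ == '__main__':
  cases = [
      {'NvidiaGpuRequest': '1', 'cloud': 'Azure'},  # GPU pods on Azure
      {'NvidiaGpuRequest': '1', 'cloud': 'GCP'},    # GPU pods elsewhere
      {'NvidiaGpuRequest': None, 'cloud': 'Azure'},  # no GPU requested
      {'NvidiaGpuRequest': '1'},                    # cloud not passed
  ]
  for case in cases:
    print(case, '->', should_deploy_nvidia_plugin(case))
```

Only the first case includes the DaemonSet; the last case shows why the render call has to supply `cloud` for the Azure branch to be reachable at all.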
@@ -192,6 +192,7 @@ def ScaleUpPods(
      EphemeralStorageRequest='10Mi',
      RolloutTimeout=max_wait_time,
      PodTimeout=resource_timeout,
      cloud=FLAGS.cloud,
  )
  cluster.ModifyPodSpecPlacementYaml(
      yaml_docs,