
Features to allow async drift / budgets (eg for Azure) #1861

Open
charliedmcb opened this issue Dec 3, 2024 · 8 comments
Labels
area/provider/azure: Issues or PRs related to azure provider
kind/feature: Categorizes issue or PR as related to a new feature.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@charliedmcb

charliedmcb commented Dec 3, 2024

Description

Note:

This is a known BUG with a known cause. It applies not only to Drift but to any other form of Disruption that provisions replacement nodes. Fixing it requires changes in both kubernetes-sigs/karpenter and the Azure/karpenter-provider-azure repo, so a BUG is open in each (the Azure/karpenter-provider-azure side is tracked in 600)
 
Observed Behavior:
Karpenter only provisions and replaces one Drifted node at a time, even with a Budget of "5".

Had all 10 nodeclaims marked as Drifted:

kubectl get nodeclaims -o go-template='{{range $item := .items}}{{with $nodeclaimname := $item.metadata.name}}{{range $condition := $item.status.conditions}}{{if and (eq $condition.type "Drifted") (eq $condition.status "True")}}{{printf "%s\n" $nodeclaimname}}{{end}}{{end}}{{end}}{{end}}'
general-purpose-4d8rn
general-purpose-brt2r
general-purpose-l9pv6
general-purpose-n6xv7
general-purpose-s9rnn
general-purpose-v76nm
general-purpose-vrnbl
general-purpose-vt4mv
general-purpose-wkk62
general-purpose-x6b47
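For context on the expected math: with 10 drifted nodes and a budget of "5", up to five nodes should be disruptable concurrently. The helper below is a hypothetical sketch of how budget values (absolute or percentage) translate into allowed concurrent disruptions; it is not Karpenter's actual implementation.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// allowedDisruptions computes how many more nodes a budget permits
// disrupting, given the total node count and the number already being
// disrupted. Budgets may be absolute ("5") or percentages ("20%").
// Hypothetical sketch, not Karpenter's real code.
func allowedDisruptions(budget string, total, disrupting int) int {
	var allowed int
	if strings.HasSuffix(budget, "%") {
		pct, _ := strconv.Atoi(strings.TrimSuffix(budget, "%"))
		allowed = int(math.Floor(float64(total) * float64(pct) / 100.0))
	} else {
		allowed, _ = strconv.Atoi(budget)
	}
	if remaining := allowed - disrupting; remaining > 0 {
		return remaining
	}
	return 0
}

func main() {
	// 10 drifted nodes, budget "5", one replacement already in flight:
	// the budget should still permit 4 more concurrent disruptions.
	fmt.Println(allowedDisruptions("5", 10, 1)) // prints 4
}
```

By this math, the observed behavior (one replacement at a time) leaves the budget mostly unused.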

Even after more than a minute, only one new nodeclaim was being created:

kubectl get karpenter
NAME                                       AGE
aksnodeclass.karpenter.azure.com/default   99m

NAME                                           TYPE               CAPACITY    ZONE        NODE                        READY     AGE
nodeclaim.karpenter.sh/general-purpose-4d8rn   Standard_D2ls_v5   on-demand   westus2-3   aks-general-purpose-4d8rn   True      78m
nodeclaim.karpenter.sh/general-purpose-brt2r   Standard_D2ls_v5   on-demand   westus2-1   aks-general-purpose-brt2r   True      70m
nodeclaim.karpenter.sh/general-purpose-l9pv6   Standard_D2ls_v5   on-demand   westus2-3   aks-general-purpose-l9pv6   True      52m
nodeclaim.karpenter.sh/general-purpose-n6xv7   Standard_D2as_v5   on-demand   westus2-1   aks-general-purpose-n6xv7   True      85m
nodeclaim.karpenter.sh/general-purpose-s9rnn   Standard_D2as_v5   on-demand   westus2-2   aks-general-purpose-s9rnn   True      48m
nodeclaim.karpenter.sh/general-purpose-v76nm   Standard_D2ls_v5   on-demand   westus2-1   aks-general-purpose-v76nm   True      68m
nodeclaim.karpenter.sh/general-purpose-vrnbl   Standard_D2as_v5   on-demand   westus2-3   aks-general-purpose-vrnbl   True      82m
nodeclaim.karpenter.sh/general-purpose-vt4mv   Standard_D2ls_v5   on-demand   westus2-3   aks-general-purpose-vt4mv   True      76m
nodeclaim.karpenter.sh/general-purpose-wkk62   Standard_D2ls_v5   on-demand   westus2-1   aks-general-purpose-wkk62   True      67m
nodeclaim.karpenter.sh/general-purpose-x6b47   Standard_D2ls_v5   on-demand   westus2-2   aks-general-purpose-x6b47   True      50m
nodeclaim.karpenter.sh/general-purpose-xf2ss                                                                          Unknown   82s

NAME                                    NODECLASS   NODES   READY   AGE
nodepool.karpenter.sh/general-purpose   default     10      True    98m

And only one node tainted as disrupted:

kubectl get nodes -o go-template='{{range $item := .items}}{{with $nodename := $item.metadata.name}}{{range $taint := $item.spec.taints}}{{if and (eq $taint.key "karpenter.sh/disrupted") (eq $taint.effect "NoSchedule")}}{{printf "%s\n" $nodename}}{{end}}{{end}}{{end}}{{end}}'
aks-general-purpose-4d8rn

Looking at the logs, we can see that the disruption controller is stalled waiting on cluster sync:

{
  "level": "DEBUG",
  "time": "2024-12-03T00:20:35.932Z",
  "logger": "controller",
  "caller": "singleton/controller.go:26",
  "message": "waiting on cluster sync",
  "commit": "f1b33a7",
  "controller": "disruption",
  "namespace": "",
  "name": "",
  "reconcileID": "c11d69f3-97eb-4677-8c48-54bccc32a1bf"
}

This is due to the interaction pattern between kubernetes-sigs/karpenter and Azure/karpenter-provider-azure, which stalls NodeClaim creation at the CloudProvider Create call:

created, err := l.cloudProvider.Create(ctx, nodeClaim)

Internally, the Azure provider performs a few LROs (long-running operations) as part of this creation call, polling Azure to ensure the resources were created correctly and reporting back any errors. If this polling were skipped, certain types of node-creation errors would be missed, and those failures would have to wait for the 15-minute registrationTTL to be detected, which is unacceptable in these cases.

Expected Behavior:
Karpenter should asynchronously provision a number of nodes equal to the Budget for Drift (in this case 5), replacing the Drifted nodes in accordance with the Budget.

Reproduction Steps (Please include YAML):

Deployed a dev version of Azure Karpenter from the HEAD of main using make az-all:
- de7dee7

Ran make az-taintsystemnodes to ensure workload pods would only land on Karpenter nodes.

Updated and applied examples/v1/general-purpose.yaml, setting karpenter.azure.com/sku-cpu to less than 3, budgets.nodes to "5", and consolidateAfter to Never:

# This example NodePool will provision general purpose instances
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
  annotations:
    kubernetes.io/description: "General purpose NodePool"
spec:
  disruption:
    consolidateAfter: Never
    budgets:
    - nodes: "5"
  # Optional: Uncomment if you want to put a cap on the max resources available for provisioning
  # limits:
  #   cpu: "30"
  template:
    metadata:
      labels:
        # required for Karpenter to predict overhead from cilium DaemonSet
        kubernetes.azure.com/ebpf-dataplane: cilium
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      startupTaints:
      # https://karpenter.sh/docs/concepts/nodepools/#cilium-startup-taint
      - key: node.cilium.io/agent-not-ready
        effect: NoExecute
        value: "true"
      expireAfter: Never
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: karpenter.azure.com/sku-family
        operator: In
        values: [D]
      # Optional: Uncomment if you want to add a restriction on max sku cpus.
      #           Useful for ensuring karpenter provisions multiple nodes for feature testing.
      - key: karpenter.azure.com/sku-cpu
        operator: Lt
        values: ["3"]
---
apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: default
  annotations:
    kubernetes.io/description: "General purpose AKSNodeClass for running Ubuntu2204 nodes"
spec:
  imageFamily: Ubuntu2204

Updated and applied examples/workloads/inflate.yaml, setting replicas to 10 and CPU requests to 1.1:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
      - image: mcr.microsoft.com/oss/kubernetes/pause:3.6
        name: inflate
        resources:
          requests:
            cpu: "1.1"
            memory: 256M

Updated imageFamily to AzureLinux in the AKSNodeClass to trigger Drift.

Versions:

  • Chart Version: Using Dev Build of Azure provider on HEAD of main with commit d84e7c8
    • sigs.k8s.io/karpenter: v1.0.5
  • Kubernetes Version (kubectl version): v1.29.9
@charliedmcb charliedmcb added the kind/bug Categorizes issue or PR as related to a bug. label Dec 3, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 3, 2024
@charliedmcb charliedmcb changed the title BUG: Budgets Don't allow for async Drift in combination with Azure. BUG: Budgets don't allow for async Drift in combination with Azure. Dec 3, 2024
@sftim

sftim commented Dec 10, 2024

For Karpenter core, is the ask here more of a feature request? (I don't see a bug in the core library).

@jonathan-innis
Member

+1 @sftim This strikes me as a feature request for the upstream library as well. @Bryce-Soghigian @tallaxes Perhaps y'all have some more thoughts on the interactions here between the upstream library and the Azure provider?

@jonathan-innis
Member

If this Polling is skipped, there would be an issue of certain types of errors on node creation being missed, and thus certain errors would have to wait for the registrationTTL of 15 min

Under what circumstances does this kind of issue occur in Azure? IMO, this may really just be a feature request in the Azure provider, if we can handle it there. FWIW, AWS doesn't have the same problem because it returns the response fairly quickly and gets us out of this loop.

@jonathan-innis
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2024
@jonathan-innis
Member

/area provider/azure

@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Dec 10, 2024
@sftim

sftim commented Dec 10, 2024

/remove-kind bug
/kind feature
/retitle Features to allow async drift / budgets (eg for Azure)

@k8s-ci-robot k8s-ci-robot changed the title BUG: Budgets don't allow for async Drift in combination with Azure. Features to allow async drift / budgets (eg for Azure) Dec 10, 2024
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 10, 2024
@charliedmcb
Author

@sftim the reason I filed this as a BUG instead of a feature request is that, from the design docs for Budgets, I assumed the feature would be compatible with Azure, but with the given implementation it doesn't work with the Azure provider. I tried finding a workaround on the Azure side; however, I haven't found a way to integrate the provider with the upstream implementation of this feature. I'm fine with this being labeled as a feature request, though, if that aligns better.

@charliedmcb
Author

charliedmcb commented Dec 10, 2024

@jonathan-innis, I don't think this is purely on the Azure side. I'm aware of the difference here between AWS's fast response and Azure's LRO. I did some testing on removing the polling from the Azure LRO: we get a fast response in that case, and consequently async drift/budgets work correctly. However, we then miss out on gracefully handling certain types of errors, as I mentioned. I can collect a list of some of those error types, though it's a little tricky to get a complete list, and other members of the Azure team have told me it would be good to handle them.

If we were to handle this adjustment jointly between upstream and the Azure provider, I could see something like:

  1. Azure shortcuts its polling to return quickly.
  2. The controller that currently watches for the registrationTTL could also watch for a Failure status on the NodeClaim, which the cloud provider (or another source) could set when it knows the node isn't going to provision.
  3. Azure would do its polling in the background and set that failure status, removing the NodeClaim early if it detected one of these issues.

We could also tackle this from the disruption side, triggering replacements in batches or in a slightly different way. That approach is currently blocked because the ProviderID is not returned and set on the NodeClaim when attempting to disrupt more nodes.
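The three-step proposal above could be sketched roughly as follows. To be clear, all names here (NodeClaim struct, "LaunchFailed" condition, createAsync, simulateFailedLaunch) are hypothetical stand-ins, not real Karpenter APIs:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal stand-in for a NodeClaim with status conditions.
type NodeClaim struct {
	mu         sync.Mutex
	Name       string
	Conditions map[string]bool
}

func (nc *NodeClaim) setCondition(t string) {
	nc.mu.Lock()
	defer nc.mu.Unlock()
	nc.Conditions[t] = true
}

func (nc *NodeClaim) hasCondition(t string) bool {
	nc.mu.Lock()
	defer nc.mu.Unlock()
	return nc.Conditions[t]
}

// createAsync models step 1 and step 3: it returns as soon as the
// instance request is accepted, then keeps polling the LRO in the
// background, marking the NodeClaim failed if the LRO errors.
func createAsync(nc *NodeClaim, pollLRO func() error, done chan<- struct{}) {
	go func() {
		defer close(done)
		if err := pollLRO(); err != nil {
			nc.setCondition("LaunchFailed")
		}
	}()
	// returns immediately; the disruption loop is no longer blocked
}

// simulateFailedLaunch wires the pieces together with an LRO that
// reports a (simulated) quota error.
func simulateFailedLaunch() *NodeClaim {
	nc := &NodeClaim{Name: "general-purpose-xxxxx", Conditions: map[string]bool{}}
	done := make(chan struct{})
	createAsync(nc, func() error {
		time.Sleep(10 * time.Millisecond) // simulated LRO latency
		return fmt.Errorf("simulated quota error")
	}, done)
	<-done // in reality, step 2's controller would watch for the condition
	return nc
}

func main() {
	nc := simulateFailedLaunch()
	if nc.hasCondition("LaunchFailed") {
		fmt.Println("NodeClaim can be removed early instead of waiting 15m for registrationTTL")
	}
}
```

The key property is that Create no longer blocks the disruption loop, while launch failures still surface well before the registrationTTL.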
