
Features to allow async drift / budgets (eg for Azure) #1861

Open
charliedmcb opened this issue Dec 3, 2024 · 8 comments
Labels
area/provider/azure: Issues or PRs related to azure provider
kind/feature: Categorizes issue or PR as related to a new feature.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@charliedmcb

charliedmcb commented Dec 3, 2024

Description

Note:

This is a known BUG with a known cause. It applies not only to Drift but to any other form of Disruption that provisions replacement nodes. Fixing it requires changes in both kubernetes-sigs/karpenter and the Azure/karpenter-provider-azure repo, so a BUG is open in each (the Azure/karpenter-provider-azure side is tracked in 600)
 
Observed Behavior:
Karpenter only provisions and replaces one Drifted node at a time, even with a Budget of "5".

Had all 10 nodeclaims marked as Drifted:

kubectl get nodeclaims -o go-template='{{range $item := .items}}{{with $nodeclaimname := $item.metadata.name}}{{range $condition := $item.status.conditions}}{{if and (eq $condition.type "Drifted") (eq $condition.status "True")}}{{printf "%s\n" $nodeclaimname}}{{end}}{{end}}{{end}}{{end}}'
general-purpose-4d8rn
general-purpose-brt2r
general-purpose-l9pv6
general-purpose-n6xv7
general-purpose-s9rnn
general-purpose-v76nm
general-purpose-vrnbl
general-purpose-vt4mv
general-purpose-wkk62
general-purpose-x6b47
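For context on the expected math: with 10 drifted nodes and a budget of "5", up to five nodes should be disruptable concurrently. The helper below is a hypothetical sketch of how budget values (absolute or percentage) translate into allowed concurrent disruptions; it is not Karpenter's actual implementation.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// allowedDisruptions computes how many more nodes a budget permits
// disrupting, given the total node count and the number already being
// disrupted. Budgets may be absolute ("5") or percentages ("20%").
// Hypothetical sketch, not Karpenter's real code.
func allowedDisruptions(budget string, total, disrupting int) int {
	var allowed int
	if strings.HasSuffix(budget, "%") {
		pct, _ := strconv.Atoi(strings.TrimSuffix(budget, "%"))
		allowed = int(math.Floor(float64(total) * float64(pct) / 100.0))
	} else {
		allowed, _ = strconv.Atoi(budget)
	}
	if remaining := allowed - disrupting; remaining > 0 {
		return remaining
	}
	return 0
}

func main() {
	// 10 drifted nodes, budget "5", one replacement already in flight:
	// the budget should still permit 4 more concurrent disruptions.
	fmt.Println(allowedDisruptions("5", 10, 1)) // prints 4
}
```

By this math, the observed behavior (one replacement at a time) leaves the budget mostly unused.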

Even after more than a minute, only one new nodeclaim was being created:

kubectl get karpenter
NAME                                       AGE
aksnodeclass.karpenter.azure.com/default   99m

NAME                                           TYPE               CAPACITY    ZONE        NODE                        READY     AGE
nodeclaim.karpenter.sh/general-purpose-4d8rn   Standard_D2ls_v5   on-demand   westus2-3   aks-general-purpose-4d8rn   True      78m
nodeclaim.karpenter.sh/general-purpose-brt2r   Standard_D2ls_v5   on-demand   westus2-1   aks-general-purpose-brt2r   True      70m
nodeclaim.karpenter.sh/general-purpose-l9pv6   Standard_D2ls_v5   on-demand   westus2-3   aks-general-purpose-l9pv6   True      52m
nodeclaim.karpenter.sh/general-purpose-n6xv7   Standard_D2as_v5   on-demand   westus2-1   aks-general-purpose-n6xv7   True      85m
nodeclaim.karpenter.sh/general-purpose-s9rnn   Standard_D2as_v5   on-demand   westus2-2   aks-general-purpose-s9rnn   True      48m
nodeclaim.karpenter.sh/general-purpose-v76nm   Standard_D2ls_v5   on-demand   westus2-1   aks-general-purpose-v76nm   True      68m
nodeclaim.karpenter.sh/general-purpose-vrnbl   Standard_D2as_v5   on-demand   westus2-3   aks-general-purpose-vrnbl   True      82m
nodeclaim.karpenter.sh/general-purpose-vt4mv   Standard_D2ls_v5   on-demand   westus2-3   aks-general-purpose-vt4mv   True      76m
nodeclaim.karpenter.sh/general-purpose-wkk62   Standard_D2ls_v5   on-demand   westus2-1   aks-general-purpose-wkk62   True      67m
nodeclaim.karpenter.sh/general-purpose-x6b47   Standard_D2ls_v5   on-demand   westus2-2   aks-general-purpose-x6b47   True      50m
nodeclaim.karpenter.sh/general-purpose-xf2ss                                                                          Unknown   82s

NAME                                    NODECLASS   NODES   READY   AGE
nodepool.karpenter.sh/general-purpose   default     10      True    98m

And only one node tainted as disrupted:

kubectl get nodes -o go-template='{{range $item := .items}}{{with $nodename := $item.metadata.name}}{{range $taint := $item.spec.taints}}{{if and (eq $taint.key "karpenter.sh/disrupted") (eq $taint.effect "NoSchedule")}}{{printf "%s\n" $nodename}}{{end}}{{end}}{{end}}{{end}}'
aks-general-purpose-4d8rn

Looking at the logs, we can see that the disruption controller is stalled waiting on cluster sync:

{
  "level": "DEBUG",
  "time": "2024-12-03T00:20:35.932Z",
  "logger": "controller",
  "caller": "singleton/controller.go:26",
  "message": "waiting on cluster sync",
  "commit": "f1b33a7",
  "controller": "disruption",
  "namespace": "",
  "name": "",
  "reconcileID": "c11d69f3-97eb-4677-8c48-54bccc32a1bf"
}

This is due to the interaction pattern between kubernetes-sigs/karpenter and Azure/karpenter-provider-azure, which stalls NodeClaim creation at the CloudProvider Create call:

created, err := l.cloudProvider.Create(ctx, nodeClaim)

Internally, the Azure provider performs a few LROs (long-running operations) as part of this creation call, polling Azure to ensure the resources were created correctly and reporting back any errors. If this polling were skipped, certain types of node-creation errors would be missed, and those failures would have to wait for the 15-minute registrationTTL to be detected, which is unacceptable in these cases.

Expected Behavior:
Karpenter should asynchronously provision a number of nodes equal to the Budget for Drift (in this case 5), replacing the Drifted nodes in accordance with the Budget.

Reproduction Steps (Please include YAML):

Deployed a dev version of Azure Karpenter from the HEAD of main using make az-all:
- de7dee7

Ran make az-taintsystemnodes to ensure workload pods would only land on Karpenter nodes.

Updated and applied examples/v1/general-purpose.yaml, setting karpenter.azure.com/sku-cpu to less than 3, budgets.nodes to "5", and consolidateAfter to Never:

# This example NodePool will provision general purpose instances
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
  annotations:
    kubernetes.io/description: "General purpose NodePool"
spec:
  disruption:
    consolidateAfter: Never
    budgets:
    - nodes: "5"
  # Optional: Uncomment if you want to put a cap on the max resources available for provisioning
  # limits:
  #   cpu: "30"
  template:
    metadata:
      labels:
        # required for Karpenter to predict overhead from cilium DaemonSet
        kubernetes.azure.com/ebpf-dataplane: cilium
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      startupTaints:
      # https://karpenter.sh/docs/concepts/nodepools/#cilium-startup-taint
      - key: node.cilium.io/agent-not-ready
        effect: NoExecute
        value: "true"
      expireAfter: Never
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: karpenter.azure.com/sku-family
        operator: In
        values: [D]
      # Optional: Uncomment if you want to add a restriction on max sku cpus.
      #           Useful for ensuring karpenter provisions multiple nodes for feature testing.
      - key: karpenter.azure.com/sku-cpu
        operator: Lt
        values: ["3"]
---
apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: default
  annotations:
    kubernetes.io/description: "General purpose AKSNodeClass for running Ubuntu2204 nodes"
spec:
  imageFamily: Ubuntu2204

Updated and applied examples/workloads/inflate.yaml, setting replicas to 10 and CPU requests to 1.1:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
      - image: mcr.microsoft.com/oss/kubernetes/pause:3.6
        name: inflate
        resources:
          requests:
            cpu: "1.1"
            memory: 256M

Updated imageFamily to AzureLinux in the AKSNodeClass to trigger Drift.

Versions:

  • Chart Version: Using Dev Build of Azure provider on HEAD of main with commit d84e7c8
    • sigs.k8s.io/karpenter: v1.0.5
  • Kubernetes Version (kubectl version): v1.29.9
@charliedmcb charliedmcb added the kind/bug Categorizes issue or PR as related to a bug. label Dec 3, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 3, 2024
@charliedmcb charliedmcb changed the title BUG: Budgets Don't allow for async Drift in combination with Azure. BUG: Budgets don't allow for async Drift in combination with Azure. Dec 3, 2024
@sftim

sftim commented Dec 10, 2024

For Karpenter core, is the ask here more of a feature request? (I don't see a bug in the core library).

@jonathan-innis
Member

+1 @sftim This strikes me as a feature request for the upstream library as well. @Bryce-Soghigian @tallaxes Perhaps y'all have some more thoughts on the interactions here between the upstream library and the Azure provider?

@jonathan-innis
Member

If this Polling is skipped, there would be an issue of certain types of errors on node creation being missed, and thus certain errors would have to wait for the registrationTTL of 15 min

Under what circumstances does this kind of issue occur in Azure? IMO, this may really just be a feature request in the Azure provider, if we can handle it there. FWIW, AWS doesn't have the same problem because it returns the response fairly quickly and gets us out of this loop.

@jonathan-innis
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 10, 2024
@jonathan-innis
Member

/area provider/azure

@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Dec 10, 2024
@sftim

sftim commented Dec 10, 2024

/remove-kind bug
/kind feature
/retitle Features to allow async drift / budgets (eg for Azure)

@k8s-ci-robot k8s-ci-robot changed the title BUG: Budgets don't allow for async Drift in combination with Azure. Features to allow async drift / budgets (eg for Azure) Dec 10, 2024
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 10, 2024
@charliedmcb
Author

@sftim the reason I filed this as a BUG instead of a feature request is that, from the design docs for Budgets, I assumed the feature would be compatible with Azure, but with the given implementation it doesn't work with the Azure provider. I tried finding a workaround on the Azure side; however, I haven't found a way to integrate the provider with the upstream implementation of this feature. I'm fine with this being labeled as a feature request, though, if that aligns better.

@charliedmcb
Author

charliedmcb commented Dec 10, 2024

@jonathan-innis, I don't think this is purely on the Azure side. I'm aware of the difference here between AWS's fast response and Azure's LRO. I did some testing on removing the polling from the Azure LRO: we get a fast response in that case, and consequently async drift/budgets work correctly. However, we then miss out on gracefully handling certain types of errors, as I mentioned. I can collect a list of some of those error types, though it's a little tricky to get a complete list, and other members of the Azure team have told me it would be good to handle them.

If we were to handle this adjustment jointly between upstream and the Azure provider, I could see something like:

  1. Azure shortcuts its polling to return quickly.
  2. The controller that currently watches for the registrationTTL could also watch for a Failure status on the NodeClaim, which the cloud provider (or another source) could set when it knows the node isn't going to provision.
  3. Azure would do its polling in the background and set that failure status, removing the NodeClaim early if it detected one of these issues.

We could also tackle this from the disruption side, triggering replacements in batches or in a slightly different way. That approach is currently blocked because the ProviderID is not returned and set on the NodeClaim when attempting to disrupt more nodes.
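The three-step proposal above could be sketched roughly as follows. To be clear, all names here (NodeClaim struct, "LaunchFailed" condition, createAsync, simulateFailedLaunch) are hypothetical stand-ins, not real Karpenter APIs:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal stand-in for a NodeClaim with status conditions.
type NodeClaim struct {
	mu         sync.Mutex
	Name       string
	Conditions map[string]bool
}

func (nc *NodeClaim) setCondition(t string) {
	nc.mu.Lock()
	defer nc.mu.Unlock()
	nc.Conditions[t] = true
}

func (nc *NodeClaim) hasCondition(t string) bool {
	nc.mu.Lock()
	defer nc.mu.Unlock()
	return nc.Conditions[t]
}

// createAsync models step 1 and step 3: it returns as soon as the
// instance request is accepted, then keeps polling the LRO in the
// background, marking the NodeClaim failed if the LRO errors.
func createAsync(nc *NodeClaim, pollLRO func() error, done chan<- struct{}) {
	go func() {
		defer close(done)
		if err := pollLRO(); err != nil {
			nc.setCondition("LaunchFailed")
		}
	}()
	// returns immediately; the disruption loop is no longer blocked
}

// simulateFailedLaunch wires the pieces together with an LRO that
// reports a (simulated) quota error.
func simulateFailedLaunch() *NodeClaim {
	nc := &NodeClaim{Name: "general-purpose-xxxxx", Conditions: map[string]bool{}}
	done := make(chan struct{})
	createAsync(nc, func() error {
		time.Sleep(10 * time.Millisecond) // simulated LRO latency
		return fmt.Errorf("simulated quota error")
	}, done)
	<-done // in reality, step 2's controller would watch for the condition
	return nc
}

func main() {
	nc := simulateFailedLaunch()
	if nc.hasCondition("LaunchFailed") {
		fmt.Println("NodeClaim can be removed early instead of waiting 15m for registrationTTL")
	}
}
```

The key property is that Create no longer blocks the disruption loop, while launch failures still surface well before the registrationTTL.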
