Pods not being scheduled with OPA Gatekeeper

How can OPA Gatekeeper break my cluster?

Gatekeeper registers a ValidatingWebhookConfiguration so that kube-apiserver sends update requests to Gatekeeper for validation before accepting them. If OPA Gatekeeper is down and the webhook is set to fail closed, these requests fail. That breaks the kube-scheduler, because its own update requests (including the leader-election lock update) are blocked. NOTE: OPA Gatekeeper can be set to fail open, i.e., if OPA Gatekeeper is down, the request is assumed to be approved and is allowed through.
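To see which mode a cluster is currently in, the failure policy of each Gatekeeper webhook can be read from the webhook configuration. The following is a minimal sketch, assuming the configuration is named gatekeeper-validating-webhook-configuration (the name used in the recovery steps of this document); the function name gatekeeper_failure_policies is only illustrative.

```shell
# Print each webhook name and its failurePolicy, one per line.
# Fail   = fail closed (requests are rejected while Gatekeeper is unreachable)
# Ignore = fail open   (requests are allowed through)
# Assumption: the configuration is named gatekeeper-validating-webhook-configuration.
gatekeeper_failure_policies() {
  kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'
}
```

Running gatekeeper_failure_policies against the lab cluster before and after the break should make it clear whether the webhook is set to Fail or Ignore.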

Reproducing in a lab

  • Prerequisites
  • Edit the cluster.yml to include your node IPs
    vi ./cluster.yml
    
  • Stand up the cluster
    bash ./build.sh
    
  • Verify the cluster is up and healthy
    bash ./verify.sh
    
  • Break the cluster
    bash ./break.sh
    

Identifying the issue

  • Error messages in kube-scheduler logs.
    docker logs --tail 10 -t kube-scheduler
    
    2021-05-08T04:44:41.406070907Z E0508 04:44:41.405968       1 leaderelection.go:361] Failed to update lock: Internal error occurred: failed calling webhook "validation.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admit?timeout=3s": dial tcp 10.43.104.236:443: connect: connection refused
    
  • The deployment will show as out-of-spec (fewer ready replicas than desired), but kubectl get pods won't show any errors.
    kubectl get deployment/hello-world
    kubectl get pods -l app=hello-world
    
    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    hello-world   3/4     3            3           23m 
    NAME                           READY   STATUS    RESTARTS   AGE
    hello-world-678c699476-7h9q4   1/1     Running   0          12m
    hello-world-678c699476-dmszr   1/1     Running   0          11m
    hello-world-678c699476-f27jb   1/1     Running   0          24m
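
Both symptoms above can be checked with a couple of small helpers. This is a rough sketch rather than a definitive check: the function names are hypothetical, and the match string is taken from the kube-scheduler log line shown earlier.

```shell
# Count log lines that report a failed admission webhook call (reads stdin).
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# function usable under "set -e"; it still prints 0 in that case.
count_webhook_failures() {
  grep -c 'failed calling webhook' || true
}

# Succeed when a READY value such as "3/4" shows fewer ready replicas
# than desired. An empty ready count (e.g. "/4") is treated as 0.
is_under_replicated() {
  ready="${1%%/*}"
  desired="${1##*/}"
  ready="${ready:-0}"
  [ "$ready" -lt "$desired" ]
}
```

For example, docker logs --tail 100 -t kube-scheduler 2>&1 | count_webhook_failures gives the number of recent webhook failures on the scheduler node, and is_under_replicated "3/4" succeeds for the hello-world output shown above.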
    

Troubleshooting

  • Find which node is currently running the kube-scheduler
    NODE="$(kubectl get leases -n kube-system kube-scheduler -o 'jsonpath={.spec.holderIdentity}' | awk -F '_' '{print $1}')"
    echo "kube-scheduler is the leader on node $NODE"
    
  • Review the docker logs
    docker logs --tail 100 -t kube-scheduler
    
  • Try

Restoring/Recovering

  • Setting the failure policy to fail open.
    kubectl get ValidatingWebhookConfiguration gatekeeper-validating-webhook-configuration -o yaml | sed 's/failurePolicy.*/failurePolicy: Ignore/g' | kubectl apply -f -
    
  • If an open policy doesn't work, remove all Gatekeeper admission checks.
    kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration
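
As an alternative to the sed pipeline in the first step, the failure policy can also be changed with kubectl patch and a JSON Patch document. The helper below is a minimal sketch that only builds the patch; its name and the idea of passing the webhook count as an argument are assumptions made for illustration, and the count must match the number of entries under webhooks in the configuration.

```shell
# Build a JSON Patch that sets failurePolicy to Ignore on the first N
# webhooks of a webhook configuration. N is passed as the first argument.
build_failure_policy_patch() {
  count="$1"
  patch=""
  i=0
  while [ "$i" -lt "$count" ]; do
    # Separate patch operations with commas.
    if [ -n "$patch" ]; then
      patch="${patch},"
    fi
    patch="${patch}{\"op\":\"replace\",\"path\":\"/webhooks/${i}/failurePolicy\",\"value\":\"Ignore\"}"
    i=$((i + 1))
  done
  printf '[%s]\n' "$patch"
}
```

The result can be passed to kubectl patch validatingwebhookconfiguration gatekeeper-validating-webhook-configuration --type=json -p "$(build_failure_policy_patch 1)", adjusting the count to the number of webhooks listed in the configuration. Once the policy is Ignore, kubectl rollout status deployment/hello-world can be used to watch the deployment return to its desired replica count.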
    

Preventive tasks