Pods not being scheduled with OPA Gatekeeper

How can OPA Gatekeeper break my cluster?

Gatekeeper registers a ValidatingWebhookConfiguration so that requests sent to kube-apiserver are screened against Gatekeeper's checks before they are accepted. If OPA Gatekeeper is down and the webhook is set to fail closed (failurePolicy: Fail), these requests fail. This breaks the kube-scheduler, because all of its update requests, including the leader-election lock update, are blocked. NOTE: OPA Gatekeeper can be set to fail open (failurePolicy: Ignore), i.e. if OPA Gatekeeper is down, the request is assumed to have been approved and moves forward.
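To see which mode a cluster is currently in, the `failurePolicy` of each webhook can be read from the API. The sketch below assumes the configuration is named `gatekeeper-validating-webhook-configuration` (the name used in the recovery commands later in this README); the helper name `show_failure_policy` is made up for illustration.

```shell
# Hypothetical helper: print each webhook in the Gatekeeper configuration
# together with its failurePolicy (Fail = fail closed, Ignore = fail open).
show_failure_policy() {
  kubectl get validatingwebhookconfiguration \
    gatekeeper-validating-webhook-configuration \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'
}

# Usage, against a live cluster:
#   show_failure_policy
```

Any webhook reported as `Fail` will reject requests while Gatekeeper is unreachable.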

Reproducing in a lab

Prerequisites
- Latest RKE
- Latest kubectl
- Latest helm
- 3 VMs (2 core, 4GB of RAM, 20GB root)
- SSH access to root on all nodes
- Internet access to GitHub and Docker Hub
- Running Docker Install Script
Edit the cluster.yml to include your node IPs
```
vi ./cluster.yml
```
Stand up the cluster
```
bash ./build.sh
```
Verify the cluster is up and healthy
```
bash ./verify.sh
```
Break the cluster
```
bash ./break.sh
```

Identifying the issue

Error messages in kube-scheduler logs.

```
docker logs --tail 10 -t kube-scheduler
```

```
2021-05-08T04:44:41.406070907Z E0508 04:44:41.405968       1 leaderelection.go:361] Failed to update lock: Internal error occurred: failed calling webhook "validation.gatekeeper.sh": Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admit?timeout=3s": dial tcp 10.43.104.236:443: connect: connection refused
```
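When the log is noisy, it can help to pull out only the webhook names from these errors. The following is a minimal sketch, assuming the error text has the same `failed calling webhook "<name>"` shape as the line above; `failing_webhooks` is a hypothetical helper name.

```shell
# Hypothetical helper: read log lines on stdin and print the name of each
# webhook that could not be called, one per line, deduplicated.
failing_webhooks() {
  sed -n 's/.*failed calling webhook "\([^"]*\)".*/\1/p' | sort -u
}

# Usage on the node running kube-scheduler. The 2>&1 is there because the
# scheduler's log lines are expected on the container's stderr stream:
#   docker logs --tail 100 -t kube-scheduler 2>&1 | failing_webhooks
```

Fed the log line shown above, the helper prints `validation.gatekeeper.sh`.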

The deployment will show as out-of-spec, but kubectl get pods won't show any errors.

```
kubectl get deployment/hello-world
kubectl get pods -l app=hello-world
```

```
NAME          READY   UP-TO-DATE   AVAILABLE   AGE
hello-world   3/4     3            3           23m
NAME                           READY   STATUS    RESTARTS   AGE
hello-world-678c699476-7h9q4   1/1     Running   0          12m
hello-world-678c699476-dmszr   1/1     Running   0          11m
hello-world-678c699476-f27jb   1/1     Running   0          24m
```
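Because the pod list looks healthy, the gap is only visible in the deployment's READY column. As a rough sketch, the table output can be filtered for deployments that are below their desired count; `out_of_spec` is a hypothetical helper name, and it assumes the default `kubectl get deployment` column layout shown above.

```shell
# Hypothetical helper: read `kubectl get deployment` table output on stdin and
# print every deployment whose ready count is below its desired count.
out_of_spec() {
  awk 'NR > 1 {
    split($2, ready, "/")                 # READY column looks like "3/4"
    if (ready[1] + 0 < ready[2] + 0)
      print $1 ": " ready[1] " of " ready[2] " replicas ready"
  }'
}

# Usage, against a live cluster:
#   kubectl get deployments | out_of_spec
```

For the `hello-world` deployment above this reports `hello-world: 3 of 4 replicas ready`.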

Troubleshooting

Find which node is currently running the kube-scheduler

```
NODE="$(kubectl get leases -n kube-system kube-scheduler -o 'jsonpath={.spec.holderIdentity}' | awk -F '_' '{print $1}')"
echo "kube-scheduler is the leader on node $NODE"
```
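The `awk` step in the command above keeps only the text before the first underscore of the lease's `holderIdentity`. The sketch below shows that split on its own, assuming the identity has the form `<node name>_<unique suffix>`; the sample value and the helper name `leader_node` are made up for illustration.

```shell
# Hypothetical helper: strip the suffix from a lease holderIdentity,
# leaving only the node name in front of the first underscore.
leader_node() {
  awk -F '_' '{print $1}'
}

# Illustrative identity; the part after "_" is a made-up unique ID.
echo "node-1_0f2b7c1e" | leader_node
```

The resulting name is the node to log in to before reviewing the container logs.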

Review the docker logs

```
docker logs --tail 100 -t kube-scheduler
```

Try

Restoring/Recovering

Setting the failure policy to fail open.

```
kubectl get ValidatingWebhookConfiguration gatekeeper-validating-webhook-configuration -o yaml | sed 's/failurePolicy.*/failurePolicy: Ignore/g' | kubectl apply -f -
```
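To check what the `sed` expression does before applying it to a live cluster, it can be run on a small sample first. This is only a sketch: `set_fail_open` is a hypothetical helper name, and the YAML is an illustrative fragment rather than a full manifest.

```shell
# Hypothetical helper: the same sed expression as above, reading a manifest
# on stdin and forcing every failurePolicy to Ignore.
set_fail_open() {
  sed 's/failurePolicy.*/failurePolicy: Ignore/g'
}

# Illustrative fragment of a webhook configuration, not a full manifest.
set_fail_open <<'EOF'
webhooks:
- name: validation.gatekeeper.sh
  failurePolicy: Fail
  timeoutSeconds: 3
EOF
```

The substitution starts at the word `failurePolicy`, so the leading indentation is preserved and the result stays valid YAML. On a live cluster, piping the rewritten manifest into `kubectl diff -f -` instead of `kubectl apply -f -` should preview the change without making it.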

If an open policy doesn't work, remove all Gatekeeper admission checks.

```
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io gatekeeper-validating-webhook-configuration
```

Preventive tasks

- Changing the failure policy to fail open. Doc
- Official OPA Gatekeeper Emergency Recovery
