Pods failing with Init:CrashLoopBackOff
oc logs centraldashboard-*** -n kubeflow -c istio-init
gives
iptables-restore --noflush /tmp/iptables-rules-1648485349812569495.txt081246391
iptables-restore v1.6.1: iptables-restore: unable to initialize table 'nat'
Fix: load the missing kernel modules on the affected worker node(s):
modprobe br_netfilter; modprobe nf_nat; modprobe xt_REDIRECT; modprobe xt_owner; modprobe iptable_nat; modprobe iptable_mangle; modprobe iptable_filter
(see istio/istio#23009)
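If you cannot SSH into the node directly, a minimal sketch using oc debug (the node name worker-0 is a placeholder; repeat for every affected worker node):
oc debug node/worker-0 -- chroot /host sh -c 'modprobe br_netfilter; modprobe nf_nat; modprobe xt_REDIRECT; modprobe xt_owner; modprobe iptable_nat; modprobe iptable_mangle; modprobe iptable_filter'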
Upon deletion, a namespace remains in Terminating state.
oc get ns
Checking if any apiservice is unavailable and hence doesn't serve its resources:
kubectl get apiservice|grep False
(see kubernetes/kubernetes#60807 (comment))
Finding all resources that still exist via:
oc api-resources --verbs=list --namespaced -o name | xargs -t -n 1 oc get --show-kind --ignore-not-found -n $PROJECT_NAME
(see https://access.redhat.com/solutions/4165791)
Example issue (invalid CA bundle):
oc get --show-kind --ignore-not-found -n istio-system inferenceservices.serving.kserve.io
Error from server (InternalError): Internal error occurred: error resolving resource
Example issue (resources remaining in namespace):
oc get --show-kind --ignore-not-found -n user-example-com servicemeshmembers.maistra.io
NAME CONTROL PLANE READY AGE
servicemeshmember.maistra.io/default istio-system/kubeflow False 2d23h
Confirm:
oc get -n istio-system inferenceservices.serving.kserve.io
Understand:
oc describe crd inferenceservices.serving.kserve.io
Check for invalid CA bundle:
oc get crd inferenceservices.serving.kserve.io -o yaml | grep caBundle:
(see https://access.redhat.com/solutions/6913481)
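To inspect the bundle itself, a sketch that decodes the CRD's conversion-webhook caBundle (this assumes the CRD has a conversion webhook configured; only the first certificate of the bundle is printed):
oc get crd inferenceservices.serving.kserve.io -o jsonpath='{.spec.conversion.webhook.clientConfig.caBundle}' | base64 -d | openssl x509 -noout -subject -issuer -enddate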
Force-Delete buggy CRD:
kubectl patch crd/inferenceservices.serving.kserve.io -p '{"metadata":{"finalizers":[]}}' --type=merge
(see kubernetes/kubernetes#60538 (comment))
Follow cleanup guidelines. (see https://docs.openshift.com/container-platform/4.11/service_mesh/v2x/removing-ossm.html#ossm-remove-cleanup_removing-ossm)
Force-Delete buggy CRD:
kubectl patch crd/servicemeshmembers.maistra.io -p '{"metadata":{"finalizers":[]}}' --type=merge
Delete buggy resources:
kubectl get ingresses.networking.internal.knative.dev --no-headers=true | awk '{print $1}' | xargs kubectl patch -p '{"metadata":{"finalizers":[]}}' --type=merge ingresses.networking.internal.knative.dev
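If individual force-deletes are not enough, a last-resort sketch that clears finalizers on everything the api-resources loop above still finds (assumes cluster-admin; user-example-com is the example namespace from above):
# WARNING: removing finalizers skips controllers' cleanup logic - use only on a namespace you are discarding
PROJECT_NAME=user-example-com
for res in $(oc api-resources --verbs=list --namespaced -o name); do
  oc get "$res" -n "$PROJECT_NAME" -o name --ignore-not-found | xargs -r -n 1 oc patch -n "$PROJECT_NAME" --type=merge -p '{"metadata":{"finalizers":[]}}'
done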
When installing Kubeflow, you get messages like this:
Error from server (InternalError): error when creating "/opt/kubeflow/git/kubeflow-ppc64le-manifests/overlays/openshift": Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": x509: certificate signed by unknown authority
oc logs -n cert-manager cert-manager-***
gives
E1130 16:41:00.253828 1 leaderelection.go:325] error retrieving resource lock kube-system/cert-manager-controller: configmaps "cert-manager-controller" is forbidden: User "system:serviceaccount:cert-manager:cert-manager" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
oc logs -n cert-manager cert-manager-cainjector-***
gives something similar but for service account cert-manager-cainjector
HotFix (not recommended for production; determine more fine-grained policies):
oc adm policy add-cluster-role-to-user cluster-admin system:serviceaccount:cert-manager:cert-manager
oc adm policy add-cluster-role-to-user cluster-admin system:serviceaccount:cert-manager:cert-manager-cainjector
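A narrower sketch than cluster-admin, assuming (per the log above) the controllers only need the leader-election objects in kube-system (role and rolebinding names are examples):
# grant only leader-election access in kube-system instead of cluster-admin
oc create role cert-manager-leader-election -n kube-system --verb=get,list,watch,create,update,patch --resource=configmaps,leases
oc create rolebinding cert-manager-leader-election -n kube-system --role=cert-manager-leader-election --serviceaccount=cert-manager:cert-manager --serviceaccount=cert-manager:cert-manager-cainjector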
Trying to get a resource, even across all namespaces (e.g., oc get Subscription -A), yields:
No resources found
even though the resource should be there.
Find out the API Group, e.g.:
kubectl api-resources -o wide | grep -i Subscription
This may give multiple resources with the same name, e.g.:
subscriptions sub messaging.knative.dev/v1 true Subscription [delete deletecollection get list patch create update watch]
subscriptions sub,subs operators.coreos.com/v1alpha1 true Subscription [delete deletecollection get list patch create update watch]
Use fully qualified resource name when getting resources:
oc get Subscription.operators.coreos.com -A
oc get po -n kubeflow
gives:
training-operator-... 0/1 CrashLoopBackOff
If oc describe and oc logs don't indicate anything specific, the liveness/readiness probe settings may be too strict - see https://github.com/kubeflow/training-operator/pull/1568/files.
oc edit deploy training-operator -n kubeflow
...and increase initialDelaySeconds, periodSeconds, and timeoutSeconds for the livenessProbe and readinessProbe.
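A non-interactive alternative sketch (the values and the container index 0 are examples; check the deployment's existing probes first, since the paths below must already exist):
oc -n kubeflow patch deploy training-operator --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds", "value": 30},
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 60},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds", "value": 30},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
]'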
Even after 2 login attempts, login fails
There might be a leftover from a previous user. Check that these resources look correct:
oc get Users
oc get Identity
Delete incorrect resources, e.g.:
oc delete identity "Local Password:[email protected]"
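A minimal cleanup sketch (user and identity names are examples matching the output above; both objects are typically recreated on the next successful login):
# remove the stale user object and its identity
oc delete user [email protected]
oc delete identity "Local Password:[email protected]"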
- Getting a CrashLoopBackOff, no logs, and an Exit Code: 132 (when oc describe pod XYZ).
- Getting an Error: signal: illegal instruction (core dumped)
If you run the workload on an older architecture (e.g., Power8), try it on a newer one (Power9, Power10):
- Remove old workers from your setup
- Label your workloads so they are only scheduled on newer architectures (see the sketch after this list)
- Down-port your workload
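A minimal labeling sketch (the label key/value, node name, namespace, and deployment name are examples, not conventions from this repository):
# label a newer node, then pin a deployment to such nodes via a nodeSelector
oc label node worker-0 power-generation=power9
oc -n kubeflow patch deploy my-workload --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"power-generation":"power9"}}}}}'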
- Errors in:
oc logs -n kubeflow admission-webhook-deployment
- Getting an Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?)
- Getting an Unable to connect to the server: x509: certificate has expired or is not yet valid (see: https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/troubleshooting/expired-certs)
- Getting a multiple CertificateRequests were found for the ... (when oc logs cert-manager...)
- Certificates may be expired. Check certificate expiration times:
echo -e "NAMESPACE\tNAME\tEXPIRY" && oc get secrets -A -o go-template='{{range .items}}{{if eq .type "kubernetes.io/tls"}}{{.metadata.namespace}}{{" "}}{{.metadata.name}}{{" "}}{{index .data "tls.crt"}}{{"\n"}}{{end}}{{end}}' | while read namespace name cert; do echo -en "$namespace\t$name\t"; echo $cert | base64 -d | openssl x509 -noout -enddate; done | column -t
(see: https://access.redhat.com/solutions/3930291; for vanilla k8s use kubeadm certs check-expiration; see: https://cloud.google.com/anthos/clusters/docs/bare-metal/latest/troubleshooting/expired-certs)
- Too many certificate requests may have been created. Check the number of certificate requests:
oc get certificaterequest -A
- cert-manager might have this issue: istio/istio#32825 (comment)
- Remove certificaterequests:
oc get certificaterequest --no-headers=true | awk '{print $1}' | xargs oc delete certificaterequest
(see: https://www.ibm.com/mysupport/s/defect/aCI3p000000XkwVGAS/dt173371)
- Add the following to the env property via oc edit deploy istiod-kubeflow -n istio-system:
  - name: ENABLE_LEGACY_FSGROUP_INJECTION
    value: "true"
oc login results in Unable to connect to the server: EOF
- Certificate is either expired or corrupt.
- Check the pending CSRs:
oc get csr -A
- Check the etcd pods' health:
oc get pods -n openshift-etcd
- Login via installation kubeconfig file:
export KUBECONFIG=/root/ocp4/auth/kubeconfig
oc get nodes
- Clean up all CSRs:
oc delete csr --all
- Delete the bad certificate:
oc delete secret -n openshift-config-managed kube-controller-manager-client-cert-key
- Approve the CSRs as they come in and verify that they are now "Approved,Issued" (a loop for this is sketched after this list):
oc get csr -o name | xargs oc adm certificate approve
oc get csr -A
- If etcd and kube-scheduler are broken, patch them:
oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
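For the "approve the CSRs as they come in" step, a minimal loop sketch (interrupt it with Ctrl+C once no new Pending CSRs show up):
# repeatedly approve whatever CSRs exist and show their state
while true; do
  oc get csr -o name | xargs -r oc adm certificate approve
  oc get csr
  sleep 30
done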