Welcome to the Kubernetes Troubleshooting Scenarios Simulator! This repository contains 35 real-world Kubernetes issues that you can simulate, analyze, and resolve. Whether you're a beginner or an experienced Kubernetes professional, this project will help you gain hands-on troubleshooting experience in Kubernetes environments.
- 35 real-world scenarios: Each scenario is designed to reflect actual Kubernetes issues you may encounter in production environments.
- Step-by-step troubleshooting: Follow clear instructions for simulating issues, investigating problems, and applying solutions.
- Hands-on practice: Learn by doing! Build your troubleshooting skills in a real Kubernetes environment.
- Docker integration: Learn how to build and run Docker containers for each scenario as needed.
- 35 Kubernetes troubleshooting scenarios covering networking, storage, security, probes, scheduling, resources, and more.
- No custom Docker images needed — all scenarios use standard public images (`busybox`, `nginx:alpine`, `polinux/stress`).
- Observable failures — 30 scenarios produce clear failure states (Pending, CrashLoopBackOff, OOMKilled, Error) visible via `kubectl get pods`.
- Interactive scenario runner (`manage-scenarios.sh`) to simulate, investigate, and fix issues step by step.
- Practical solutions and explanations to help you understand the root causes of issues.
- Kubernetes Cluster:
  - Ensure you have access to a Kubernetes cluster. You can use Minikube, Docker Desktop, or a cloud-managed service like AWS EKS, GKE, or AKS.
- Install kubectl:
  - `kubectl` is the Kubernetes command-line tool that you'll use to interact with your cluster.
  - Download and install it from the official Kubernetes site.
  - To verify the installation:

    ```bash
    kubectl version --client
    ```
- Install Docker:
  - Install Docker to build images locally or on your preferred CI/CD pipeline.
  - Download Docker from the official Docker site.
- Clone the Repository:
  - Clone the repository to your local machine:

    ```bash
    git clone https://github.com/vellankikoti/troubleshoot-kubernetes-like-a-pro.git
    cd troubleshoot-kubernetes-like-a-pro
    ```
In the root directory of the repository, you will find a script to help manage the scenarios. Make it executable:

```bash
chmod +x manage-scenarios.sh
```

Start the scenario management script:

```bash
./manage-scenarios.sh
```

The script will display a list of 35 scenarios. Each scenario corresponds to a real-world issue in Kubernetes. Enter the scenario number to simulate and resolve the issue. For example, enter 1 to simulate the "Affinity Rules Violation."

The script will apply the issue.yaml file to simulate the problem in your Kubernetes cluster. You can inspect the issue with:

```bash
kubectl describe pod <pod-name>
kubectl logs <pod-name>
```

After analyzing the issue, the script will guide you to apply the corresponding fix.yaml file to resolve the issue:

```bash
kubectl apply -f fix.yaml
```

Verify that the issue is resolved by checking the pod's status:
```bash
kubectl get pods
```

The repository is laid out as follows:

```
troubleshoot-kubernetes-like-a-pro/
├── scenarios/
│   ├── affinity-rules-violation/
│   │   ├── description.md            # Explains the issue, causes, and fix
│   │   ├── issue.yaml                # Creates the broken state
│   │   └── fix.yaml                  # Resolves the issue
│   ├── crashloopbackoff/
│   ├── dns-resolution-failure/
│   └── ... (35 scenarios)
├── docs/
│   ├── pod-lifecycle.md
│   └── troubleshooting-guide.md
├── scripts/
│   └── manage-scenarios.sh
├── manage-scenarios.sh               # Interactive scenario runner
└── README.md
```
- scenarios/: Contains 35 scenario folders, each with `issue.yaml`, `fix.yaml`, and `description.md`.
- docs/: Supplementary guides on pod lifecycle and troubleshooting.
- scripts/: Contains the scenario management script.
The repository includes the following 35 Kubernetes troubleshooting scenarios:
| # | Scenario | Failure Type | Observable State |
|---|---|---|---|
| 1 | Affinity Rules Violation | Scheduling | Pod Pending |
| 2 | CGroup Issues | Resource / OOM | OOMKilled / CrashLoopBackOff |
| 3 | Cluster Autoscaler Issues | Scheduling | Pods Pending (100 replicas) |
| 4 | Container Runtime (CRI) Errors | Runtime | RuntimeClass not found |
| 5 | Crash Due to Insufficient Disk Space | Storage | Pod Evicted |
| 6 | CrashLoopBackOff | Application | CrashLoopBackOff |
| 7 | Disk IO Errors | Storage | Container Error |
| 8 | DNS Resolution Failure | Networking | DNS timeout (pod runs but DNS fails) |
| 9 | Failed Resource Limits | Resource / OOM | OOMKilled / CrashLoopBackOff |
| 10 | File Permissions on Mounted Volumes | Storage / Security | Read-only filesystem error |
| 11 | Firewall Restriction | Networking | Egress blocked by NetworkPolicy |
| 12 | Image Pull Backoff | Image | ImagePullBackOff |
| 13 | Image Pull Error | Image | ErrImagePull |
| 14 | Ingress Configuration Issue | Networking | Misconfigured Ingress |
| 15 | Insufficient Resources | Scheduling | Pod Pending |
| 16 | Liveness Probe Failure | Probes | CrashLoopBackOff (probe fails) |
| 17 | Liveness & Readiness Failure | Probes | CrashLoopBackOff + Not Ready |
| 18 | LoadBalancer Service Misconfiguration | Networking | Service targets wrong selector |
| 19 | Network Connectivity Issues | Networking | Egress blocked by NetworkPolicy |
| 20 | Node Affinity Issue | Scheduling | Pod Pending |
| 21 | OOM Killed | Resource / OOM | OOMKilled / CrashLoopBackOff |
| 22 | Outdated Kubernetes Version | Best Practice | Running (educational) |
| 23 | Persistent Volume Claim Issues | Storage | Pod Pending (PVC unbound) |
| 24 | PID Namespace Collision | Security | Running with hostPID (security risk) |
| 25 | Pod Disruption Budget Violations | Availability | PDB blocks disruption |
| 26 | Port Binding Issues | Networking | Port conflict |
| 27 | Readiness Probe Failure | Probes | Running but Not Ready (0/1) |
| 28 | Resource Requests & Limits Mismatch | Resource | API rejection (limit < request) |
| 29 | Security Context Issues | Best Practice | Running as root (educational) |
| 30 | SELinux/AppArmor Policy Violation | Best Practice | Running (educational) |
| 31 | Service Account Permissions Issue | RBAC | Pod creation fails (SA not found) |
| 32 | Service Port Mismatch | Networking | Service port mismatch |
| 33 | Taints and Tolerations Mismatch | Scheduling | Pod Pending |
| 34 | Volume Mount Issue | Storage | API rejection (volume not defined) |
| 35 | Wrong Container Command | Application | RunContainerError |
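To illustrate the pattern the scenarios follow, a scheduling failure like scenario #1 can be triggered by a pod whose node affinity no node can satisfy. The manifest below is a hypothetical sketch of such an issue.yaml (the label key and values are assumptions, not the repository's actual manifest):

```yaml
# Hypothetical sketch of an affinity-rules issue manifest.
# The pod requires a node label that no node carries, so the
# scheduler leaves it permanently Pending.
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype          # assumed label; no node has it
                operator: In
                values:
                  - does-not-exist
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
```

A fix.yaml for a scenario like this would relax or correct the affinity rule, or label a node so the requirement can be satisfied.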
- Hard Failures (30 scenarios): These produce a clear, observable failure state (Pending, CrashLoopBackOff, OOMKilled, Error, etc.) that you can diagnose with `kubectl get pods`, `kubectl describe`, and `kubectl logs`.
- Best Practice / Educational (3 scenarios): Scenarios #22 (Outdated K8s Version), #29 (Security Context), and #30 (SELinux/AppArmor) are configuration best-practice demos. Both the issue and fix pods will show as Running — the learning is in understanding the security implications of the configuration difference.
- CNI-Dependent (2 scenarios): Scenarios #11 (Firewall Restriction) and #19 (Network Connectivity) use Kubernetes NetworkPolicy to simulate blocked traffic. These require a CNI plugin that supports NetworkPolicy enforcement (e.g., Calico, Cilium, Weave Net). They will not demonstrate the issue on clusters using basic flannel or Docker Desktop's default CNI.
- Always check the logs and describe the pod to identify the issue:

  ```bash
  kubectl logs <pod-name>
  kubectl describe pod <pod-name>
  ```
- If a fix doesn't resolve the issue, verify the cluster configuration and try reapplying the scenario.
- For NetworkPolicy scenarios (#11, #19), ensure your cluster has a CNI that supports NetworkPolicy (e.g., Calico, Cilium).
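The NetworkPolicy-based scenarios rely on a policy along these lines: a deny-all-egress policy selecting the scenario's pod, which a conforming CNI enforces and a non-enforcing CNI silently ignores. The manifest below is a minimal sketch of the idea, not necessarily the repository's exact manifest (the label name is an assumption):

```yaml
# Minimal sketch of a deny-all-egress NetworkPolicy, similar in
# spirit to what scenarios #11 and #19 apply. With policyTypes
# listing Egress and an empty egress list, all outbound traffic
# from matching pods is blocked — on a CNI that enforces policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-egress-demo
spec:
  podSelector:
    matchLabels:
      app: network-demo        # assumed label on the scenario pod
  policyTypes:
    - Egress
  egress: []                   # no egress rules = nothing allowed
```

If a pod subject to such a policy can still reach external hosts, that is a strong sign your CNI does not enforce NetworkPolicy.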
Contributions are welcome! Feel free to:
- Add new scenarios: If you have an interesting or challenging Kubernetes issue, contribute by adding it to this repo.
- Improve existing scenarios: Fix bugs, improve documentation, or suggest enhancements.
To contribute, please submit a pull request with your changes.
This project is licensed under the MIT License.
This project is maintained by Koti. Thank you for exploring the Kubernetes Troubleshooting Scenarios Simulator!