Production-grade runtime security, anomaly detection, and incident response for Kubernetes. Three independent detection layers — Prometheus metrics, Loki logs, Falco syscalls — connected by Alertmanager and documented with real incident playbooks.
Most Kubernetes environments have zero runtime visibility.
A compromised container can exfiltrate data, pivot laterally,
and mine crypto for weeks before anyone notices.
| Before | After |
|---|---|
| Shell spawned in container → silent | Shell spawned → Falco fires in < 1 second |
| 500 failed logins → nobody knows | 10 failed logins/2min → Slack alert + playbook link |
| Memory exhaustion → surprise outage | 75% memory threshold → warning before OOM kill |
| Pod crash loop → user reports it | 3 restarts/15min → PagerDuty page fires |
| "What do we do?" → improvised | SEC-001, SEC-002 playbooks → 15-min containment SLA |
┌─────────────────────────────────────────────────────────────────────────────────┐
│ CLOUD-NATIVE THREAT DETECTION PLATFORM │
│ Kubernetes Namespace: threat-detection │
└─────────────────────────────────────────────────────────────────────────────────┘
╔═══════════════════════════════════════════════════════════════════════════════╗
║ ATTACK SIMULATION LAYER ║
║ ┌──────────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ ║
║ │ brute_force.py │ │ cpu_spike.py │ │ memory_ex.py │ │ kill_chain │ ║
║ │ T1110 BruteForce│ │ T1499 DoS │ │ T1499.004 │ │ Full APT sim│ ║
║ └────────┬─────────┘ └──────┬───────┘ └──────┬───────┘ └──────┬──────┘ ║
╚═══════════╪════════════════════╪═════════════════╪══════════════════╪════════╝
│ HTTP Requests │ │ │
╔═══════════▼════════════════════▼═════════════════▼══════════════════▼════════╗
║ APPLICATION LAYER (Python Flask + Gunicorn) ║
║ ║
║ /health /ready /metrics /login /load /memory /exec /probe ║
║ ║
║ UID 1001 │ ReadOnlyRootFS │ No SA token │ Drop ALL caps │ Seccomp ║
║ Prometheus metrics client → Counter, Gauge, Histogram ║
║ Structured JSON logs → stdout → captured by Promtail ║
╚══════════════════════════════╤═══════════════════════════════════════════════╝
│
┌──────────────────┴──────────────────┐
│ │
╔═══════════▼═══════════════╗ ╔═════════════▼═════════════╗
║ METRICS PIPELINE ║ ║ LOGGING PIPELINE ║
║ ║ ║ ║
║ ┌─────────────────────┐ ║ ║ ┌─────────────────────┐ ║
║ │ Prometheus │ ║ ║ │ Promtail │ ║
║ │ Scrapes /metrics │ ║ ║ │ DaemonSet per node │ ║
║ │ every 15 seconds │ ║ ║ │ Pipeline stages │ ║
║ │ 15-day retention │ ║ ║ │ Drop probe noise │ ║
║ └──────────┬──────────┘ ║ ║ └──────────┬──────────┘ ║
║ │ Evaluates ║ ║ │ Ships to ║
║ │ 12 rules ║ ║ ┌──────────▼──────────┐ ║
║ ┌──────────▼──────────┐ ║ ║ │ Loki │ ║
║ │ Alertmanager │ ║ ║ │ Label-indexed logs │ ║
║ │ Routing by │◄─╫───────╫──│ 4 LogQL alert rules│ ║
║ │ severity + team │ ║ ║ │ 30-day retention │ ║
║ │ Dedup + Inhibition │ ║ ║ └─────────────────────┘ ║
║ └──────────┬──────────┘ ║ ╚═══════════════════════════╝
╚═════════════╪═════════════╝
│
╔═════════════▼═══════════════════════════════════════════════════════════════╗
║ NOTIFICATION LAYER ║
║ ┌─────────────────┐ ┌──────────────────┐ ┌───────────────────────┐ ║
║ │ Slack Webhook │ │ PagerDuty │ │ Email (SMTP) │ ║
║ │ #sec-incidents │ │ Critical pages │ │ Email (configurable) │ ║
║ │ #platform-ops │ │ 30min re-alert │ │ HTML template │ ║
║ └─────────────────┘ └──────────────────┘ └───────────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════╝
╔═════════════════════════════════════════════════════════════════════════════╗
║ RUNTIME SECURITY LAYER ── Falco eBPF Syscall Interception ║
║ ║
║ Every node │ Every container │ Every syscall ║
║ ║
║ exec() → shell_spawned_in_container (T1059.004) CRITICAL ║
║ connect() → unexpected_outbound_connection (T1071) HIGH ║
║ open(WRITE) → write_sensitive_file (T1222) CRITICAL ║
║ setuid() → privilege_escalation_attempt (T1068) CRITICAL ║
║ open(/proc) → proc_filesystem_access (T1057) CRITICAL ║
║ ║
║ Falco → Falcosidekick → Alertmanager + Loki + Slack ║
╚═════════════════════════════════════════════════════════════════════════════╝
╔═════════════════════════════════════════════════════════════════════════════╗
║ ENFORCEMENT LAYER ║
║ ┌──────────────────────┐ ┌─────────────────────┐ ┌──────────────────┐ ║
║ │ NetworkPolicy │ │ ResourceQuota │ │ Pod Security │ ║
║ │ Default deny all │ │ 4 CPU / 4Gi hard │ │ Admission │ ║
║ │ 8 allowlist rules │ │ 20 pod limit │ │ restricted │ ║
║ │ Zero-trust east- │ │ No LoadBalancer │ │ profile enforced│ ║
║ │ west traffic │ │ No NodePort │ │ on namespace │ ║
║ └──────────────────────┘ └─────────────────────┘ └──────────────────┘ ║
╚═════════════════════════════════════════════════════════════════════════════╝
ATTACK OCCURS
│
▼
──────────────────────────────────────────────────────────────
LAYER 1 │ METRIC SIGNAL ~15-30s
──────────────────────────────────────────────────────────────
Flask Prometheus client increments counter
failed_logins_total{source_ip="x.x.x.x"} += 1
Prometheus scrapes /metrics every 15s
Alert rule evaluates: increase(failed_logins_total[2m]) > 10
PENDING → FIRING after `for:` duration
│
▼
──────────────────────────────────────────────────────────────
LAYER 2 │ LOG SIGNAL ~5-15s
──────────────────────────────────────────────────────────────
Structured log emitted:
AUTHENTICATION_FAILURE user=admin source_ip=x.x.x.x
Promtail pipeline extracts label: event_type=AUTHENTICATION_FAILURE
Loki ingests log stream with labels
LogQL rule fires: count_over_time > threshold
Loki ruler → Alertmanager → second correlated alert
│
▼
──────────────────────────────────────────────────────────────
LAYER 3 │ RUNTIME SIGNAL (exec/file/network) ~< 1s
──────────────────────────────────────────────────────────────
eBPF probe intercepts exec() syscall
Falco matches rule: shell_spawned_in_container
Falcosidekick fans out to Alertmanager + Loki + Slack
Correlation: same pod, overlapping time window
│
▼
──────────────────────────────────────────────────────────────
RESPONSE │ SOC ACTION
──────────────────────────────────────────────────────────────
Analyst receives Slack alert with direct playbook link
Opens Grafana: correlates metrics + logs + Falco events
Executes playbook: isolate → preserve evidence → eradicate
MTTC target: 15 minutes from first alert
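
The metric-layer check in the flow above can be approximated offline with a per-IP sliding window. This is a minimal sketch, not the project's alert rule: the class and function names are hypothetical, and a real Prometheus `increase()` works on scraped samples with extrapolation, so timings will differ from this exact event count.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 120  # mirrors the [2m] range in the alert rule
THRESHOLD = 10        # mirrors "> 10" in the alert rule


class FailedLoginWindow:
    """Tracks failed-login timestamps per source IP over a sliding window."""

    def __init__(self, window=WINDOW_SECONDS, threshold=THRESHOLD):
        self.window = window
        self.threshold = threshold
        self.events = defaultdict(deque)

    def record(self, source_ip, ts):
        """Record one failure at time ts (seconds); return True if the IP is over threshold."""
        q = self.events[source_ip]
        q.append(ts)
        # Drop failures that have aged out of the window.
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q) > self.threshold


def simulate(rate_per_second, count, source_ip="203.0.113.7"):
    """Return the 1-based index of the failure that first trips the threshold, or None."""
    window = FailedLoginWindow()
    for i in range(count):
        if window.record(source_ip, i / rate_per_second):
            return i + 1
    return None
```

Assuming every request fails and `--rate` means requests per second, `simulate(5, 25)` models the sequential brute-force run and trips on the 11th failure, while a slow attacker at one attempt every 20 seconds never exceeds the per-IP threshold.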
cloud-threat-detection/
│
├── 📦 app/
│ ├── app.py # Flask app — 10 endpoints, Prometheus metrics, attack surfaces
│ └── requirements.txt # Pinned Python dependencies
│
├── 🐳 docker/
│ ├── Dockerfile # Multi-stage build, UID 1001, readOnlyRootFS, health checks
│ └── .dockerignore
│
├── ☸️ k8s/
│ ├── namespace.yaml # PSA restricted enforcement
│ ├── serviceaccount.yaml # No API token mounted
│ ├── configmap.yaml
│ ├── deployment.yaml # Full securityContext, probes, resource limits
│ ├── service.yaml # ClusterIP only (no external exposure)
│ ├── network-policy.yaml # Default-deny + 8 allowlist rules
│ └── resource-quota.yaml # Namespace CPU/memory/object caps
│
├── 📊 monitoring/
│ ├── prometheus/
│ │ ├── prometheus-config.yaml # Scrape configs, pod discovery, self-monitoring
│ │ ├── alert-rules.yaml # 12 production alert rules across 6 groups
│ │ └── prometheus-deployment.yaml
│ │
│ ├── alertmanager/
│ │ ├── alertmanager-config.yaml # Routing tree, receivers, inhibition rules
│ │ └── alertmanager-deployment.yaml # Webhook simulator included
│ │
│ ├── loki/
│ │ ├── loki-config.yaml # Loki + 4 LogQL alert rules
│ │ └── loki-deployment.yaml # Loki StatefulSet + Promtail DaemonSet
│ │
│ ├── falco/
│ │ ├── falco-rules.yaml # 9 custom rules with MITRE ATT&CK mapping
│ │ └── falco-deployment.yaml # DaemonSet + Falcosidekick + RBAC
│ │
│ └── grafana/
│ └── grafana-deployment.yaml # Datasource provisioning (Prometheus + Loki)
│
├── 💥 attacks/
│ ├── brute_force.py # Sequential + distributed credential stuffing
│ ├── cpu_spike.py # CPU exhaustion with alert monitoring
│ ├── memory_exhaustion.py # Escalating memory pressure + OOM simulation
│ └── suspicious_commands.py # Full kill chain: recon → persistence → C2
│
├── 📋 docs/
│ ├── incident-playbook-brute-force.md # SEC-001 with forensic queries + containment
│ ├── incident-playbook-container-compromise.md # SEC-002 with 15min MTTC target
│ └── threat-model.md # STRIDE + MITRE ATT&CK for Containers
│
├── docker-compose.yaml # Local dev stack (no K8s required)
├── Makefile # deploy / attack / port-forward / verify targets
└── README.md
| Tool | Version | Purpose |
|---|---|---|
| kubectl | 1.28+ | Cluster management |
| helm | 3.x | Falco deployment |
| python3 | 3.10+ | Attack simulation scripts |
| docker | 24+ | Image build |
| CNI | Calico / Cilium | NetworkPolicy enforcement |
# 1. Clone
git clone https://github.com/codewithbrandon/cloud-threat-detection.git
cd cloud-threat-detection
# 2. Deploy everything with make
make deploy # namespace + monitoring + app + network policies
make deploy-falco # Falco via Helm (requires Linux node for eBPF)
# 3. Access dashboards
make port-forward
# 4. Verify the stack is healthy
make verify

# Start full monitoring stack locally
docker-compose up -d
# Verify all containers running
docker-compose ps
# View app logs
docker-compose logs -f app

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | anonymous viewer |
| Prometheus | http://localhost:9090 | none |
| Alertmanager | http://localhost:9093 | none |
| Application | http://localhost:8080 | — |
All scripts are safe — they target the local application only and generate the syscall/log/metric patterns that detection rules look for.
# Sequential: single IP rapid fire (tests per-IP Prometheus threshold)
python3 attacks/brute_force.py \
--target http://localhost:8080 \
--mode sequential --count 25 --rate 5
# Distributed: multiple IPs (tests global Loki LogQL threshold)
python3 attacks/brute_force.py \
--target http://localhost:8080 \
--mode distributed --count 60 --concurrency 4

Fires: ExcessiveFailedLogins → BruteForceAttackCritical → BruteForceInLogs
Verify:
# Prometheus
curl -s http://localhost:9090/api/v1/query \
--data 'query=sum(increase(failed_logins_total[2m]))by(source_ip)'
# Loki (in Grafana Explore)
{app="threat-detection-app"} |= "AUTHENTICATION_FAILURE" | json

python3 attacks/cpu_spike.py \
--target http://localhost:8080 \
--intensity 0.9 --duration 120

Fires: HighCPUUsage (>75% for 2min) → CriticalCPUSpike (>95% for 1min)
# Gradual escalation across 4 steps to 480MB (limit: 512MB)
python3 attacks/memory_exhaustion.py \
--target http://localhost:8080 \
--mode escalating --size 480 --steps 4 --hold 30

Fires: HighMemoryUsage → MemoryExhaustionCritical → Kubernetes OOM kill → PodCrashLoopDetected
# Runs: recon → network discovery → persistence → C2 beaconing
python3 attacks/suspicious_commands.py \
--target http://localhost:8080 \
--scenario kill-chain

Fires: Falco shell_spawned_in_container + unexpected_outbound_connection + write_sensitive_file
# Watch Falco alerts in real-time
kubectl logs -n threat-detection -l app=falco -f | \
jq '{rule: .rule, priority: .priority, pod: .output_fields."k8s.pod.name"}'

make attack-all

| Alert | Trigger | Severity | Channel | Response SLA |
|---|---|---|---|---|
| ExcessiveFailedLogins | >10 failures/2min/IP | | #security-alerts | 5 min |
| BruteForceAttackCritical | >50 failures/1min/IP | 🔴 CRITICAL | #security-incidents + page | Immediate |
| HighCPUUsage | CPU >75% for 2min | | #platform-alerts | 15 min |
| CriticalCPUSpike | CPU >95% for 1min | 🔴 CRITICAL | #platform-oncall + page | 5 min |
| HighMemoryUsage | Memory >384Mi for 2min | | #platform-alerts | 15 min |
| MemoryExhaustionCritical | Memory >460Mi | 🔴 CRITICAL | #platform-oncall + page | 5 min |
| High5xxErrorRate | 5xx >5% for 2min | | #platform-alerts | 15 min |
| ServiceUnavailable | 5xx >50% for 1min | 🔴 CRITICAL | #platform-oncall + page | 5 min |
| PodCrashLoopDetected | >3 restarts / 15min | 🔴 CRITICAL | #platform-oncall + page | 5 min |
| SuspiciousActivityDetected | suspicious_activity_total > 0 | | #security-alerts | 10 min |
| PrometheusTargetDown | Scrape target down 2min | 🔴 CRITICAL | #platform-oncall | 5 min |
| WatchdogHeartbeat | Always firing (dead man's switch) | 🔵 NONE | External monitor | N/A |
| Rule | Syscall Trigger | MITRE Technique | Priority |
|---|---|---|---|
| shell_spawned_in_container | exec() → sh/bash/zsh | T1059.004 | 🔴 CRITICAL |
| unexpected_outbound_connection | connect() to non-whitelisted IP | T1071 | 🟠 HIGH |
| write_sensitive_file | open(WRITE) on /etc, /bin, /usr | T1222 | 🔴 CRITICAL |
| dangerous_binary_in_container | exec() → wget, curl, nc, nmap | T1105 | 🟠 HIGH |
| container_running_as_root | spawned_process, UID=0 | T1078 | 🟠 HIGH |
| proc_filesystem_access | open() on /proc/1, /proc/kcore | T1057 | 🔴 CRITICAL |
| crypto_miner_detected | exec() → xmrig, stratum+tcp | T1496 | 🔴 CRITICAL |
| k8s_secret_access_in_container | open() on /var/run/secrets | T1552 | 🔴 CRITICAL |
| privilege_escalation_attempt | setuid()/setgid() succeeds | T1068 | 🔴 CRITICAL |
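
For scripted checks, the `jq` filter used to watch Falco alerts can be mirrored in Python. The sketch below assumes Falco is emitting JSON lines with the same `rule`, `priority`, and `output_fields."k8s.pod.name"` keys that the `jq` command reads; the function name and the sample values in the usage note are hypothetical.

```python
import json


def summarize_falco_event(line):
    """Parse one Falco JSON log line into {rule, priority, pod}.

    Returns None for lines that are not JSON objects or carry no rule
    (for example, Falco startup messages).
    """
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or "rule" not in event:
        return None
    fields = event.get("output_fields") or {}
    return {
        "rule": event["rule"],
        "priority": event.get("priority"),
        "pod": fields.get("k8s.pod.name"),
    }
```

Piping `kubectl logs` output through this function line by line gives one summary dict per alert, which makes it straightforward to assert that an expected rule name appeared after an attack simulation.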
Container Layer
✅ Non-root user (UID 1001) ✅ Read-only root filesystem
✅ No privilege escalation ✅ Drop ALL Linux capabilities
✅ Seccomp RuntimeDefault profile ✅ Multi-stage minimal image
Pod Layer
✅ No ServiceAccount token mounted ✅ Dedicated ServiceAccount
✅ Pod Security Admission: restricted ✅ Topology spread constraints
✅ Resource limits (CPU + memory) ✅ Liveness + readiness + startup probes
Namespace Layer
✅ Default-deny NetworkPolicy ✅ 8 explicit allowlist rules
✅ ResourceQuota (CPU/mem/objects) ✅ LimitRange (per-container defaults)
✅ No LoadBalancer services ✅ No NodePort services
Runtime Layer
✅ Falco eBPF syscall monitoring ✅ 9 custom rules (MITRE-mapped)
✅ Falcosidekick fan-out routing ✅ 12 Prometheus alert rules
✅ 4 LogQL log-based alert rules ✅ Dead man's switch heartbeat
✅ Alert deduplication + inhibition ✅ Multi-channel notification
| Tactic | Techniques Covered | Detection Layer |
|---|---|---|
| Initial Access | T1190 Exploit Public-Facing App | Loki + Falco |
| Execution | T1059.004 Unix Shell | Falco (exec syscall) |
| Persistence | T1222 File Permissions Modification | Falco (open syscall) |
| Privilege Escalation | T1068 Exploitation for Privilege Escalation, T1611 Escape to Host | Falco (setuid syscall) |
| Defense Evasion | T1070 Indicator Removal | Falco (readOnlyFS blocks) |
| Credential Access | T1552 Unsecured Credentials, T1110 Brute Force | Prometheus + Loki + Falco |
| Discovery | T1057 Process Discovery | Falco (exec syscall) |
| Lateral Movement | T1210 Exploitation of Remote Services | Falco + NetworkPolicy |
| Command & Control | T1071 Application Layer Protocol | Falco + NetworkPolicy |
| Exfiltration | T1041 Exfiltration Over C2 Channel | Falco + NetworkPolicy |
| Impact | T1496 Resource Hijacking, T1499 Endpoint DoS | Prometheus |
| Playbook | Scenario | Trigger | MTTC Target |
|---|---|---|---|
| SEC-001 | Brute Force / Credential Stuffing | BruteForceAttackCritical | N/A (alerting + block) |
| SEC-002 | Container Compromise / Runtime Attack | shell_spawned_in_container | 15 minutes |
Each playbook includes:
- Detection signal inventory
- Incident timeline template
- Triage decision tree
- Step-by-step containment commands
- Forensic evidence collection (before pod termination)
- Loki / Prometheus investigation queries
- Eradication and recovery procedures
- Lessons learned template
Walk me through your detection stack
The platform has three independent detection layers. Prometheus scrapes metrics every 15 seconds; I alert when failed_logins_total increases by more than 10 in 2 minutes for a single IP, which catches brute force. Promtail ships structured logs to Loki, with pipeline stages that extract event_type labels; I use LogQL count_over_time rules for distributed attacks that stay below per-IP thresholds. Falco intercepts syscalls (exec(), connect(), and open()) through an eBPF probe, which catches post-compromise behavior that metrics and logs miss entirely. All three feed Alertmanager, which groups, deduplicates, and routes to Slack, PagerDuty, and email by severity.
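
The log path described above (drop probe noise, extract an `event_type` label, ship the rest) can be sketched in plain Python. This is an illustration of the idea only, not Promtail's configuration or API; the `path` field name and the function name are assumptions.

```python
import json

# Endpoints whose access logs are treated as noise (from the Promtail drop stage).
PROBE_PATHS = {"/health", "/ready", "/metrics"}


def process_line(line):
    """Return (labels, line) for a log line to ship, or None if it is dropped."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return {}, line  # ship unparsed lines without extracted labels
    if not isinstance(record, dict):
        return {}, line
    # Assumption: the app logs the request path under a "path" key.
    if record.get("path") in PROBE_PATHS:
        return None
    labels = {}
    if "event_type" in record:
        labels["event_type"] = record["event_type"]
    return labels, line
```

Keeping only `event_type` as a label, rather than high-cardinality values such as `source_ip`, is consistent with Loki's label-indexed design; the IP stays in the log body and is parsed at query time.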
How do you handle alert fatigue?
Three mechanisms. First, Alertmanager inhibition rules suppress warning-level alerts when a critical on the same incident is already firing — BruteForceAttackCritical inhibits ExcessiveFailedLogins. Second, grouping batches related alerts into one notification instead of 50. Third, the drop pipeline stage in Promtail filters /health, /ready, and /metrics scrape logs before ingestion — Loki alert queries don't generate noise from expected traffic patterns.
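
The inhibition behavior can be modelled in a few lines. This is a simplified sketch of the concept, not Alertmanager's implementation or the contents of alertmanager-config.yaml; in particular, matching on an equal `source_ip` label is an assumption about how "the same incident" is defined.

```python
# One rule: a firing source alert suppresses the target alert
# when the labels listed in "equal" have the same values on both.
INHIBIT_RULES = [
    {
        "source": "BruteForceAttackCritical",
        "target": "ExcessiveFailedLogins",
        "equal": ("source_ip",),  # assumption, see lead-in
    },
]


def is_inhibited(alert, firing, rules=INHIBIT_RULES):
    """Return True if some firing alert suppresses `alert` under the rules."""
    for rule in rules:
        if alert["alertname"] != rule["target"]:
            continue
        for other in firing:
            if other["alertname"] != rule["source"]:
                continue
            if all(other["labels"].get(k) == alert["labels"].get(k) for k in rule["equal"]):
                return True
    return False


def apply_inhibition(firing, rules=INHIBIT_RULES):
    """Return the alerts that would still notify after inhibition."""
    return [a for a in firing if not is_inhibited(a, firing, rules)]
```

With both alerts firing for the same IP, only the critical one survives; a warning for a different IP is still delivered.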
What would you add if this were production?
Three things immediately. First, Istio or Cilium for mTLS between pods: right now NetworkPolicy provides network-level isolation but no identity-based encryption. Second, image signing with Cosign and an admission webhook to reject unsigned images, which closes the supply chain gap in my STRIDE threat model. Third, Kubernetes audit log shipping to Loki: right now I detect container behavior but not API server operations like unusual RBAC changes or secret access patterns.
How do you prove this actually works?
The attack simulation scripts are the test suite. brute_force.py triggers ExcessiveFailedLogins, and I time from first request to Slack notification: the SLA is 2 minutes, and the actual figure is typically 35-45 seconds. suspicious_commands.py triggers three Falco rules (shell_spawned_in_container, unexpected_outbound_connection, write_sensitive_file), and I verify that each appears in the Falco pod logs and in Alertmanager within 10 seconds. The dead man's switch, WatchdogHeartbeat, validates that the entire alerting pipeline is functional; if it stops firing, we have a bigger problem than any individual alert.
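
Timing an alert end to end can be scripted with a small polling helper. The sketch below is hypothetical and not part of the repository: `fetch_firing` is any callable that returns the set of currently firing alert names (in practice it could wrap a request to Alertmanager's `/api/v2/alerts` endpoint), and the clock and sleep functions are injectable so the helper can be tested without waiting.

```python
import time


def time_to_alert(fetch_firing, alertname, timeout=120.0, interval=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll until `alertname` is firing.

    Returns elapsed seconds since the call started, or None if the
    alert has not fired by the time `timeout` seconds have passed.
    """
    start = clock()
    while True:
        if alertname in fetch_firing():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

Starting the helper immediately before launching brute_force.py gives a rough time-to-detection figure to compare against the 2-minute SLA; the default `timeout` of 120 seconds reflects that SLA.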
Why Loki over Elasticsearch?
Loki is label-indexed, not full-text indexed. For security use cases, I know exactly what I'm searching for — specific event types, pod names, source IPs. Loki's structured label queries are orders of magnitude cheaper at scale. It integrates natively with Prometheus labels, enabling metric-to-log correlation in Grafana without context switching. Storage cost is roughly 90% lower than Elasticsearch at equivalent log volume.
Why do you use three detection layers instead of one?
Defense in depth. An attacker who compromises the application and stops writing logs still generates syscalls that Falco catches. An attack that stays below per-IP metric thresholds still shows up in global Loki log counts. A Falco rule gap doesn't mean the attack is invisible, because Prometheus still captures the metric signal. Each layer has different blind spots; combining them means an attacker has to evade all three simultaneously, which is substantially harder.
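
The second point, a distributed attack that stays under the per-IP threshold but trips a global log count, can be demonstrated with a short sketch. The per-IP threshold of 10 comes from the alert reference; the global threshold of 30 and the function name are assumptions for illustration, and the real check is a LogQL `count_over_time` rule rather than Python.

```python
import json
from collections import Counter


def evaluate(log_lines, per_ip_threshold=10, global_threshold=30):
    """Count AUTHENTICATION_FAILURE events per source IP and in total."""
    per_ip = Counter()
    for line in log_lines:
        record = json.loads(line)
        if record.get("event_type") == "AUTHENTICATION_FAILURE":
            per_ip[record.get("source_ip")] += 1
    total = sum(per_ip.values())
    return {
        # IPs that would trip the per-IP (metric-style) rule.
        "per_ip_alert": [ip for ip, n in per_ip.items() if n > per_ip_threshold],
        # Whether the global (log-style) rule would trip.
        "global_alert": total > global_threshold,
        "total": total,
    }
```

Eight source IPs with eight failures each produce no per-IP alert, yet the total of 64 failures exceeds the global threshold, which is the gap the log layer is meant to cover.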
Every push and pull request runs the full CI pipeline:
| Check | Tool | What It Catches |
|---|---|---|
| YAML lint | yamllint | Indentation, trailing spaces, type errors in all manifests |
| K8s schema validation | kubeconform | Invalid Kubernetes API fields against the 1.29 schema |
| Python lint | ruff | Import order, unused vars, style, pyupgrade suggestions |
| Python format | black | Consistent code formatting (fails on drift) |
| Secret scanning | gitleaks | Accidentally committed keys, tokens, passwords |
| Container CVE scan | trivy (image) | OS + library CVEs in the built Docker image (fails on CRITICAL) |
| Filesystem scan | trivy (fs) | IaC misconfigurations and secrets in source (informational) |
Results from Trivy appear in the GitHub Security tab.
# Install tools once
pip install ruff==0.4.10 black==24.4.2 yamllint==1.35.1
# Install kubeconform (this URL is the Linux amd64 build; on macOS, pick the matching asset from the same release)
curl -sSL https://github.com/yannh/kubeconform/releases/download/v0.6.7/kubeconform-linux-amd64.tar.gz \
| tar -xz -C /usr/local/bin kubeconform
# Run all checks
make lint # yaml + python
make validate-k8s # kubeconform schema validation
make lint-fix # auto-fix python formatting (writes files)

Full STRIDE analysis with risk register → docs/threat-model.md
Highest residual risks (by design decision):
- Supply chain compromise — mitigated by pinned base image digests; Cosign signing is the next control
- Distributed brute force below per-IP threshold — mitigated by global Loki rule; WAF is the next control
- Falco rule gap — mitigated by metric + log parallel detection; continuous rule testing is the process control