Description
Current Behavior
After scaling a backend Deployment up and then back down, APISIX keeps the removed Pod IP in its upstream/endpoints set and continues trying to route traffic to it.
Observed timeline (UTC, sanitized IPs):
- ~11:20 UTC: scaled backend from 3 → 4 replicas for ~1 hour (new pod IP = X).
- ~12:20 UTC: scaled back down 4 → 3 (pod with IP X removed).
- After scale-down, APISIX still routed to X:8000 (stale) and produced spikes of:
- 111: Connection refused
- Many traces/logs during the error window showed both the stale and a valid upstream in the same request (the second attempt succeeded after a retry), e.g.:
- upstream_addr: "X:8000, Y:8000"
- Hours later (~04:00 UTC), after restarting the APISIX deployment, the stale IP stopped appearing in upstream_addr and the errors stopped.
Expected Behavior
After the Pod is deleted and Kubernetes Endpoints/EndpointSlices are updated, APISIX should stop routing to the removed Pod IP immediately (or within the expected controller sync/update window), and should not keep stale upstream targets.
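To make this expectation checkable, here is a minimal sketch for spotting stale targets: diff the upstream_addr values seen in APISIX access logs against the endpoint IPs Kubernetes currently reports (e.g. from EndpointSlices). The helper name and the sample IPs are illustrative, not part of APISIX.

```python
# Sketch: flag stale upstream targets by comparing upstream_addr values from
# access logs against the currently live endpoint IPs. Names and IPs below
# are hypothetical examples.

def find_stale_upstreams(upstream_addrs: set, live_ips: set, port: int = 8000) -> set:
    """Return logged upstream addresses whose IP is not in the live endpoint set."""
    live = {"%s:%d" % (ip, port) for ip in live_ips}
    return upstream_addrs - live

seen = {"10.0.1.7:8000", "10.0.1.9:8000"}  # collected from upstream_addr fields
live = {"10.0.1.9"}                        # from EndpointSlices after scale-down
print(find_stale_upstreams(seen, live))    # → {'10.0.1.7:8000'}
```

Any non-empty result after the controller sync window indicates APISIX is still holding a removed Pod IP, which is the behavior reported here.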
Error Logs
For some requests, we first see the (111: Connection refused) error:
2026-01-29 04:05:10.638 error 2026/01/29 03:05:10 [error] 49#49: *1392456 connect() failed (111: Connection refused) while connecting to upstream, client: XXXXXX, server: _, request: "POST /api HTTP/1.1", upstream: "http://YYYYYY:8000/api", host: "url"
Then the request is retried successfully, with both IPs (the stale one and a correct one) in upstream_addr:
2026-01-29 04:05:11.430 {
"ts": "2026-01-29T03:05:11+00:00",
"service": "apisix",
"resp_body_size": "0",
"host": "url",
"address": "XXXXX",
"request_length": "595",
"method": "POST",
"uri": "/api",
"status": "204",
"user_agent": "Go-http-client/2.0",
"resp_time": "0.512",
"upstream_addr": "X:8000, Y:8000",
"upstream_status": "502, 204",
"traceparent": "00-0bc72f67dd2ee53ce6d7f03e4a4eb7d6-3a5171c169e8eb46-01",
"trace_id": "0bc72f67dd2ee53ce6d7f03e4a4eb7d6",
"span_id": "3a5171c169e8eb46",
"org_slug": "",
"matched_uri": ""
}
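In these entries, upstream_addr and upstream_status are parallel comma-separated lists, so the per-attempt retry chain can be reconstructed from the log line itself. A small sketch (field names follow the log format shown above):

```python
import json

# Sketch: rebuild the retry chain from an access-log entry. upstream_addr
# and upstream_status are parallel comma-separated lists, one item per
# attempted upstream.
def retry_chain(entry: dict) -> list:
    addrs = [a.strip() for a in entry["upstream_addr"].split(",")]
    statuses = [s.strip() for s in entry["upstream_status"].split(",")]
    return list(zip(addrs, statuses))

log_line = '{"upstream_addr": "X:8000, Y:8000", "upstream_status": "502, 204"}'
print(retry_chain(json.loads(log_line)))
# → [('X:8000', '502'), ('Y:8000', '204')]  first attempt hit the stale pod
```

This is what the sample entry above shows: a 502 from the removed Pod IP, then a 204 from a live one.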
Steps to Reproduce
This error is not deterministic and does not always happen, but the closest way to reproduce it is:
- Deploy APISIX using the APISIX Helm chart v2.12.6 in standalone mode (no etcd).
- Deploy a backend service behind APISIX (any Kubernetes Service/Deployment that APISIX routes to).
- Run steady traffic through APISIX to that backend.
- Temporarily scale the backend Deployment up (e.g., from 3 replicas to 4) and keep it running for ~1 hour.
- Scale the backend back down to the original replica count (e.g., 4 → 3), so one Pod is terminated and its IP is removed.
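For the "steady traffic" step, a minimal traffic-generator sketch that tallies outcomes, so connection-refused spikes stand out after the scale-down (the URL, payload, and request count are hypothetical; point it at your APISIX route):

```python
# Sketch: send steady POSTs through APISIX and tally outcomes. A jump in
# "connection_error" or 502 counts after the scale-down matches the
# behavior reported in this issue. URL and payload are placeholders.
import time
import urllib.error
import urllib.request
from collections import Counter

def run_steady_traffic(url: str, requests: int, interval: float = 0.0) -> Counter:
    """POST to `url` `requests` times; return a Counter of status codes/errors."""
    outcomes = Counter()
    for _ in range(requests):
        req = urllib.request.Request(url, data=b"{}", method="POST")
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                outcomes[resp.status] += 1
        except urllib.error.HTTPError as e:
            outcomes[e.code] += 1              # e.g. 502 from a stale upstream
        except (urllib.error.URLError, OSError):
            outcomes["connection_error"] += 1  # e.g. (111: Connection refused)
        time.sleep(interval)
    return outcomes

if __name__ == "__main__":
    print(run_steady_traffic("http://apisix.example/api", 10, 0.5))
```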
Environment
- APISIX Helm chart 2.12.6.
- APISIX Ingress controller version: 2.0.1
- Kubernetes cluster version: 1.32
- OS version if running APISIX Ingress controller in a bare-metal environment: Linux x86_64 GNU/Linux
- ACD: 0.23.1