
bug: APISIX keeps stale (deleted) Pod IP in upstream after scale-down, causing 111: Connection refused #2708

@vortegatorres

Description


Current Behavior

After scaling a backend Deployment up and then back down, APISIX continues to keep the removed Pod IP in the upstream/endpoints set and keeps trying to route traffic to it.

Observed timeline (UTC, sanitized IPs):

  • ~11:20 UTC: scaled backend from 3 → 4 replicas for ~1 hour (new pod IP = X).
  • ~12:20 UTC: scaled back down 4 → 3 (pod with IP X removed).
  • After scale-down, APISIX still routed to X:8000 (stale) and produced spikes of:
    • 111: Connection refused
  • Many traces/logs during the error window showed both the stale and a valid upstream in the same request; e.g. (the second attempt succeeded after a retry):
    • upstream_addr: "X:8000, Y:8000"
  • Hours later (~04:00 UTC), after restarting the APISIX deployment, the stale IP stopped appearing in upstream_addr and the errors stopped.

Expected Behavior

After the Pod is deleted and Kubernetes Endpoints/EndpointSlices are updated, APISIX should stop routing to the removed Pod IP immediately (or within the expected controller sync/update window), and should not keep stale upstream targets.
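One way to confirm the mismatch is to diff the Pod IPs Kubernetes currently advertises against the targets APISIX is still routing to. A minimal sketch; the two address sets are assumed to be collected separately (e.g. from `kubectl get endpointslices` and from the `upstream_addr` values in the APISIX access logs):

```python
def stale_targets(routed: set[str], live: set[str]) -> set[str]:
    # Any address APISIX still routes to that Kubernetes no longer
    # advertises in its EndpointSlices is a stale target.
    return routed - live

# After the scale-down in the report, X:8000 was still routed but no longer live:
print(stale_targets({"X:8000", "Y:8000"}, {"Y:8000"}))  # {'X:8000'}
```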

Error Logs

For some requests, the log first shows the (111: Connection refused) error:

2026-01-29 04:05:10.638 error 2026/01/29 03:05:10 [error] 49#49: *1392456 connect() failed (111: Connection refused) while connecting to upstream, client: XXXXXX, server: _, request: "POST /api HTTP/1.1", upstream: "http://YYYYYY:8000/api", host: "url" 

It is then followed by the successfully retried request, with both IPs (stale IP + correct IP) in upstream_addr:

2026-01-29 04:05:11.430 {
  "ts": "2026-01-29T03:05:11+00:00",
  "service": "apisix",
  "resp_body_size": "0",
  "host": "url",
  "address": "XXXXX",
  "request_length": "595",
  "method": "POST",
  "uri": "/api",
  "status": "204",
  "user_agent": "Go-http-client/2.0",
  "resp_time": "0.512",
  "upstream_addr": "X:8000, Y:8000",
  "upstream_status": "502, 204",
  "traceparent": "00-0bc72f67dd2ee53ce6d7f03e4a4eb7d6-3a5171c169e8eb46-01",
  "trace_id": "0bc72f67dd2ee53ce6d7f03e4a4eb7d6",
  "span_id": "3a5171c169e8eb46",
  "org_slug": "",
  "matched_uri": ""
} 
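Entries like the one above can be found mechanically: a stale-target retry shows up as a request whose upstream_addr lists more than one address and whose first upstream_status is a 5xx. A minimal sketch over JSON access-log lines (the field names match the log sample above; everything else is illustrative):

```python
import json

def find_retried(lines):
    """Yield (failed_addr, final_status) for log entries where the first
    upstream attempt failed and APISIX retried on another target."""
    for line in lines:
        entry = json.loads(line)
        addrs = [a.strip() for a in entry.get("upstream_addr", "").split(",")]
        statuses = [s.strip() for s in entry.get("upstream_status", "").split(",")]
        if len(addrs) > 1 and statuses and statuses[0].startswith("5"):
            yield addrs[0], statuses[-1]

sample = '{"upstream_addr": "X:8000, Y:8000", "upstream_status": "502, 204"}'
print(list(find_retried([sample])))  # [('X:8000', '204')]
```

Running this over the error window surfaces the stale IP (here X:8000) as the consistently failing first target.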

Steps to Reproduce

This error is not deterministic and does not always happen, but the following is the closest way to reproduce it:

  • Deploy APISIX using the APISIX Helm chart v2.12.6 in standalone mode (no etcd).
  • Deploy a backend service behind APISIX (any Kubernetes Service/Deployment that APISIX routes to).
  • Run steady traffic through APISIX to that backend.
  • Temporarily scale the backend Deployment up (e.g., from 3 replicas to 4) and keep it running for ~1 hour.
  • Scale the backend back down to the original replica count (e.g., 4 → 3), so one Pod is terminated and its IP is removed.
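The scale cycle in the last two steps can be scripted. A minimal sketch; the Deployment name `backend` and namespace `default` are hypothetical placeholders for your own resources:

```python
import shlex
import subprocess
import time

def scale_cmd(deployment: str, replicas: int, namespace: str = "default") -> list[str]:
    # Build the kubectl command that resizes the Deployment.
    return shlex.split(
        f"kubectl -n {namespace} scale deployment/{deployment} --replicas={replicas}"
    )

def run_cycle(deployment: str, up: int, down: int, hold_secs: int) -> None:
    # Scale up, hold long enough for traffic to reach the new Pod,
    # then scale back down so one Pod (and its IP) is removed.
    subprocess.run(scale_cmd(deployment, up), check=True)
    time.sleep(hold_secs)  # ~1 hour in the original report
    subprocess.run(scale_cmd(deployment, down), check=True)

# Example: run_cycle("backend", up=4, down=3, hold_secs=3600)
```

Keep steady traffic flowing through APISIX for the whole cycle, then watch the access logs for the removed Pod IP reappearing in upstream_addr.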

Environment

  • APISIX Helm chart 2.12.6.
  • APISIX Ingress controller version: 2.0.1
  • Kubernetes cluster version: 1.32
  • OS version if running APISIX Ingress controller in a bare-metal environment: Linux x86_64 GNU/Linux
  • ACD: 0.23.1
