
bug: APISIX keeps stale (deleted) Pod IP in upstream after scale-down, causing 111: Connection refused #2708

@vortegatorres

Description


Current Behavior

After scaling a backend Deployment up and then back down, APISIX continues to keep the removed Pod IP in the upstream/endpoints set and keeps trying to route traffic to it.

Observed timeline (UTC, sanitized IPs):

  • ~11:20 UTC: scaled backend from 3 → 4 replicas for ~1 hour (new pod IP = X).
  • ~12:20 UTC: scaled back down 4 → 3 (pod with IP X removed).
  • After scale-down, APISIX still routed to X:8000 (stale) and produced spikes of:
    • 111: Connection refused
  • Many traces/logs during the error window showed both the stale and a valid upstream in the same request; e.g. (the second attempt succeeded after a retry):
    • upstream_addr: "X:8000, Y:8000"
  • Hours later (~04:00 UTC), after restarting the APISIX deployment, the stale IP stopped appearing in upstream_addr and the errors stopped.

Expected Behavior

After the Pod is deleted and Kubernetes Endpoints/EndpointSlices are updated, APISIX should stop routing to the removed Pod IP immediately (or within the expected controller sync/update window), and should not keep stale upstream targets.
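One way to confirm the mismatch is to diff the Pod IPs Kubernetes currently advertises against the targets APISIX is still routing to. A minimal sketch; the two address sets are assumed to be collected separately (e.g. from `kubectl get endpointslices` and from the `upstream_addr` values in the APISIX access logs):

```python
def stale_targets(routed: set[str], live: set[str]) -> set[str]:
    # Any address APISIX still routes to that Kubernetes no longer
    # advertises in its EndpointSlices is a stale target.
    return routed - live

# After the scale-down in the report, X:8000 was still routed but no longer live:
print(stale_targets({"X:8000", "Y:8000"}, {"Y:8000"}))  # {'X:8000'}
```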

Error Logs

For some requests, the log first shows the (111: Connection refused) error:

2026-01-29 04:05:10.638 error 2026/01/29 03:05:10 [error] 49#49: *1392456 connect() failed (111: Connection refused) while connecting to upstream, client: XXXXXX, server: _, request: "POST /api HTTP/1.1", upstream: "http://YYYYYY:8000/api", host: "url" 

It is then followed by the successfully retried request, with both IPs (stale IP + correct IP) in upstream_addr:

2026-01-29 04:05:11.430 {
  "ts": "2026-01-29T03:05:11+00:00",
  "service": "apisix",
  "resp_body_size": "0",
  "host": "url",
  "address": "XXXXX",
  "request_length": "595",
  "method": "POST",
  "uri": "/api",
  "status": "204",
  "user_agent": "Go-http-client/2.0",
  "resp_time": "0.512",
  "upstream_addr": "X:8000, Y:8000",
  "upstream_status": "502, 204",
  "traceparent": "00-0bc72f67dd2ee53ce6d7f03e4a4eb7d6-3a5171c169e8eb46-01",
  "trace_id": "0bc72f67dd2ee53ce6d7f03e4a4eb7d6",
  "span_id": "3a5171c169e8eb46",
  "org_slug": "",
  "matched_uri": ""
} 
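Entries like the one above can be found mechanically: a stale-target retry shows up as a request whose upstream_addr lists more than one address and whose first upstream_status is a 5xx. A minimal sketch over JSON access-log lines (the field names match the log sample above; everything else is illustrative):

```python
import json

def find_retried(lines):
    """Yield (failed_addr, final_status) for log entries where the first
    upstream attempt failed and APISIX retried on another target."""
    for line in lines:
        entry = json.loads(line)
        addrs = [a.strip() for a in entry.get("upstream_addr", "").split(",")]
        statuses = [s.strip() for s in entry.get("upstream_status", "").split(",")]
        if len(addrs) > 1 and statuses and statuses[0].startswith("5"):
            yield addrs[0], statuses[-1]

sample = '{"upstream_addr": "X:8000, Y:8000", "upstream_status": "502, 204"}'
print(list(find_retried([sample])))  # [('X:8000', '204')]
```

Running this over the error window surfaces the stale IP (here X:8000) as the consistently failing first target.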

Steps to Reproduce

This error is not deterministic and does not always happen, but the following is the closest way to reproduce it:

  • Deploy APISIX using the APISIX Helm chart v2.12.6 in standalone mode (no etcd).
  • Deploy a backend service behind APISIX (any Kubernetes Service/Deployment that APISIX routes to).
  • Run steady traffic through APISIX to that backend.
  • Temporarily scale the backend Deployment up (e.g., from 3 replicas to 4) and keep it running for ~1 hour.
  • Scale the backend back down to the original replica count (e.g., 4 → 3), so one Pod is terminated and its IP is removed.
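The scale cycle in the last two steps can be scripted. A minimal sketch; the Deployment name `backend` and namespace `default` are hypothetical placeholders for your own resources:

```python
import shlex
import subprocess
import time

def scale_cmd(deployment: str, replicas: int, namespace: str = "default") -> list[str]:
    # Build the kubectl command that resizes the Deployment.
    return shlex.split(
        f"kubectl -n {namespace} scale deployment/{deployment} --replicas={replicas}"
    )

def run_cycle(deployment: str, up: int, down: int, hold_secs: int) -> None:
    # Scale up, hold long enough for traffic to reach the new Pod,
    # then scale back down so one Pod (and its IP) is removed.
    subprocess.run(scale_cmd(deployment, up), check=True)
    time.sleep(hold_secs)  # ~1 hour in the original report
    subprocess.run(scale_cmd(deployment, down), check=True)

# Example: run_cycle("backend", up=4, down=3, hold_secs=3600)
```

Keep steady traffic flowing through APISIX for the whole cycle, then watch the access logs for the removed Pod IP reappearing in upstream_addr.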

Environment

  • APISIX Helm chart 2.12.6.
  • APISIX Ingress controller version: 2.0.1
  • Kubernetes cluster version: 1.32
  • OS version if running APISIX Ingress controller in a bare-metal environment: Linux x86_64 GNU/Linux
  • ACD: 0.23.1
