Description
When the AWS Load Balancer Controller encounters a RulesPerSecurityGroupLimitExceeded error while trying to add security group rules, it fails silently without updating the Ingress resource status. This makes it very difficult to diagnose the issue.
Current Behavior
- Controller logs show: Reconciler error... RulesPerSecurityGroupLimitExceeded
- Controller retries indefinitely (every ~2 minutes)
- Ingress resource shows NO error condition (kubectl describe ingress shows nothing wrong)
- ALB listener rules are NOT updated (blocked by security group failure)
- Users have no indication that reconciliation is failing
Expected Behavior
The controller should:
- Update the Ingress resource status with a condition indicating the error
- Emit a Kubernetes Event visible via kubectl get events:
Warning ReconcileFailed Security group rule limit exceeded
- Stop retrying after N attempts (or use exponential backoff)
- Provide actionable error message (e.g., "Security group limit reached. Remove X rules or request quota increase")
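The backoff item in the list above could look roughly like the following Go sketch. It is an illustration only, not the controller's actual requeue logic: the helper name backoffDelay, the 5 second base, and the 2 minute cap are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number `attempt` (0-based).
// The delay starts at base, doubles on each attempt, and never exceeds max.
// Hypothetical helper for illustration; not part of the controller.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			// Return early so the value cannot overflow on large attempts.
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Assumed values: 5s base delay, capped at 2 minutes.
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 5*time.Second, 2*time.Minute))
	}
}
```

With a cap like this, a permanently failing Ingress settles at one retry per cap interval instead of producing a steady stream of identical errors.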
Impact
- Silent failures: Changes to Ingress hosts don't take effect
- Difficult debugging: Users must check controller logs to find the issue
- No alerting: Monitoring systems can't detect the problem via Ingress status
Steps to Reproduce
- Create an Ingress whose security group rules approach the AWS quota limit (in this case the quota was raised from the default 60 to 150 rules)
- Reference a managed prefix list with a high max capacity (e.g., CloudFront prefix list pl-b6a144df with 55 max capacity)
- Update the Ingress to add another managed prefix list or change hosts
- Observe controller logs showing the RulesPerSecurityGroupLimitExceeded error
- Run kubectl describe ingress <name>: it shows no error condition
- Check the ALB console: listener rules are not updated
Frequency: Always (100% reproducible when hitting the limit)
User visibility: Zero - no indication in kubectl describe ingress or Events
Only discoverable by manually checking controller logs
Environment
- Controller Version: v2.13.4
- AWS Region: us-east-2
- Kubernetes version: 1.28+ (EKS)
- Installation Method: [Helm/Terraform/etc]
- Using Service or Ingress: Ingress
Logs
{"level":"error","ts":"2025-10-27T16:55:55Z","msg":"Reconciler error","controller":"ingress","object":{"name":"nonprod-cloudflare","namespace":"xxxx"},"namespace":"xxxx","name":"nonprod-cloudflare","reconcileID":"b3f58617-2002-407a-9343-0d6c94219c5d","error":"operation error EC2: AuthorizeSecurityGroupIngress, https response error StatusCode: 400, RequestID: 7541c0b9-2150-4c8b-a8f0-2aa1f7855fe7, api error RulesPerSecurityGroupLimitExceeded: The maximum number of rules per security group has been reached."}
Proposed Solution
- Add a Condition to the Ingress status:
  status:
    conditions:
    - type: Reconciled
      status: False
      reason: SecurityGroupLimitExceeded
      message: "Cannot add security group rules: limit of X reached. Current: Y, attempting to add: Z (prefix list max capacity). Remove unused rules or request quota increase."
- Emit a Kubernetes Event:
  kubectl get events | grep ingress
  # Warning ReconcileFailed Security group sg-xxx reached rule limit (X)
- Implement retry backoff to reduce log noise
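One possible shape for the first two items is sketched below in Go, under stated assumptions. The apiError interface is declared locally and only mirrors the ErrorCode method that AWS SDK for Go v2 API errors expose. The fakeAPIError type, the condition struct, and conditionFor are hypothetical stand-ins, not the controller's types.

```go
package main

import (
	"errors"
	"fmt"
)

// apiError mirrors the ErrorCode method of AWS SDK for Go v2 API errors.
// Declared locally so this sketch depends only on the standard library.
type apiError interface {
	error
	ErrorCode() string
}

// fakeAPIError stands in for the SDK error in this example.
type fakeAPIError struct{ code, msg string }

func (e fakeAPIError) Error() string     { return "api error " + e.code + ": " + e.msg }
func (e fakeAPIError) ErrorCode() string { return e.code }

const ruleLimitCode = "RulesPerSecurityGroupLimitExceeded"

// condition is a simplified, hypothetical status condition.
type condition struct {
	Type, Status, Reason, Message string
}

// conditionFor maps a reconcile error to a condition when the underlying
// API error code is the security group rule limit. It returns false for
// any other error.
func conditionFor(err error, sgID string) (condition, bool) {
	var ae apiError
	if !errors.As(err, &ae) || ae.ErrorCode() != ruleLimitCode {
		return condition{}, false
	}
	return condition{
		Type:   "Reconciled",
		Status: "False",
		Reason: "SecurityGroupLimitExceeded",
		Message: fmt.Sprintf(
			"Security group %s has reached its rule limit. Remove unused rules or request a quota increase.",
			sgID),
	}, true
}

func main() {
	sdkErr := fakeAPIError{code: ruleLimitCode, msg: "The maximum number of rules per security group has been reached."}
	// Wrap the error the way an SDK operation error would wrap it.
	err := fmt.Errorf("operation error EC2: AuthorizeSecurityGroupIngress: %w", sdkErr)
	if c, ok := conditionFor(err, "sg-xxx"); ok {
		fmt.Printf("%s=%s reason=%s\n%s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

Matching on the error code rather than on the message text keeps the check stable if the wording of the AWS message changes. The same condition text could also be reused as the Event message.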
Additional Context
AWS counts managed prefix lists by their maximum capacity, not current entries. This means a prefix list with 46 current entries but 55 max capacity counts as 55 rules. The controller should communicate this clearly.
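The arithmetic can be made concrete with a short Go sketch. Only the prefix list figures (46 entries, 55 max capacity) and the 150 rule quota come from this issue. The count of 90 plain CIDR rules, the second prefix list pl-example, and the names prefixListRef and effectiveRuleCount are assumptions for illustration.

```go
package main

import "fmt"

// prefixListRef describes a managed prefix list referenced by a security
// group rule. Hypothetical type for illustration.
type prefixListRef struct {
	ID         string
	Entries    int // entries currently in the list
	MaxEntries int // max capacity; this is what counts toward the quota
}

// effectiveRuleCount returns the number of rules counted against the quota:
// each plain CIDR rule counts as 1, and each prefix list reference counts
// as the list's max capacity regardless of its current entries.
func effectiveRuleCount(cidrRules int, lists []prefixListRef) int {
	total := cidrRules
	for _, pl := range lists {
		total += pl.MaxEntries
	}
	return total
}

func main() {
	quota := 150    // quota from this issue
	cidrRules := 90 // assumed number of plain CIDR rules

	lists := []prefixListRef{{ID: "pl-b6a144df", Entries: 46, MaxEntries: 55}}
	used := effectiveRuleCount(cidrRules, lists)
	fmt.Printf("one prefix list: %d of %d rules used\n", used, quota)

	// Adding a second list of the same capacity (hypothetical ID).
	lists = append(lists, prefixListRef{ID: "pl-example", Entries: 46, MaxEntries: 55})
	used = effectiveRuleCount(cidrRules, lists)
	fmt.Printf("two prefix lists: %d of %d rules used, over by %d\n", used, quota, used-quota)
}
```

Under these assumed numbers the first list fits (145 of 150), while adding a second pushes the total to 200, even though the two lists together hold only 92 actual entries.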