Security group limit error blocks reconciliation without visible Ingress status #4416

@abbassoltanian-usmobile

Description

When the AWS Load Balancer Controller encounters a RulesPerSecurityGroupLimitExceeded error while trying to add security group rules, it logs the failure but does not update the Ingress resource status. From the Ingress itself the failure is invisible, which makes the issue very difficult to diagnose.

Current Behavior

  1. Controller logs show: Reconciler error... RulesPerSecurityGroupLimitExceeded
  2. Controller retries indefinitely (every ~2 minutes)
  3. Ingress resource shows NO error condition (kubectl describe ingress shows nothing wrong)
  4. ALB listener rules are NOT updated (blocked by security group failure)
  5. Users have no indication that reconciliation is failing

Expected Behavior

The controller should:

  1. Update the Ingress resource status with a condition indicating the error
  2. Emit a Kubernetes Event visible via kubectl get events:
   Warning  ReconcileFailed  Security group rule limit exceeded
  3. Stop retrying after N attempts (or use exponential backoff)
  4. Provide an actionable error message (e.g., "Security group limit reached. Remove X rules or request quota increase")

Impact

  • Silent failures: Changes to Ingress hosts don't take effect
  • Difficult debugging: Users must check controller logs to find the issue
  • No alerting: Monitoring systems can't detect the problem via Ingress status

Steps to Reproduce

  1. Create an Ingress whose security group rule count approaches the AWS quota (in this case raised from the default of 60 to 150 rules)
  2. Reference a managed prefix list with high max capacity (e.g., CloudFront prefix list pl-b6a144df with 55 max capacity)
  3. Update Ingress to add another managed prefix list or change hosts
  4. Observe controller logs showing RulesPerSecurityGroupLimitExceeded error
  5. Run kubectl describe ingress <name> - shows no error condition
  6. Check ALB console - listener rules are not updated

Frequency: Always (100% reproducible when hitting the limit)
User visibility: Zero - no indication in kubectl describe ingress or Events
Only discoverable by manually checking controller logs

Environment

  • Controller Version: v2.13.4
  • AWS Region: us-east-2
  • Kubernetes version: 1.28+ (EKS)
  • Installation Method: [Helm/Terraform/etc]
  • Using Service or Ingress: Ingress

Logs

{"level":"error","ts":"2025-10-27T16:55:55Z","msg":"Reconciler error","controller":"ingress","object":{"name":"nonprod-cloudflare","namespace":"xxxx"},"namespace":"xxxx","name":"nonprod-cloudflare","reconcileID":"b3f58617-2002-407a-9343-0d6c94219c5d","error":"operation error EC2: AuthorizeSecurityGroupIngress, https response error StatusCode: 400, RequestID: 7541c0b9-2150-4c8b-a8f0-2aa1f7855fe7, api error RulesPerSecurityGroupLimitExceeded: The maximum number of rules per security group has been reached."}
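
Until the failure is surfaced on the Ingress, one possible workaround is to scan the controller logs for this error. The following is a minimal sketch, assuming the controller emits JSON lines shaped like the entry above; the function name find_sg_limit_errors is hypothetical and not part of the controller.

```python
import json

# Error code as it appears in the "error" field of the log entry above.
ERROR_CODE = "RulesPerSecurityGroupLimitExceeded"


def find_sg_limit_errors(lines):
    """Yield (namespace, name, timestamp) for each reconcile error
    caused by the security group rule limit."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("level") != "error":
            continue
        if ERROR_CODE not in str(entry.get("error", "")):
            continue
        obj = entry.get("object") or {}
        yield obj.get("namespace"), obj.get("name"), entry.get("ts")
```

The function accepts any iterable of lines, so it could be fed sys.stdin with the controller's log output piped in, and each yielded tuple could be forwarded to an alerting system.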

Proposed Solution

  1. Add a Condition to the Ingress status:

    status:
      conditions:
      - type: Reconciled
        status: "False"
        reason: SecurityGroupLimitExceeded
        message: "Cannot add security group rules: limit of X reached. Current: Y, attempting to add: Z (prefix list max capacity). Remove unused rules or request quota increase."
  2. Emit a Kubernetes Event:

    kubectl get events | grep ingress
    # Warning  ReconcileFailed  Security group sg-xxx reached rule limit (X)
  3. Implement retry backoff to reduce log noise
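
To illustrate the retry backoff in item 3, here is a minimal sketch of a capped exponential schedule. The base delay, factor, cap, and attempt count are illustrative assumptions, not values taken from the controller, and backoff_delays is a hypothetical helper.

```python
def backoff_delays(base_seconds=5.0, factor=2.0, cap_seconds=1000.0, attempts=8):
    """Return the delay (in seconds) to wait before each retry.

    Each delay is `factor` times the previous one, never exceeding
    `cap_seconds`. All defaults are illustrative assumptions.
    """
    delays = []
    delay = base_seconds
    for _ in range(attempts):
        delays.append(min(delay, cap_seconds))
        delay *= factor
    return delays
```

With these assumed defaults, retries spread out quickly instead of repeating at a fixed interval, which reduces both log noise and repeated failing AuthorizeSecurityGroupIngress calls.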

Additional Context

AWS counts managed prefix lists by their maximum capacity, not current entries. This means a prefix list with 46 current entries but 55 max capacity counts as 55 rules. The controller should communicate this clearly.
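
The counting rule above can be sketched as simple arithmetic. This is a minimal illustration, assuming each rule that references a prefix list counts as that list's maximum capacity; the function names and the figure of 100 existing rules are hypothetical, while the capacity of 55 and the quota of 150 come from this report.

```python
def effective_rule_count(plain_rules, prefix_list_max_capacities):
    """Rules counted against the security group quota.

    plain_rules: number of ordinary (non-prefix-list) rules.
    prefix_list_max_capacities: max capacity of the prefix list referenced
    by each prefix-list rule (current entry counts are irrelevant).
    """
    return plain_rules + sum(prefix_list_max_capacities)


def fits_quota(plain_rules, prefix_list_max_capacities, quota):
    """True if the security group stays within the given rule quota."""
    return effective_rule_count(plain_rules, prefix_list_max_capacities) <= quota


# Hypothetical example: 100 existing rules plus one reference to a prefix
# list with max capacity 55 (46 current entries) against a quota of 150.
total = effective_rule_count(100, [55])
within_quota = fits_quota(100, [55], 150)
```

In this hypothetical case the group counts as 155 rules, so the request exceeds the quota of 150 even though only 146 entries are actually in use.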

Labels

triage/unresolved: Indicates an issue that can not or will not be resolved.
