Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update rule management to avoid sporadic 503 errors #4039

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

shraddhabang
Copy link
Collaborator

@shraddhabang shraddhabang commented Feb 4, 2025

Issue

#3816

Description

The root cause for the above issue was identified as a race condition in the ALB's rule management process. The controller uses ModifyRule API sequentially to add any new rule at the existing priority which results in loss of the existing older rule at that priority. When the first API call to modify rules is received, the ALB waits 10 seconds to batch all subsequent changes before deploying them to the data plane. If some rule modification APIs arrive after the initial 10-second window, they are implemented in the next batch, leading to a brief period where a random rule may not exist, causing 404 errors.

The proposed improvement in the rule management logic make use of SetRulePriorities API to first push the unwanted rules down in the listener instead of deleting them. This creates the gaps in the rules order to add any new/updated rule. Once the new rule is successfully added, the controller will then delete all the unwanted rules. This way we ensure that no rules will be lost before the new/updated rules are present in the controller. This will avoid any sporadic 404 errors caused due to race conditions.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 4, 2025
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2025
"resourceID", resLR.ID(),
"arn", awssdk.ToString(sdkLR.ListenerRule.RuleArn))
return nil
func (m *defaultListenerRuleManager) SetRulePriorities(ctx context.Context, unmatchedSDKLRs []ListenerRuleWithTags, lastAvailablePriority int32) (int32, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unmatchedSDKLRs seems to be the wrong name. Aren't these listener rules that are matched but just at the wrong priority?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read further on in the implementation seems we use this function for matched and unmatched rules.

Copy link
Collaborator

@wweiwei-li wweiwei-li Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move "elasticloadbalancing:SetRulePriorities" to where we have condition, like other modify calls

"Condition": {
    "Null": {
        "aws:ResourceTag/elbv2.k8s.aws/cluster": "false"
    }
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I agree. Nice catch fixed now. :)


matchedResAndSDKLRs, unmatchedResLRs, unmatchedSDKLRs := matchResAndSDKListenerRules(resLRs, sdkLRs)
// matchedResAndSDKLRsBySettings : A slice of matched resLR and SDKLR rule pairs that have matching settings like actions and conditions
// unmatchedResLRs : A slice of resLR) that do not have a corresponding match in the sdkLRs. These rules need to be created on the load balancer.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resLR) typo

Copy link
Collaborator Author

@shraddhabang shraddhabang Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. Thanks :)

if err != nil {
return err
}
resLR.SetStatus(lrStatus)
}
for _, resAndSDKLR := range matchedResAndSDKLRs {
// Update existing listener rules on the load balancer for their tags
for _, resAndSDKLR := range matchedResAndSDKLRsBySettings {
lsStatus, err := s.lrManager.Update(ctx, resAndSDKLR.resLR, resAndSDKLR.sdkLR)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this lrManager.Update is used only for update tags with your change ? should we change the method name to make it clear

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Updated.

sdkLR: sdkLR,
})
}
unmatchedSDKLRs = append(unmatchedSDKLRs[:i], unmatchedSDKLRs[i+1:]...)
Copy link
Collaborator

@zac-nixon zac-nixon Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I am not a fan of modifying the input of a function like this. I see that we do this is in other parts of the code but overall I think it's an anti-pattern. Using this pattern also means you have to do multiple modification to i to get the index correct.

I would prefer to just allocate a new array to use.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done this to avoid any duplicate desired res rules matching the settings to be paired to same sdk rule.

break
}
}
if !found {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will still result in 503s (please correct me if I'm wrong)

Basically, from my read of the code, when a user changes their conditions or actions we will delete the ListenerRule and re-create the rule. If we're unlucky, it's possible for the delete to get propagated and the create to be delayed a bit which causes the 503 errors.

Ideally, we can use modify-rule https://docs.aws.amazon.com/cli/latest/reference/elbv2/modify-rule.html to detect these rules with the changed actions / conditions and modify the rule in place, rather than delete and create the rule.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. I have updated the code to avoid such scenarios. We will first push down all the unmatched SDK LRs which are not matching any priority or settings down. We will reprioritize the rules which are matching the settings to match their desired priorities and then we will either modify/create the rules at their desired priorities and in the end delete all the unwanted pushed down rules. This will we will have have rules on the listener before we get a newer/updated one at higher priority.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 25, 2025
sdkLRs = append(sdkLRs, sdkLR)
lastAvailablePriority--
}
//Reprioratize matched rules by settings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Reprioritize

desiredActions, err := buildSDKActions(resLR.Spec.Actions, m.featureGates)
if err != nil {
func (m *defaultListenerRuleManager) SetRulePriorities(ctx context.Context, matchedResAndSDKLRsBySettings []resAndSDKListenerRulePair, unmatchedSDKLRs []ListenerRuleWithTags) error {
var lastAvailablePriority int32 = 50000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for my own understanding, it's impossible for a user to model a rule with 50k priority in K8s. It would have to be done via an out of band approach therefore would also be eligible for removal

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

Copy link
Collaborator

@zac-nixon zac-nixon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this before, but we should have some unit tests here.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shraddhabang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 28, 2025
@shraddhabang
Copy link
Collaborator Author

Sorry I missed this before, but we should have some unit tests here.

Done. Please take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants