-
Notifications
You must be signed in to change notification settings - Fork 758
Add nexthop group usage check in route_check #4137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add nexthop group usage check in route_check #4137
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a nexthop group usage monitoring feature to the route_check script to prevent resource exhaustion issues. The implementation retrieves CRM (Critical Resource Monitoring) statistics from the COUNTERS_DB and alerts when nexthop group usage exceeds 80%, helping detect similar issues that previously caused traffic impact due to excessive nexthop group consumption.
Key Changes:
- Added CRM-based nexthop group usage monitoring with an 80% threshold check
- Integrated the check into the existing route validation flow in
check_routes_for_namespace() - Added support for COUNTERS_DB access to retrieve CRM statistics
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@xwjiang-ms how this trigger an alert? By syslog err? |
|
could you also check the UT failures? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
@StormLiangMS the result will be collected to results array and return -1 if results has any contents |
|
@xwjiang-ms could you check the PR failures? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azpw run |
|
/AzurePipelines run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
8302edc to
a09ed02
Compare
Signed-off-by: xiaweijiang <[email protected]>
Signed-off-by: xiaweijiang <[email protected]>
Signed-off-by: xiaweijiang <[email protected]>
Signed-off-by: xiaweijiang <[email protected]>
c9e32e6 to
1c143fd
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: xiaweijiang <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: xiaweijiang <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: xiaweijiang <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: xiaweijiang <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: xiaweijiang <[email protected]>
Signed-off-by: xiaweijiang <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
scripts/route_check.py
Outdated
| t2_miss = [] | ||
| t1_len = len(t1); | ||
| t2_len = len(t2); | ||
| t1_len = len(t1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xwjiang-ms why touch this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, this PR shouldn't contain the change
Signed-off-by: xiaweijiang <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
StormLiangMS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
How is this different from CRM monitoring reporting "threshold exceeded" error thats also spewed periodically? |
What I did
Previously, due to a bug in Broadcom SAI, the system incorrectly created next-hop groups with a size of 16.
This resulted in excessive next-hop group consumption, which eventually limited the number of available ECMP routes and caused traffic impact.
To prevent similar issues from going unnoticed, I added a next-hop group usage check to route_check.
If the next-hop group usage exceeds 80%, the script will report an error.
How I did it
Get CRM stats from counters DB, and get nexthop usage, then compare the usage with threshold, report error if usage exceeded threshold.
Fixed some format error.
How to verify it
Verified on lab device.
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)