@xwjiang-ms
Contributor

@xwjiang-ms xwjiang-ms commented Nov 27, 2025

What I did

Previously, due to a bug in Broadcom SAI, the system incorrectly created next-hop groups with a size of 16.
This resulted in excessive next-hop group consumption, which eventually limited the number of available ECMP routes and caused traffic impact.
To prevent similar issues from going unnoticed, I added a next-hop group usage check to route_check.
If the next-hop group usage exceeds 80%, the script will report an error.

How I did it

Get the CRM stats from COUNTERS_DB, compute the next-hop group usage, compare it with the threshold, and report an error if the usage exceeds the threshold.
Also fixed some formatting errors.
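
A minimal sketch of the comparison described above, not the code from this PR. It assumes the CRM stats have already been read from COUNTERS_DB into a plain dict, and that the used/available counters are stored under the field names shown; those field names, the helper names, and the shape of the returned error are assumptions for illustration.

```python
# Sketch only: field and helper names are assumptions, not taken from the PR diff.
NHG_USAGE_THRESHOLD_PCT = 80  # threshold stated in the PR description

# Assumed CRM counter field names for next-hop groups.
USED_FIELD = "crm_stats_nexthop_group_used"
AVAILABLE_FIELD = "crm_stats_nexthop_group_available"


def nexthop_group_usage_pct(crm_stats):
    """Return next-hop group usage as a percentage, or None if it cannot be computed."""
    try:
        used = int(crm_stats[USED_FIELD])
        available = int(crm_stats[AVAILABLE_FIELD])
    except (KeyError, ValueError, TypeError):
        return None  # counters missing or malformed
    total = used + available
    if total <= 0:
        return None  # avoid dividing by zero when nothing is reported
    return used * 100.0 / total


def check_nexthop_group_usage(crm_stats, threshold_pct=NHG_USAGE_THRESHOLD_PCT):
    """Return an error dict when usage exceeds the threshold, otherwise None."""
    usage = nexthop_group_usage_pct(crm_stats)
    if usage is None or usage <= threshold_pct:
        return None
    return {"nexthop_group_usage_pct": round(usage, 2), "threshold_pct": threshold_pct}
```

In this sketch, usage is computed as used / (used + available), and a usage of exactly 80% does not count as exceeding the threshold. Missing counters are treated as "nothing to report" rather than as an error; the actual script may choose differently.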

How to verify it

Verified on lab device.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Copilot AI review requested due to automatic review settings November 27, 2025 03:47
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot finished reviewing on behalf of xwjiang-ms November 27, 2025 03:50
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a nexthop group usage monitoring feature to the route_check script to prevent resource exhaustion issues. The implementation retrieves CRM (Critical Resource Monitoring) statistics from the COUNTERS_DB and alerts when nexthop group usage exceeds 80%, helping detect similar issues that previously caused traffic impact due to excessive nexthop group consumption.

Key Changes:

  • Added CRM-based nexthop group usage monitoring with an 80% threshold check
  • Integrated the check into the existing route validation flow in check_routes_for_namespace()
  • Added support for COUNTERS_DB access to retrieve CRM statistics

@StormLiangMS
Contributor

@xwjiang-ms how does this trigger an alert? By syslog err?

@StormLiangMS
Contributor

Could you also check the UT failures?

@xwjiang-ms
Contributor Author

> @xwjiang-ms how does this trigger an alert? By syslog err?

@StormLiangMS The result is collected into the results array, and -1 is returned if results has any contents.
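
A rough sketch of the reporting pattern described in the reply above, not the actual route_check code. The `run_checks` helper, the dict used as the results container, and the return value of 0 when nothing is found are assumptions for illustration; only "collect into results, return -1 if results is non-empty" comes from the comment.

```python
# Sketch only: run_checks and the check names are hypothetical.
def run_checks(checks):
    """Run each check, collect non-empty findings, and derive the return code.

    `checks` maps a check name to a zero-argument callable that returns a
    finding (any truthy value) or None when there is nothing to report.
    """
    results = {}
    for name, check in checks.items():
        finding = check()
        if finding:
            results[name] = finding
    # -1 signals that at least one check reported something.
    ret = -1 if results else 0
    return ret, results


if __name__ == "__main__":
    ret, results = run_checks({
        "nexthop_group_usage": lambda: {"usage_pct": 90.0, "threshold_pct": 80},
    })
    print(ret, results)
```

With this shape, the caller only needs to look at the return code to know that something was reported, and at `results` to see what.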

@StormLiangMS
Contributor

@xwjiang-ms could you check the PR failures?

@xwjiang-ms
Contributor Author

/azpw run

@xwjiang-ms xwjiang-ms force-pushed the add_next_hop_threshold_check branch from 8302edc to a09ed02 Compare December 1, 2025 23:18
Signed-off-by: xiaweijiang <[email protected]>
@xwjiang-ms xwjiang-ms force-pushed the add_next_hop_threshold_check branch from c9e32e6 to 1c143fd Compare December 2, 2025 01:42

 t2_miss = []
-t1_len = len(t1);
-t2_len = len(t2);
+t1_len = len(t1)
Contributor

@xwjiang-ms why touch this?

Contributor Author

Understood. This PR shouldn't contain this change.

Contributor

@StormLiangMS StormLiangMS left a comment

LGTM

@prsunny
Contributor

prsunny commented Dec 2, 2025

How is this different from the CRM monitoring "threshold exceeded" error that's also spewed periodically?

@sonic-net sonic-net deleted a comment from StormLiangMS Dec 3, 2025

4 participants