-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Graceful restart for EVPN #19778
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Graceful restart for EVPN #19778
Conversation
f6c3dab to
0c01fd8
Compare
ton31337
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graceful Restart RFC does not define this multi-level "deferring", it's a implementation-specific case, right?
| return; | ||
|
|
||
| ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->mac_table, ctx->cnt, vpn, true); | ||
| ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->ip_table, ctx->cnt, vpn, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct really? Why do we "reset" cnt? Shouldn't it be a sum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It normally would/should be ...
| To support GR for L2VPN EVPN AFI SAFI in BGP, following changes were made: | ||
|
|
||
| 1. If GR is enabled for a BGP instance, then GR select | ||
| deferral timer will be started for all GR supported AFI-SFIs after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo => SFIs
|
@ton31337 Graceful Restart RFC does not define multi-level "deferring", this is implementation-specific case. |
| for (safi = SAFI_UNICAST; safi <= SAFI_MPLS_VPN; safi++) | ||
| #define FOREACH_AFI_SAFI_NSF(afi, safi) \ | ||
| for (afi = AFI_IP; afi < AFI_MAX; afi++) \ | ||
| for (safi = SAFI_UNICAST; safi <= SAFI_EVPN; safi++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this not enable GR for ENCAP SAFI as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrin-g Mrinmoy, No, GR is now enabled for L2VPN EVPN AFI-SAFI in addition to v4 and v6 unicast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what I meant was:
lib/zebra.h:
SAFI_MPLS_VPN = 3,
SAFI_ENCAP = 4, <--
SAFI_EVPN = 5,
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrin-g I think there is another function controlling this:
7e2045c#diff-f7b4da9b53d00be700be61d8bbd3d0113a456949a67b22b566abc6cd089657c5R3130
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
d4e7b39 to
33c5a86
Compare
ton31337
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Topotest is required to verify GR for EVPN...
riw777
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small questions/thoughts ... there should be a topo test for this.
| return; | ||
|
|
||
| ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->mac_table, ctx->cnt, vpn, true); | ||
| ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->ip_table, ctx->cnt, vpn, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It normally would/should be ...
| if field_val == 1: | ||
| return "Check all EORs" | ||
| elif field_val == 2: | ||
| return "All dir conn EORs rcvd" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to abbreviate "dir", "con", and "rcd" here? Would expanding them make the output too long?
b2c9d14 to
03de9ac
Compare
GR handling for multihop BGP peers whose loopback is learnt from another BGP peer. Added a tier2 GR timer which is started when: 1. All the directly connected peers comeup for an AFI-SAFI or 2. Tier1 GR select deferral timer expires(due to all direcyly connected peers not coming up) and if multihop peer exists and it is activated for afi-safi Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added traces for graceful restart functionality. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
…2 is set Problem: Select deferral timer was running even after BGP indicated BGP completion to zebra. This happens when: Multihop peers come up befor directly connected peers, then the tier2 timer is not required and will not be started. Which means that select_defer_tier2_required variable will be false (This variable is set to true only if tier2 timer is started). If select_defer_tier2_required is false, then BGP in bgp_gr_start_tier2_timer_if_required() will end up starting the tier2 timer even when it's not required (not required since EORs from all multihop peers were received). Fix: To prevent such a scenario, do not start tier2 timer in bgp_gr_start_tier2_timer_if_required() if select_defer_over_tier2 is set to true. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added selectionDeferralTier2TimerRemaining and selectionDeferralTier2Timer
to show bgp neighbor <NBR ADDR> JSON output
show bgp neighbors 2.1.1.2 json
{
"2.1.1.2":{
"gracefulRestartInfo":{
"endOfRibSend":{
},
"endOfRibRecv":{
},
"localGrMode":"Restart*",
"remoteGrMode":"Restart",
"rBit":false,
"nBit":false,
"timers":{
"configuredRestartTimer":300,
"receivedRestartTimer":300
},
"l2VpnEvpn":{
"fBit":false,
"endOfRibStatus":{
"endOfRibSend":false,
"endOfRibSentAfterUpdate":false,
"endOfRibRecv":false
},
"timers":{
"stalePathTimer":360,
"selectionDeferralTimer":240,
"selectionDeferralTier2Timer":240,
"selectionDeferralTier2TimerRemaining":238
}
}
},
}
}
Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 1, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 2, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 3, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 0, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 42, 'afi': 1, 'location': 'Stale route delete from table'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 42, 'afi': 2, 'location': 'Stale route delete from table'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 42, 'afi': 3, 'location': 'Stale route delete from table'}
Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Problem: In case of BGP GR with Multihop(MH) peers, BGP would send EOR after select-deferral-timers stopped/timed-out but before running deferred bestpath selection for all the deferred paths. Fix: Send EOR to helpers (both directly connected and MH peers) only after overall GR is complete. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Currently default time for GR select-deferral timer is 240secs. Incase if some of the directly connected and multihop peers don't comeup at all, then 240s is a lot of time to wait and this might cause stale-path timer to time out in multihop cases. Reducing this timer's default value to 120 seconds Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Enable graceful restart support for L2VPN EVPN AFI SAFI Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Optimize BGP GR code which evaluates if BGP has any multihop peers with GR supported AFI-SAFI. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added a check to not communicate UPDATE_PENDING/UPDATE_COMPLETE for L2VPN EVPN afi-safi in non-default VRF. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
In order to support GR for L2VPN EVPN AFI SAFI, All L2VPN route types must be marked as deferred in default VRF, imported into destination VRF/VNI table and then marked as deferred in destination VRF/VNI as well. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
This commit contains changes required to do the deferred best path selection in 1. Default VRF for l2vpn evpn table 2. VNI IP and MAC tables 3. Destination VRF Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
On GR helper, when the peer that has GR enabled goes down, BGP will mark all the L2VPN EVPN routes in default VRF as stale. If the peer doesn't comeup before GR restart timer expires, then the L2VPN EVPN routes in default VRF that were previously marked stale needs to be unimported from all the VRFs and VNIs (IP and MAC table) before deleting it in L2VPN EVPN table in default VRF. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Read the remote VTEP, remote MAC and remote rmac entries from kernel and store the remote VTEP in vtep list, remote MAC in L2VNI mac table and rmac in L3VNI mac table, if zebra is starting gracefully. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Read the remote neigh entries from kernel and store it in L2VNI or L3VNI neigh table, if zebra is starting gracefully. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Introduced gr_refresh_time field for HREP entries. This new field in zebra_vtep will be used to store the timestamp at which this entry was either created or refreshed. The timestamp stored in this field will determine if the HREP entry is stale or not. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
During stale cleanup, zebra uses restart_time which is of type uint64_t to determine if the entries are stale or not. Since we are adding support to clean up the stale mac entries after GR completion, timestamp recorded in gr_refresh_time field in zebra_mac will be used to determine if a remote MAC entry is stale or not. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
During stale cleanup, zebra uses restart_time which is of type uint64_t to determine if the entries are stale or not. Since we are adding support to clean up the stale neigh entries after GR completion, a new field named gr_refresh_time has been introduced to store the timestamp at which the neigh entry was created or refreshed. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]> .
…VTEPs After GR is complete for all AFI SAFIs in all GR enabled VRFs, call the evpn cleanup function which goes through all the L2VNIs and L3VNIs and cleans up the stale neighs, MACs, RMACs and remote VTEPs. Stale entries are those entries whose gr_refresh_time is less than the gr_cleanup_time. I,e.,. BGP failed to refresh or redownload these entries to zebra. Such entries need to be cleanup from bridge FDB and neigh table in kernel. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Changes:
1. Since L2VPN EVPN prefixes cannot be used to form a BGP
session, BGP must wait to receive EORs
from all multihop and directly connected peer before
doing deferred bestpath selection
2. Return if BGP has already done 1st round of deferred
bestpath selection (Either because of tier1
select-deferral-timeout or EORs were received from all
the directly connected peers) and if it is still waiting
for all multihop bgp peers to come up.
Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added lttng traces in zebra GR code Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
There could be some VRFs/L3VNIs that do not have GR enabled/configured. But if L2vpn EVPN type5 prefixes were imported from default VRF into the destination VRF that is not GR enabled, then zebra needs to cleanup the unicast routes from such VRFs as well. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Problem: Although BGP downloaded the remote MAC to zebra after GR, zebra ends up deleting these remote MACs. This is because, if update_mac's value is false, mac's gr_refresh_time was not being recorded. This caused zebra to think that the remote MAC is stale! Thus zebra ended up deleting the remote MACs. Same problem could occur for neighs. Fix: 1. Update MAC's gr_refresh_time in zebra_evpn_mac_remote_macip_add() 2. Update NEIGH's gr_refresh_time in zebra_evpn_neigh_remote_macip_add() 3. Record gr_refresh_time for mac in zebra_evpn_mac_gw_macip_add() for both if and else case Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Problem: With graceful-restart enabled only in default VRF, leaked routes were being deleted after GR completion. This was happening because, BGP was installing imported routes into non-default VRF at time T1. Later at time T2, client->restart_time was being recorded when UPDATE_PENING was rcvd from BGP. Now, at the time of GR stale route deletion, since T1 is less than T2, the imported routes were being deleted by zebra from kernel. Fix: Record the restart time once when BGP sends GR enable notification to zebra and removed code to record the restart_time when BGP sends UPDATE_PENDING. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
After fast stop of FRR on dut and shutting down the sessions to DUT on all other peers, and fast start of FRR, stale EVPN entries (remote MAC, NEIGH and HREP) entries weren't getting deleted from kernel. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Consider a case where default vrf is EVPN enabled and there are some
tenant VRFs with no peers which imports some routes from default VRF.
With the current code for EVPN GR, BGP declares GR completion
after the default VRF is complete and not wait for other VRFs to complete
its deferral path selection.
Reason for this is because of the following sequence of events,
a) Since there are no peers in tenant VRFs(non default VRFs), we
prematurely send UPDATE_PENDING + UPDATE_COMPLETE in
peer_unshut_after_cfg().
b) When bgp_do_deferred_path_selection() is invoked for default VRF, it
completes the deferred path calculation and calls
bgp_evpn_handle_deferred_bestpath_for_vrfs() to enqueue the deferred
path selection for non default VRFs.
c) It then(for default VRF) sends UPDATE_COMPLETE followed by GR done
for all instances from BGP pov because the deferral related variables
are not set appropriately for non default VRFs (especially
gr_route_sync_pending)
Fixing the above by setting deferral related variables for non default
VRFs appropriately and sending UPDATE_PENDING accordingly.
Signed-off-by: Rajasekar Raja <[email protected]>
Signed-off-by: Chirag Shah <[email protected]>
Signed-off-by: Vijayalaxmi Basavaraj <[email protected]>
Added comments and trace in GR code for more clarity and easy debugging Signed-off-by: Rajasekar Raja <[email protected]> Signed-off-by: Chirag Shah <[email protected]>
Added a check to not communicate UPDATE_PENDING/UPDATE_COMPLETE for L2VPN EVPN afi-safi in non-default VRF. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
This topotest has tests to test GR functionality when a mix of directly connected and multihop peer are present on DUT. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Problem: if -K option is not specified then BGP was not starting the select deferral timer for any of the GR supported afi-safis though startup timer was running. Fix: Modified the check to skip starting select-deferral timer only if -K option is not specified in daemons file and t_startup timer is not running. Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Error:
ERROR: AddressSanitizer: heap-use-after-free on address 0x6070000ef8a0 at pc 0x555df66ba094 bp 0x7ffc13d67240 sp 0x7ffc13d67238
READ of size 4 at 0x6070000ef8a0 thread T0
#0 0x555df66ba093 in zebra_gr_delete_stale_route_table_afi zebra/zebra_gr.c:514
FRRouting#1 0x7fd33d6db06e in event_call lib/event.c:2013
FRRouting#2 0x7fd33d5fffa1 in frr_run lib/libfrr.c:1257
FRRouting#3 0x555df66531ec in main zebra/main.c:552
FRRouting#4 0x7fd33d10c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
FRRouting#5 0x7fd33d10c304 in __libc_start_main_impl ../csu/libc-start.c:360
FRRouting#6 0x555df6626b40 in _start (/usr/lib/frr/zebra+0x1a1b40)
0x6070000ef8a0 is located 0 bytes inside of 72-byte region [0x6070000ef8a0,0x6070000ef8e8)
freed by thread T0 here:
#0 0x7fd33dab76a8 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:52
FRRouting#1 0x7fd33d622cd5 in qfree lib/memory.c:136
FRRouting#2 0x555df66b9e5f in zebra_gr_client_info_delete zebra/zebra_gr.c:130
FRRouting#3 0x555df66bc66f in zread_client_capabilities zebra/zebra_gr.c:355
FRRouting#4 0x555df66a025c in zserv_handle_commands zebra/zapi_msg.c:4228
FRRouting#5 0x555df67cde33 in zserv_process_messages zebra/zserv.c:565
FRRouting#6 0x7fd33d6db06e in event_call lib/event.c:2013
FRRouting#7 0x7fd33d5fffa1 in frr_run lib/libfrr.c:1257
FRRouting#8 0x555df66531ec in main zebra/main.c:552
FRRouting#9 0x7fd33d10c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
previously allocated by thread T0 here:
#0 0x7fd33dab83b7 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:77
FRRouting#1 0x7fd33d6223e2 in qcalloc lib/memory.c:111
FRRouting#2 0x555df66bbace in zebra_gr_client_info_create zebra/zebra_gr.c:101
FRRouting#3 0x555df66bbace in zread_client_capabilities zebra/zebra_gr.c:360
FRRouting#4 0x555df66a025c in zserv_handle_commands zebra/zapi_msg.c:4228
FRRouting#5 0x555df67cde33 in zserv_process_messages zebra/zserv.c:565
FRRouting#6 0x7fd33d6db06e in event_call lib/event.c:2013
FRRouting#7 0x7fd33d5fffa1 in frr_run lib/libfrr.c:1257
FRRouting#8 0x555df66531ec in main zebra/main.c:552
FRRouting#9 0x7fd33d10c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
67c4858 to
89163f8
Compare
This MR contains changes for EVPN GR and GR to support multihop BGP sessions. Please refer to individual commits.