Skip to content

Conversation

@Pdoijode
Copy link
Contributor

This MR contains changes for EVPN GR and GR to support multihop BGP sessions. Please refer to individual commits.

Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graceful Restart RFC does not define this multi-level "deferring", it's a implementation-specific case, right?

return;

ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->mac_table, ctx->cnt, vpn, true);
ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->ip_table, ctx->cnt, vpn, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct really? Why do we "reset" cnt? Shouldn't it be a sum?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It normally would/should be ...

To support GR for L2VPN EVPN AFI SAFI in BGP, following changes were made:

1. If GR is enabled for a BGP instance, then GR select
deferral timer will be started for all GR supported AFI-SFIs after
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo => SFIs

@riw777 riw777 self-requested a review October 21, 2025 14:52
@Pdoijode
Copy link
Contributor Author

@ton31337 Graceful Restart RFC does not define multi-level "deferring", this is implementation-specific case.

@pbrisset pbrisset self-requested a review October 22, 2025 18:33
for (safi = SAFI_UNICAST; safi <= SAFI_MPLS_VPN; safi++)
#define FOREACH_AFI_SAFI_NSF(afi, safi) \
for (afi = AFI_IP; afi < AFI_MAX; afi++) \
for (safi = SAFI_UNICAST; safi <= SAFI_EVPN; safi++)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this not enable GR for ENCAP SAFI as well ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrin-g Mrinmoy, No, GR is now enabled for L2VPN EVPN AFI-SAFI in addition to v4 and v6 unicast.

Copy link

@mrin-g mrin-g Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what I meant was:

	lib/zebra.h:
	SAFI_MPLS_VPN = 3,
	SAFI_ENCAP = 4, <-- 
	SAFI_EVPN = 5,
	```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@Pdoijode Pdoijode force-pushed the pdoijode/evpn-gr-mh branch from d4e7b39 to 33c5a86 Compare October 27, 2025 20:08
@github-actions github-actions bot added the rebase PR needs rebase label Oct 27, 2025
Copy link
Member

@ton31337 ton31337 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Topotest is required to verify GR for EVPN...

Copy link
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small questions/thoughts ... there should be a topo test for this.

return;

ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->mac_table, ctx->cnt, vpn, true);
ctx->cnt = bgp_deferred_path_selection(bgp, afi, safi, vpn->ip_table, ctx->cnt, vpn, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It normally would/should be ...

if field_val == 1:
return "Check all EORs"
elif field_val == 2:
return "All dir conn EORs rcvd"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to abbreviate "dir", "con", and "rcd" here? Would expanding them make the output too long?

@Pdoijode
Copy link
Contributor Author

@ton31337 @riw777 I am working on adding topotest for l2vpn evpn GR and for GR with multihop BGP session

@Pdoijode Pdoijode force-pushed the pdoijode/evpn-gr-mh branch 2 times, most recently from b2c9d14 to 03de9ac Compare November 4, 2025 18:01
@github-actions github-actions bot removed the conflicts label Nov 4, 2025
GR handling for multihop BGP peers whose loopback is
learnt from another BGP peer.

Added a tier2 GR timer which is started when:
1. All the directly connected peers comeup for an AFI-SAFI or
2. Tier1 GR select deferral timer expires(due to all direcyly connected peers
   not coming up) and if multihop peer exists and it is activated for afi-safi

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added traces for graceful restart functionality.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
…2 is set

Problem:
Select deferral timer was running even after BGP
indicated BGP completion to zebra.

This happens when:
Multihop peers come up befor directly connected
peers, then the tier2 timer is not required and will
not be started.

Which means that select_defer_tier2_required variable will be
false (This variable is set to true only if tier2 timer is started).

If select_defer_tier2_required is false, then BGP in
bgp_gr_start_tier2_timer_if_required() will end up
starting the tier2 timer even when it's not required (not
required since EORs from all multihop peers were received).

Fix:
To prevent such a scenario, do not start tier2 timer in
bgp_gr_start_tier2_timer_if_required() if select_defer_over_tier2
is set to true.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added selectionDeferralTier2TimerRemaining and selectionDeferralTier2Timer
to show bgp neighbor <NBR ADDR> JSON output

show bgp neighbors 2.1.1.2 json
{
  "2.1.1.2":{
      "gracefulRestartInfo":{
      "endOfRibSend":{
      },
      "endOfRibRecv":{
      },
      "localGrMode":"Restart*",
      "remoteGrMode":"Restart",
      "rBit":false,
      "nBit":false,
      "timers":{
        "configuredRestartTimer":300,
        "receivedRestartTimer":300
      },
      "l2VpnEvpn":{
        "fBit":false,
        "endOfRibStatus":{
          "endOfRibSend":false,
          "endOfRibSentAfterUpdate":false,
          "endOfRibRecv":false
        },
        "timers":{
          "stalePathTimer":360,
          "selectionDeferralTimer":240,
          "selectionDeferralTier2Timer":240,
          "selectionDeferralTier2TimerRemaining":238
        }
      }
    },
  }
}

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 1, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 2, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 3, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 47, 'afi': 0, 'location': 'Process from GR queue'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 42, 'afi': 1, 'location': 'Stale route delete from table'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 42, 'afi': 2, 'location': 'Stale route delete from table'}
2024-04-17T19:31:20.423 frr_zebra:gr_client_not_found {'vrf_id': 42, 'afi': 3, 'location': 'Stale route delete from table'}

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Pdoijode and others added 26 commits November 4, 2025 22:52
Problem:
In case of BGP GR with Multihop(MH) peers, BGP would send EOR
after select-deferral-timers stopped/timed-out but before
running deferred bestpath selection for all the deferred paths.

Fix:
Send EOR to helpers (both directly connected and MH peers) only
after overall GR is complete.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Currently default time for GR select-deferral timer is 240secs.
Incase if some of the directly connected and multihop peers don't
comeup at all, then 240s is a lot of time to wait and this might
cause stale-path timer to time out in multihop cases.

Reducing this timer's default value to 120 seconds

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Enable graceful restart support for L2VPN EVPN AFI SAFI

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Optimize BGP GR code which evaluates if BGP has any
multihop peers with GR supported AFI-SAFI.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added a check to not communicate UPDATE_PENDING/UPDATE_COMPLETE
for L2VPN EVPN afi-safi in non-default VRF.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
In order to support GR for L2VPN EVPN AFI SAFI,
All L2VPN route types must be marked as deferred in default
VRF, imported into destination VRF/VNI table and then
marked as deferred in destination VRF/VNI as well.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
This commit contains changes required to do the deferred
best path selection in

1. Default VRF for l2vpn evpn table
2. VNI IP and MAC tables
3. Destination VRF

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
On GR helper, when the peer that has GR enabled goes down,
BGP will mark all the L2VPN EVPN routes in default VRF as stale.

If the peer doesn't comeup before GR restart timer expires, then
the L2VPN EVPN routes in default VRF that were previously marked
stale needs to be unimported from all the VRFs and VNIs (IP and MAC
table) before deleting it in L2VPN EVPN table in default VRF.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Read the remote VTEP, remote MAC and remote rmac entries from kernel
and store the remote VTEP in vtep list, remote MAC in L2VNI mac table
and rmac in L3VNI mac table, if zebra is starting gracefully.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Read the remote neigh entries from kernel and store it in L2VNI or L3VNI
neigh table, if zebra is starting gracefully.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Introduced gr_refresh_time field for HREP entries. This new field in
zebra_vtep will be used to store the timestamp at which this entry
was either created or refreshed. The timestamp stored in this field
will determine if the HREP entry is stale or not.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
During stale cleanup, zebra uses restart_time which is of type uint64_t to
determine if the entries are stale or not. Since we are adding support to
clean up the stale mac entries after GR completion, timestamp recorded in
gr_refresh_time field in zebra_mac will be used to determine if a remote MAC
entry is stale or not.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
During stale cleanup, zebra uses restart_time which is of type uint64_t to
determine if the entries are stale or not. Since we are adding support to
clean up the stale neigh entries after GR completion, a new field named
gr_refresh_time has been introduced to store the timestamp at which the
neigh entry was created or refreshed.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>

.
…VTEPs

After GR is complete for all AFI SAFIs in all GR enabled VRFs, call
the evpn cleanup function which goes through all the L2VNIs and L3VNIs
and cleans up the stale neighs, MACs, RMACs and remote VTEPs.

Stale entries are those entries whose gr_refresh_time is less than the
gr_cleanup_time. I,e.,. BGP failed to refresh or redownload these entries
to zebra. Such entries need to be cleanup from bridge FDB and neigh table
in kernel.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Changes:
    1. Since L2VPN EVPN prefixes cannot be used to form a BGP
       session, BGP must wait to receive EORs
       from all multihop and directly connected peer before
       doing deferred bestpath selection
    2. Return if BGP has already done 1st round of deferred
       bestpath selection (Either because of tier1
       select-deferral-timeout or EORs were received from all
       the directly connected peers) and if it is still waiting
       for all multihop bgp peers to come up.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Added lttng traces in zebra GR code

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
There could be some VRFs/L3VNIs that do not have GR enabled/configured.
But if L2vpn EVPN type5 prefixes were imported from default VRF into the
destination VRF that is not GR enabled, then zebra needs to cleanup
the unicast routes from such VRFs as well.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Problem:
Although BGP downloaded the remote MAC to zebra after GR, zebra ends
up deleting these remote MACs. This is because, if update_mac's
value is false, mac's gr_refresh_time was not being recorded.
This caused zebra to think that the remote MAC is stale!
Thus zebra ended up deleting the remote MACs. Same problem
could occur for neighs.

Fix:
1. Update MAC's gr_refresh_time in
zebra_evpn_mac_remote_macip_add()

2. Update NEIGH's gr_refresh_time in
zebra_evpn_neigh_remote_macip_add()

3. Record gr_refresh_time for mac in zebra_evpn_mac_gw_macip_add()
   for both if and else case

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Problem:
With graceful-restart enabled only in default VRF, leaked routes were being
deleted after GR completion. This was happening because, BGP was
installing imported routes into non-default VRF at time T1. Later at
time T2, client->restart_time was being recorded when UPDATE_PENING was
rcvd from BGP. Now, at the time of GR stale route deletion, since T1 is
less than T2, the imported routes were being deleted by zebra from
kernel.

Fix:
Record the restart time once when BGP sends GR enable notification to
zebra and removed code to record the restart_time when BGP sends
UPDATE_PENDING.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
After fast stop of FRR on dut and shutting down the sessions to DUT
on all other peers, and fast start of FRR, stale EVPN entries
(remote MAC, NEIGH and HREP) entries weren't getting deleted from
kernel.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Consider a case where default vrf is EVPN enabled and there are some
tenant VRFs with no peers which imports some routes from default VRF.

With the current code for EVPN GR, BGP declares GR completion
after the default VRF is complete and not wait for other VRFs to complete
 its deferral path selection.

Reason for this is because of the following sequence of events,
 a) Since there are no peers in tenant VRFs(non default VRFs), we
    prematurely send UPDATE_PENDING + UPDATE_COMPLETE in
    peer_unshut_after_cfg().
 b) When bgp_do_deferred_path_selection() is invoked for default VRF, it
    completes the deferred path calculation and calls
    bgp_evpn_handle_deferred_bestpath_for_vrfs() to enqueue the deferred
    path selection for non default VRFs.
 c) It then(for default VRF) sends UPDATE_COMPLETE followed by GR done
    for all instances from BGP pov because the deferral related variables
    are not set appropriately for non default VRFs (especially
    gr_route_sync_pending)

Fixing the above by setting deferral related variables for non default
VRFs appropriately and sending UPDATE_PENDING accordingly.

Signed-off-by: Rajasekar Raja <[email protected]>
Signed-off-by: Chirag Shah <[email protected]>
Signed-off-by: Vijayalaxmi Basavaraj <[email protected]>
Added comments and trace in GR code for
more clarity and easy debugging

Signed-off-by: Rajasekar Raja <[email protected]>
Signed-off-by: Chirag Shah <[email protected]>
Added a check to not communicate UPDATE_PENDING/UPDATE_COMPLETE
for L2VPN EVPN afi-safi in non-default VRF.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
This topotest has tests to test GR functionality when a mix of
directly connected and multihop peer are present on DUT.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Problem:
if -K option is not specified then BGP was not starting the select
deferral timer for any of the GR supported afi-safis though
startup timer was running.

Fix:
Modified the check to skip starting select-deferral timer
only if -K option is not specified in daemons file and t_startup
timer is not running.

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
Error:

    ERROR: AddressSanitizer: heap-use-after-free on address 0x6070000ef8a0 at pc 0x555df66ba094 bp 0x7ffc13d67240 sp 0x7ffc13d67238
    READ of size 4 at 0x6070000ef8a0 thread T0
        #0 0x555df66ba093 in zebra_gr_delete_stale_route_table_afi zebra/zebra_gr.c:514
        FRRouting#1 0x7fd33d6db06e in event_call lib/event.c:2013
        FRRouting#2 0x7fd33d5fffa1 in frr_run lib/libfrr.c:1257
        FRRouting#3 0x555df66531ec in main zebra/main.c:552
        FRRouting#4 0x7fd33d10c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
        FRRouting#5 0x7fd33d10c304 in __libc_start_main_impl ../csu/libc-start.c:360
        FRRouting#6 0x555df6626b40 in _start (/usr/lib/frr/zebra+0x1a1b40)

    0x6070000ef8a0 is located 0 bytes inside of 72-byte region [0x6070000ef8a0,0x6070000ef8e8)
    freed by thread T0 here:
        #0 0x7fd33dab76a8 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:52
        FRRouting#1 0x7fd33d622cd5 in qfree lib/memory.c:136
        FRRouting#2 0x555df66b9e5f in zebra_gr_client_info_delete zebra/zebra_gr.c:130
        FRRouting#3 0x555df66bc66f in zread_client_capabilities zebra/zebra_gr.c:355
        FRRouting#4 0x555df66a025c in zserv_handle_commands zebra/zapi_msg.c:4228
        FRRouting#5 0x555df67cde33 in zserv_process_messages zebra/zserv.c:565
        FRRouting#6 0x7fd33d6db06e in event_call lib/event.c:2013
        FRRouting#7 0x7fd33d5fffa1 in frr_run lib/libfrr.c:1257
        FRRouting#8 0x555df66531ec in main zebra/main.c:552
        FRRouting#9 0x7fd33d10c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

    previously allocated by thread T0 here:
        #0 0x7fd33dab83b7 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:77
        FRRouting#1 0x7fd33d6223e2 in qcalloc lib/memory.c:111
        FRRouting#2 0x555df66bbace in zebra_gr_client_info_create zebra/zebra_gr.c:101
        FRRouting#3 0x555df66bbace in zread_client_capabilities zebra/zebra_gr.c:360
        FRRouting#4 0x555df66a025c in zserv_handle_commands zebra/zapi_msg.c:4228
        FRRouting#5 0x555df67cde33 in zserv_process_messages zebra/zserv.c:565
        FRRouting#6 0x7fd33d6db06e in event_call lib/event.c:2013
        FRRouting#7 0x7fd33d5fffa1 in frr_run lib/libfrr.c:1257
        FRRouting#8 0x555df66531ec in main zebra/main.c:552
        FRRouting#9 0x7fd33d10c249 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Signed-off-by: Pooja Jagadeesh Doijode <[email protected]>
@Pdoijode Pdoijode force-pushed the pdoijode/evpn-gr-mh branch from 67c4858 to 89163f8 Compare November 4, 2025 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants