Routes are being withdrawn from kernel after session flap with 'clear bgp ...' #18240

Open
2 tasks done
dawkop opened this issue Feb 24, 2025 · 0 comments
Labels
triage Needs further investigation

Comments

dawkop commented Feb 24, 2025

Description

When one FRR box is peered with 3 route servers that advertise the same EVPN routes, clearing the session with one of them causes all routes to be withdrawn from the kernel immediately and re-added only after the session comes back up. This causes downtime when there is a significant number of routes to process.

Version

My reproduction uses 10.1.1, however the same happens on 10.3-dev

3d4364ce7392# show ver
FRRouting 10.1.1 (3d4364ce7392) on Linux(5.15.0-127-generic).

How to reproduce

I have provided instructions in my public repo here; just follow README.md in the session_flap directory.

Expected behavior

This is a question for FRR maintainers with more expertise: is this the expected behavior? My initial suspicion was that only routes marked as best path are withdrawn, but that is not the case.

Actual behavior

Routes are withdrawn from the kernel after a session clear, even though we still receive the same routes from at least 1 additional peer.
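
To make the expectation concrete, here is a minimal toy model (not FRR code) of the scenario: three peers advertise identical prefixes and one session is cleared. The peer names `rs1` to `rs3`, the prefixes, and both helper functions are hypothetical and exist only for illustration. Under these assumptions, no prefix loses its last path, so nothing should strictly need to leave the kernel.

```python
from collections import defaultdict


def build_rib(advertisements):
    """Map each prefix to the set of peers currently advertising it."""
    rib = defaultdict(set)
    for peer, prefixes in advertisements.items():
        for prefix in prefixes:
            rib[prefix].add(peer)
    return rib


def clear_session(rib, peer):
    """Drop all paths learned from `peer`; return prefixes left with no path."""
    unreachable = set()
    for prefix, peers in rib.items():
        peers.discard(peer)
        if not peers:
            unreachable.add(prefix)
    return unreachable


# Hypothetical data: three route servers advertising the same two prefixes.
routes = {"10.0.0.0/24", "10.0.1.0/24"}
rib = build_rib({"rs1": routes, "rs2": routes, "rs3": routes})

lost = clear_session(rib, "rs1")
print(lost)  # set(): every prefix is still reachable via rs2 and rs3
```

In this model a withdrawal would only be justified for prefixes returned by `clear_session`, which is the behavior seen in 8.4.2 according to the report.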

Additional context

I did some investigation to find the PR after which it started to behave this way (in FRR 8.4.2 this was not the case: when we cleared one out of 3 sessions, the routes were not withdrawn from the kernel). The PR after which it became noticeable was this one. Specifically, this line causes the immediate withdrawal. I am not sure whether that is expected (according to the comment, this withdrawal was added 7 years ago; I do not know what the circumstances were, so this would need a comment from someone with more knowledge).

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@dawkop dawkop added the triage Needs further investigation label Feb 24, 2025
dawkopagh added a commit to dawkopagh/frr that referenced this issue Mar 12, 2025
Check rmac and nh of new bestpath for evpn imported prefix before withdrawal. Currently when
a new bestpath is designated, evpn imported routes are withdrawn from the kernel, causing
downtime whose length depends on the number of routes to process.

Let's check the rmac entry and nh of the newly selected bestpath and not actually withdraw the routes from the kernel if
those two are the same. This fixes an issue where the same routes are advertised by multiple
peers and we clear the session with one of them FRRouting#18240

Signed-off-by: Dawid Kopec <dkopec@akamai.com>
dawkopagh added a commit to dawkopagh/frr that referenced this issue Mar 12, 2025
Check rmac and nh of new bestpath for evpn imported prefix before withdrawal. Currently when
a new bestpath is designated, evpn imported routes are withdrawn from the kernel, causing
downtime whose length depends on the number of routes to process.

Basically, in a setup where multiple peers advertise identical evpn type-5 routes,
if we clear the session with any of them (and at least one is still up), we observe
route withdrawals from the kernel (even though we receive those routes from the remaining peers).
If the number of routes is significant, it causes a noticeable downtime before the routes are
re-added to the kernel.

Also, the mentioned behavior (where routes are withdrawn immediately
even if other peers advertise them) started to occur after
backpressure bgp zebra client FRRouting#15524

Let's check the rmac entry and nh of the newly selected bestpath and not actually withdraw the routes from the kernel if
those two are the same. This fixes an issue where the same routes are advertised by multiple
peers and we clear the session with one of them FRRouting#18240

Signed-off-by: Dawid Kopec <dkopec@akamai.com>
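
The check described in the commit message can be sketched roughly as follows. This is only an illustration of the idea (compare the router MAC and nexthop of the old and new bestpath, and skip the kernel withdrawal when both match), not the actual FRR implementation. The `Path` type, its field names, and `needs_kernel_withdraw` are hypothetical, as are the addresses in the usage lines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Path:
    """Hypothetical stand-in for a BGP path imported from EVPN."""
    peer: str
    nexthop: str
    rmac: str


def needs_kernel_withdraw(old_best: Path, new_best: Optional[Path]) -> bool:
    """Return True only if the kernel entry really has to change.

    Assumption: the forwarding state for an imported EVPN prefix is fully
    described by its nexthop and router MAC, so a new bestpath with the same
    pair can reuse the installed route.
    """
    if new_best is None:
        # No remaining path at all: the route must be withdrawn.
        return True
    return (old_best.nexthop, old_best.rmac) != (new_best.nexthop, new_best.rmac)


# Hypothetical paths: same route learned from two route servers.
via_rs1 = Path(peer="rs1", nexthop="192.0.2.1", rmac="00:00:5e:00:53:01")
via_rs2 = Path(peer="rs2", nexthop="192.0.2.1", rmac="00:00:5e:00:53:01")
moved = Path(peer="rs2", nexthop="192.0.2.9", rmac="00:00:5e:00:53:09")

print(needs_kernel_withdraw(via_rs1, via_rs2))  # False
print(needs_kernel_withdraw(via_rs1, moved))    # True
print(needs_kernel_withdraw(via_rs1, None))     # True
```

The peer is deliberately left out of the comparison: in the reported setup only the advertising peer changes, while the forwarding information stays the same.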