
fix(diff): always resyncMonitors the first time after acquiring watch leader lease #404

Merged
merged 2 commits into kubewharf:main from diff-writer-reelect on Feb 17, 2025

Conversation

SOF3 (Member)
@SOF3 commented Feb 14, 2025

When the watch leader lease is acquired for the first time during a normal startup, discoveryResyncCh receives its initial signal from the first cluster discovery sync, so resyncMonitors runs exactly once, as expected.

However, when the watch leader lease is lost after a while and later re-acquired, discoveryResyncCh does not produce a new signal if the cluster is unchanged. The watcher is therefore stuck in its initial state with no monitors, and the workers never receive any events. When the same instance later acquires the diff writer lease, the diff workers still have no events to process, so the diff writer never writes any diff.

This is fixed by ensuring that resyncMonitors always runs at least once during each resyncMonitorsLoop call (which happens once per watch leader term). A minimal sketch of the idea is shown below.
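Below is a minimal sketch of the described fix, not the project's actual code: the loop performs one unconditional resyncMonitors at the start of every watch leader term, then keeps resyncing whenever discovery signals a change. The names resyncMonitorsLoop, resyncMonitors, and discoveryResyncCh come from the PR description; the package name, receiver type, and signatures are assumptions for illustration.

```go
package watch

import "context"

type watcher struct {
	// Signalled only when the cluster's discovered resource set changes,
	// so an unchanged cluster produces no new signals.
	discoveryResyncCh <-chan struct{}
}

// resyncMonitorsLoop runs once per watch leader term (each time the lease
// is acquired). Before the fix, it resynced only on discoveryResyncCh
// signals, so a re-acquired term with an unchanged cluster never started
// any monitors.
func (w *watcher) resyncMonitorsLoop(ctx context.Context) {
	// The fix: always resync once immediately, so a fresh term starts
	// with a full monitor set even if discovery has not changed.
	w.resyncMonitors(ctx)

	for {
		select {
		case <-ctx.Done():
			return
		case <-w.discoveryResyncCh:
			// Subsequent resyncs still follow cluster discovery changes.
			w.resyncMonitors(ctx)
		}
	}
}

func (w *watcher) resyncMonitors(ctx context.Context) {
	// Placeholder: in the real code this rebuilds the resource monitors
	// so that diff workers start receiving events again.
}
```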

@SOF3 added the bug (Something isn't working) label on Feb 14, 2025
@SOF3 requested a review from xuqingyun on February 14, 2025 10:10
@xuqingyun (Collaborator) left a comment

lgtm

@SOF3 force-pushed the diff-writer-reelect branch from 17756c9 to af57115 on February 17, 2025 02:06
@SOF3 added this pull request to the merge queue on Feb 17, 2025
Merged via the queue into kubewharf:main with commit fa7cef5 on Feb 17, 2025
11 checks passed