Describe the bug
On an indexing node, we observed that merges stopped and the merge backlog started growing indefinitely. Restarting the node fixed it.
Likely source for the bug
I suspect the issue lies in the MergePlanner mailbox, which is still referenced by the Publisher after the MergePlanner exits. The mailbox gets full (capacity 1) and thus blocks the Publisher when it calls ctx.send_message(source_mailbox, NewSplit). This bug happened only once, so it likely requires a very rare conjunction of events (e.g. the IndexerPublisher and the MergePublisher sending a NewSplit to the MergePlanner simultaneously while the latter is shutting down).
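For illustration, here is a minimal sketch of the suspected hang using a plain tokio bounded channel (not Quickwit's actual actor mailbox types): the consumer stops draining without dropping its receiving end, so the capacity-1 queue fills up and the next send awaits forever instead of returning an error.

```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, timeout, Duration};

#[derive(Debug)]
struct NewSplit; // stand-in for the real NewSplit message

#[tokio::main]
async fn main() {
    // Capacity 1, like the MergePlanner mailbox in the hypothesis above.
    let (tx, mut rx) = mpsc::channel::<NewSplit>(1);

    // "MergePlanner": handles one message, then stops consuming while still
    // owning the receiver, mimicking a mailbox that is never detached.
    let planner = tokio::spawn(async move {
        let _ = rx.recv().await;
        sleep(Duration::from_secs(3600)).await; // rx stays alive but is never polled again
    });

    // Two publishers each push a NewSplit around shutdown time.
    tx.send(NewSplit).await.unwrap(); // consumed by the planner
    tx.send(NewSplit).await.unwrap(); // fills the queue

    // The next send can never complete: the queue is full and nobody drains it.
    let third_send = timeout(Duration::from_secs(1), tx.send(NewSplit)).await;
    assert!(third_send.is_err(), "the sender is blocked, matching the suspected hang");

    planner.abort();
}
```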
Detailed observations:
Couldn't reproduce, but here are the observations made from the logs:
a lot of nodes are leaving and joining the cluster, and because the source is ingest V2, this triggers pipeline shutdowns caused by shard rebalancing
the pending merge operations metric grows steadily
the active merge operation metric is stuck at 3 (the maximum number of concurrent merges)
when filtering the logs down to merge events, we observe that no pipeline is making progress. The last merge events are (split IDs were renamed and redacted):
<a dozen irrelevant merge logs before this>
Feb 26 23:32:41 quickwit-indexer-43 quickwit-indexer INFO merge{merge_split_id=MERGED_1 split_ids=[...] typ=Merge}:merge_executor: quickwit_indexing::actors::merge_executor: merge-operation-success merged_num_docs=169455 elapsed_secs=3.6228392 operation_type=Merge
Feb 26 23:32:42 quickwit-indexer-43 quickwit-indexer INFO merge{merge_split_id=MERGED_2 split_ids=[...] typ=Merge}:publisher{split_update=SplitsUpdate { index_id: "my_index", new_splits: "...", checkpoint_delta: None }}: quickwit_indexing::actors::publisher: publish-new-splits
Feb 26 23:32:42 quickwit-indexer-43 quickwit-indexer INFO merge{merge_split_id=MERGED_3 split_ids=[...] typ=Merge}:merge_split_downloader: quickwit_indexing::actors::merge_split_downloader: download-merge-splits dir=/quickwit/qwdata/indexing/my_index%...%_ingest-source%...%0DFtic/merge%SNNV1s
Feb 26 23:32:48 quickwit-indexer-43 quickwit-indexer INFO merge{merge_split_id=MERGED_4 split_ids=[...] typ=Merge}:publisher{split_update=SplitsUpdate { index_id: "my_index", new_splits: "...", checkpoint_delta: None }}: quickwit_indexing::actors::publisher: publish-new-splits
Feb 26 23:32:50 quickwit-indexer-43 quickwit-indexer INFO merge{merge_split_id=MERGED_3 split_ids=[...] typ=Merge}:merge_executor: quickwit_indexing::actors::merge_executor: merge-operation-success merged_num_docs=197554 elapsed_secs=3.9022155 operation_type=Merge
<no more merge logs after this>
MERGED_2 is published and apparently the semaphore is properly released, as MERGED_3 is scheduled for download. We see that the last event for merged splits MERGED_1 and MERGED_3 is merge-operation-success, so these splits never reach the publish-new-splits point. MERGED_4 reaches publish-new-splits, but a new merge is never scheduled.
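To make the permit-starvation reasoning concrete, here is a rough sketch (generic tokio code, not the actual merge scheduler) of why the active merge metric can pin at 3 while the backlog grows: each merge holds one of the 3 permits until its publish step completes, so three hung publishers are enough to starve every future merge.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{sleep, timeout, Duration};

#[tokio::main]
async fn main() {
    // 3 permits, mirroring the maximum number of concurrent merges.
    let merge_permits = Arc::new(Semaphore::new(3));

    // Three merges each acquire a permit, then get stuck before publish.
    for i in 0..3 {
        let permits = Arc::clone(&merge_permits);
        tokio::spawn(async move {
            let _permit = permits.acquire_owned().await.unwrap();
            println!("merge {i} started, permit held");
            sleep(Duration::from_secs(3600)).await; // publish hangs -> permit never released
        });
    }
    sleep(Duration::from_millis(100)).await;

    // The next planned merge never gets a permit: the pending count only grows.
    let next_merge = timeout(Duration::from_secs(1), merge_permits.acquire()).await;
    assert!(next_merge.is_err(), "all permits are held by hung merges");
}
```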
During the same period, we see the following number of occurrences for the different logs:
30 "shutdown merge pipeline initiated" (30 - 27 = 3, which matches the merge permit count)
29 "disconnecting merge planner mailbox"
30 MergePlanner success
28 MergeSplitDownloader success
28 MergeExecutor success
28 MergePackager success
28 MergeUploader success
27 MergePublisher success
Between the “shutdown xxx” and “disconnecting xxx” log lines, there is only one message passing from the MergePipeline to the MergePublisher, which means that either:
a MergePublisher is hanging
MergePipeline.handles_opt is None, so the merge pipeline was terminated (but no indication of the merge actors being killed was found)
the 2 pipelines that didn’t stop (30 - 28 = 2) are likely hanging because the scheduler is still holding the MergeSplitDownloader mailbox in its pending queue.
Merge planners exit with success even if the mailbox isn’t detached, because RunFinalizeMergePolicyAndQuit returns Err(ActorExitStatus::Success); this explains why we have 30 MergePlanner success events.
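A toy actor loop (not the quickwit-actors crate; messages and types are simplified) illustrating the pattern described above: the handler signals termination by returning Err(ActorExitStatus::Success), the loop reports a successful exit, and disconnecting the mailbox remains a separate step that may never run.

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
#[allow(dead_code)]
enum ActorExitStatus {
    Success,
    Failure(String),
}

enum PlannerMsg {
    NewSplit,
    RunFinalizeMergePolicyAndQuit,
}

async fn run_merge_planner(mut inbox: mpsc::Receiver<PlannerMsg>) -> ActorExitStatus {
    while let Some(msg) = inbox.recv().await {
        let handled: Result<(), ActorExitStatus> = match msg {
            PlannerMsg::NewSplit => Ok(()), // plan merges...
            // Termination is signaled through the Err variant, analogous to
            // RunFinalizeMergePolicyAndQuit returning Err(ActorExitStatus::Success).
            PlannerMsg::RunFinalizeMergePolicyAndQuit => Err(ActorExitStatus::Success),
        };
        if let Err(exit_status) = handled {
            // The loop ends and the actor counts as a success, regardless of
            // whether a separate "disconnecting merge planner mailbox" step ever
            // runs; that separation would explain 30 successes vs 29 disconnects.
            return exit_status;
        }
    }
    ActorExitStatus::Success
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(1);
    let planner = tokio::spawn(run_merge_planner(rx));
    tx.send(PlannerMsg::NewSplit).await.unwrap();
    tx.send(PlannerMsg::RunFinalizeMergePolicyAndQuit).await.unwrap();
    println!("planner exited: {:?}", planner.await.unwrap());
}
```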
Possible calls that could have caused the Publisher to hang:
publish-new-splits on MERGED_4
Hints
We might need to use failpoints to reproduce the race condition.
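For example, something along the lines of the fail crate could be used; the fail point name and its placement below are hypothetical, the idea being to pause the publisher right before it forwards the NewSplit so that a concurrent pipeline shutdown reliably wins the race.

```rust
// Production code path: the fail point is a no-op unless the "failpoints"
// feature of the `fail` crate is enabled.
fn forward_new_split_to_merge_planner() {
    fail::fail_point!("publisher:before-send-new-split");
    // ... this is where ctx.send_message(source_mailbox, NewSplit) would happen ...
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_publisher_vs_merge_pipeline_shutdown_race() {
        let scenario = fail::FailScenario::setup();
        // Delay the publisher by 2s at the fail point, leaving ample time to
        // trigger the merge pipeline shutdown in between.
        fail::cfg("publisher:before-send-new-split", "sleep(2000)").unwrap();

        // ... start a pipeline, trigger its shutdown, then assert that the
        // publisher does not hang on the planner mailbox ...
        forward_new_split_to_merge_planner();

        scenario.teardown();
    }
}
```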
Configuration:
Version:
https://github.com/quickwit-oss/quickwit/releases/tag/qw-airmail-20250219-hotfix-merge-panic