Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Make summaries backfill job more resilient. #1071

Merged
merged 1 commit into from
Jan 16, 2025

Conversation

romanrizzi
Copy link
Member

To quickly select backfill candidates without comparing SHAs, we compare the last summarized post to the topic's highest_post_number. However, hiding or deleting a post and adding a small action will update this column, causing the job to stall and re-generate the same summary repeatedly until someone posts a regular reply. On top of this, this is not always true for topics with best_replies, as this last reply isn't necessarily included.

Since this is not evident at first glance and each summarization strategy picks its targets differently, I'm opting to simplify the backfill logic and how we track potential candidates.

The first step is dropping content_range, which serves no purpose and it's there because summary caching was supposed to work differently at the beginning. So instead, I'm replacing it with a column called highest_target_number, which tracks highest_post_number for topics and could track other things like channel's message_count in the future.

Now that we have this column when selecting every potential backfill candidate, we'll check if the summary is truly outdated by comparing the SHAs, and if it's not, we just update the column and move on

@romanrizzi romanrizzi force-pushed the summary_backfill_resilience branch from 3e5cfee to f0c202a Compare January 15, 2025 19:56
To quickly select backfill candidates without comparing SHAs, we compare the last summarized post to the topic's highest_post_number. However, hiding or deleting a post and adding a small action will update this column, causing the job to stall and re-generate the same summary repeatedly until someone posts a regular reply. On top of this, this is not always true for topics with `best_replies`, as this last reply isn't necessarily included.

Since this is not evident at first glance and each summarization strategy picks its targets differently, I'm opting to simplify the backfill logic and how we track potential candidates.

The first step is dropping `content_range`, which serves no purpose and it's there because summary caching was supposed to work differently at the beginning. So instead, I'm replacing it with a column called `highest_target_number`, which tracks `highest_post_number` for topics and could track other things like channel's `message_count` in the future.

Now that we have this column when selecting every potential backfill candidate, we'll check if the summary is truly outdated by comparing the SHAs, and if it's not, we just update the column and move on
@romanrizzi romanrizzi force-pushed the summary_backfill_resilience branch from f0c202a to 20350af Compare January 15, 2025 20:07
@SamSaffron
Copy link
Member

Generally I think this is OK, one challenge is going to be stuff like documentation topics, where there is huge amounts of churn on the topic being summarized but backfill does not pick up on it.

I would go ahead and merge this to stop the bleeding but long term we may need to think about this problem

@romanrizzi romanrizzi merged commit 46fcdb6 into main Jan 16, 2025
6 checks passed
@romanrizzi romanrizzi deleted the summary_backfill_resilience branch January 16, 2025 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants