Use vtbackup in scheduled backups #658

Open · wants to merge 15 commits into main
Conversation

@frouioui (Member) commented Jan 28, 2025

Context

This Pull Request modifies the scheduled backup feature that was introduced in #553. The implementation in the original PR was intentionally simple, a "v1". It introduced the VitessBackupSchedule CRD, giving end-users the option to configure different VitessBackupScheduleStrategy entries, which tell the vitess-operator which keyspace/shard to back up. All the strategies in a VitessBackupSchedule were merged together to form a single vtctldclient command that was executed in a Kubernetes Job following the schedule set in the VitessBackupSchedule CRD.

While this approach works, it has a limitation: vtctldclient temporarily takes a serving tablet out of the serving pool and uses it to take the backup. While this limitation is documented in the release notes, we want to encourage people to use vtbackup as an alternative.

Proposed Change

This Pull Request modifies the VitessBackupSchedule controller to create one Kubernetes Job per VitessBackupScheduleStrategy defined in the VitessBackupSchedule by the user, instead of merging all strategies into a single Job. Each Job now creates a vtbackup pod, instead of a pod in which a vtctldclient command was executed.
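
To picture the fan-out, here is a minimal sketch of the new per-strategy loop. This is not the actual vitess-operator code: the helpers jobName and vtbackupJobSpec, and the exact Spec field names, are assumptions for illustration only.

	// Sketch only: create one Kubernetes Job per VitessBackupScheduleStrategy.
	// Assumes batchv1 "k8s.io/api/batch/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
	// apierrors "k8s.io/apimachinery/pkg/api/errors", and ctrl "sigs.k8s.io/controller-runtime".
	for _, strategy := range vbsc.Spec.Strategies {
		job := &batchv1.Job{
			ObjectMeta: metav1.ObjectMeta{
				Name:      jobName(vbsc.Name, strategy), // hypothetical per-strategy naming helper
				Namespace: vbsc.Namespace,
			},
			// The pod template runs vtbackup against the strategy's keyspace/shard,
			// instead of a vtctldclient invocation. vtbackupJobSpec is hypothetical.
			Spec: vtbackupJobSpec(vbsc, strategy),
		}
		if err := r.client.Create(ctx, job); err != nil && !apierrors.IsAlreadyExists(err) {
			return ctrl.Result{}, err
		}
	}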

Changes

  • Since vtbackup can only back up a single keyspace/shard, we need one Job per strategy, meaning that each VitessBackupSchedule now manages multiple Job resources (one per strategy), which was not the case before (see the sketch in the Proposed Change section above).

  • The previous code logic relied heavily on the CRD's Status to schedule the next runs; that logic was removed to make Status purely informative for end users, not for the vitess-operator. Moreover, the previous logic updated the resource's Status on almost every request, without checking whether the Status actually needed updating. It now only updates the Status when needed, and ignores update conflicts (first sketch after this list).

  • The reconcile loop of VitessBackupSchedule at times did not return the result and error expected by the controller-runtime TypedReconciler interface, leading to unnecessary re-queues or missed re-queues. The logic has been modified to return correct values (second sketch after this list). A periodic resync of the controller has also been added to safeguard against potential issues and to ensure resources (Jobs, PVCs, etc.) are cleaned up correctly no matter what.

  • The upgrade test has been modified to also run the 401_scheduled_backups.yaml manifest.

  • The default ConcurrencyPolicy of the Jobs has been changed from Allow to Forbid (third sketch after this list). With the new strategies requiring far more resources and time to complete, this new default makes sense. The value remains configurable by end-users.

  • Previously, the vtbackup pod created with every VitessShard always attempted an initial backup (--initial_backup), which would fail if the VitessBackup definition was added after the VitessShard started running. That logic has been modified so the vitess-operator checks the current state of the VitessShard before creating the vtbackup pod. If there is already a primary in the shard, it will not be an initial backup but an --allow_first_backup backup, since we only create this pod when there is no backup for the shard yet. The VitessBackupSchedule relies on the same logic to determine which type of backup to run: initial, allow-first, or normal (last sketch after this list). A race between the initial vtbackup pod and the first scheduled run of a VitessBackupSchedule is, depending on the schedule, likely; in that situation both vtbackup pods come up with --initial_backup set to true, and one of the two simply fails early.
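
For the Status change above, a minimal sketch of "update only when needed, ignore conflicts", assuming oldStatus holds a deep copy of the Status taken at the start of the reconcile:

	// Sketch: write Status only when it actually changed, and treat update
	// conflicts as benign, since the next reconcile retries with fresh state.
	// Assumes apiequality "k8s.io/apimachinery/pkg/api/equality" and
	// apierrors "k8s.io/apimachinery/pkg/api/errors".
	if !apiequality.Semantic.DeepEqual(oldStatus, vbsc.Status) {
		if err := r.client.Status().Update(ctx, vbsc); err != nil && !apierrors.IsConflict(err) {
			return ctrl.Result{}, err
		}
	}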
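For the re-queue fix, the shape of the return values is what matters to controller-runtime: a non-nil error triggers a re-queue with backoff, while a nil error with RequeueAfter set gives the periodic resync. A sketch, reusing the resyncPeriod flag that appears in the review below:

	// Sketch: tail end of Reconcile.
	if err != nil {
		// Non-nil error: controller-runtime re-queues the request with backoff.
		return ctrl.Result{}, err
	}
	// Success: schedule the periodic resync so Jobs, PVCs, etc. are
	// eventually cleaned up even if no Kubernetes event fires.
	return ctrl.Result{RequeueAfter: *resyncPeriod}, nil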
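The concurrency-policy default can be pictured as follows; the Spec field name is an assumption:

	// Sketch: default to Forbid unless the user configured a policy.
	// batchv1.ForbidConcurrent is the standard "Forbid" ConcurrencyPolicy
	// constant from k8s.io/api/batch/v1.
	concurrencyPolicy := batchv1.ForbidConcurrent
	if vbsc.Spec.ConcurrencyPolicy != "" {
		concurrencyPolicy = vbsc.Spec.ConcurrencyPolicy
	}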
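Finally, the backup-type decision described in the last bullet reduces to a three-way choice on the observed shard state. In this sketch, hasPrimary and hasBackups are hypothetical stand-ins for the actual checks:

	// Sketch: pick the vtbackup mode from the shard state.
	var extraFlags []string
	switch {
	case !hasPrimary && !hasBackups:
		// Empty shard with no backups: vtbackup seeds the first backup.
		extraFlags = append(extraFlags, "--initial_backup")
	case hasPrimary && !hasBackups:
		// The shard already has a primary but no backup yet.
		extraFlags = append(extraFlags, "--allow_first_backup")
	default:
		// Normal backup run: no extra flag needed.
	}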

Related PRs

Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
@mattlord (Collaborator) left a comment

LGTM! ❤️ I only had a few minor comments/questions.

	var (
		maxConcurrentReconciles = flag.Int("vitessbackupschedule_concurrent_reconciles", 10, "the maximum number of different vitessbackupschedule resources to reconcile concurrently")
		resyncPeriod            = flag.Duration("vitessbackupschedule_resync_period", 1*time.Minute, "reconcile vitessbackupschedules with this period even if no Kubernetes events occur")
	)
Collaborator:

Shouldn't we use dashes for new flags in the operator as well? We didn't do that for the one above though, so I guess we can keep it consistent 🙂

Comment on lines +568 to +571
	if err != nil {
		if !apierrors.IsNotFound(err) {
			return nil, err
		}
Collaborator:

This could just be:

		if err != nil && !apierrors.IsNotFound(err) {
			return nil, err
		}

Then the rest of the code would be more clear and easier to read IMO. If we don't get an error, meaning that it already exists, we just re-use it? Do we ensure it's in a certain state in this case?

	if err != nil {
		return corev1.PodSpec{}, err
	}
	_, completeBackups, err := vitessbackup.GetBackups(ctx, vbsc.Namespace, vbsc.Spec.Cluster, strategy.Keyspace, vkr.SafeName(),
Collaborator:

Nit, but I think the var name should be completedBackups, no?
