Use vtbackup in scheduled backups #658
Conversation
Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
LGTM! ❤️ I only had a few minor comments/questions.
```go
var (
	maxConcurrentReconciles = flag.Int("vitessbackupschedule_concurrent_reconciles", 10, "the maximum number of different vitessbackupschedule resources to reconcile concurrently")
	resyncPeriod            = flag.Duration("vitessbackupschedule_resync_period", 1*time.Minute, "reconcile vitessbackupschedules with this period even if no Kubernetes events occur")
)
```
Shouldn't we use dashes for new flags in the operator as well? We didn't do that for the one above though, so I guess we can keep it consistent 🙂
```go
if err != nil {
	if !apierrors.IsNotFound(err) {
		return nil, err
	}
}
```
This could just be:

```go
if err != nil && !apierrors.IsNotFound(err) {
	return nil, err
}
```

Then the rest of the code would be clearer and easier to read, IMO. If we don't get an error, meaning that the object already exists, do we just re-use it? Do we ensure it's in a certain state in that case?
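To make the suggested refactor concrete, here is a minimal stdlib-only sketch of the get-or-create pattern under discussion. It mocks the Kubernetes API with a hypothetical `notFoundError` and `isNotFound` helper; none of these names are the operator's actual types, they only illustrate how folding the NotFound check into the error condition leaves three clearly separated outcomes.

```go
package main

import (
	"errors"
	"fmt"
)

// notFoundError stands in for the Kubernetes "not found" API error.
// The names here are illustrative, not the operator's actual types.
type notFoundError struct{ name string }

func (e *notFoundError) Error() string { return e.name + " not found" }

// isNotFound plays the role of apierrors.IsNotFound in this sketch.
func isNotFound(err error) bool {
	var nf *notFoundError
	return errors.As(err, &nf)
}

// getOrCreate folds the NotFound case into the error check, as the
// review comment suggests: any error other than NotFound aborts,
// an existing object is reused, and NotFound falls through to creation.
func getOrCreate(get func() (string, error), create func() (string, error)) (string, error) {
	obj, err := get()
	if err != nil && !isNotFound(err) {
		return "", err // real failure: propagate
	}
	if err == nil {
		return obj, nil // already exists: reuse it
	}
	return create() // NotFound: create a fresh object
}

func main() {
	// Simulate the NotFound path: get fails, so create runs.
	obj, err := getOrCreate(
		func() (string, error) { return "", &notFoundError{name: "pvc"} },
		func() (string, error) { return "new-pvc", nil },
	)
	fmt.Println(obj, err)
}
```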
```go
if err != nil {
	return corev1.PodSpec{}, err
}
_, completeBackups, err := vitessbackup.GetBackups(ctx, vbsc.Namespace, vbsc.Spec.Cluster, strategy.Keyspace, vkr.SafeName(),
```
Nit, but I think the var name should be `completedBackups`, no?
Context
This Pull Request modifies the scheduled backup feature introduced in #553. The original implementation was deliberately simple, a "v1". It introduced the `VitessBackupSchedule` CRD, which lets end-users configure different `VitessBackupScheduleStrategy` entries telling the vitess-operator which keyspace/shard to back up. All the strategies in a `VitessBackupSchedule` were merged into a single `vtctldclient` command that was executed in a Kubernetes `Job`, following the schedule set in the `VitessBackupSchedule` CRD.

While this approach works, it has a limitation: it uses `vtctldclient`, which temporarily takes a serving tablet out of the serving pool to take the backup. While this limitation is documented in the release notes, we want to encourage people to use `vtbackup` as an alternative.

Proposed Change
This Pull Request modifies the `VitessBackupSchedule` controller to create one Kubernetes `Job` per `VitessBackupScheduleStrategy` defined by the user in the `VitessBackupSchedule`, instead of merging all strategies into a single `Job`. Each `Job` now creates a `vtbackup` pod, instead of creating a pod in which a `vtctldclient` command was executed.

Changes
Since `vtbackup` can only back up a single keyspace/shard, we need one `Job` per strategy, meaning that each `VitessBackupSchedule` now manages multiple `Job` resources (one per strategy), which was not the case before.

The previous code logic relied heavily on the CRD's Status to schedule the next runs. That logic was removed so that Status is purely informative for end-users, not for the vitess-operator. Moreover, the previous logic updated the Status of the resource on almost every request, without checking whether an update was actually needed. It now updates the Status only when needed and ignores conflicts.
The reconcile loop of `VitessBackupSchedule` was, at times, not returning the result and error expected by the controller-runtime `TypedReconciler` interface, leading to unnecessary re-queues of requests or to missed re-queues. The logic has been modified to return correct values. A periodic resync of the controller has also been added as a safeguard against potential issues, ensuring that resources (`Job`s, PVCs, etc.) are cleaned up correctly no matter what.

The upgrade test has been modified to also run the `401_scheduled_backups.yaml` manifest.

The default `ConcurrencyPolicy` of the `Job`s has been changed from `Allow` to `Forbid`. With the new strategies requiring far more resources and time to complete, this new default makes sense; the value remains configurable by end-users.

Previously, the `vtbackup` pod created with every `VitessShard` always attempted an initial backup (`--initial_backup`), which failed if the `VitessBackup` definition was added after the `VitessShard` started running. That logic has been slightly modified so the vitess-operator is more aware of the current state of the `VitessShard` before creating the `vtbackup` pod. If there is already a primary in the shard, the pod takes an `--allow_first_backup` backup instead of an initial backup, as we only create this pod when the shard has no backup yet. The `VitessBackupSchedule` relies on the same logic to determine which type of backup it has to run: initial, allow-first, or normal. Depending on the schedule, a race between the initial `vtbackup` pod and the first scheduled run of a `VitessBackupSchedule` is likely; in that situation, both `vtbackup` pods come up with the `--initial_backup` flag set to true, and one of the two simply fails early.

Related PRs
- `vtbackup` in scheduled backups: vitessio/website#1946
- `vtbackup` in scheduled backups: vitessio/vitess#17869
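As a rough illustration of the backup-type selection described in the Changes section above: the function and constant names below are hypothetical, not the operator's actual code, but the branching follows the description (no primary means an initial backup; a primary but no completed backups means `--allow_first_backup`; otherwise a normal scheduled backup).

```go
package main

import "fmt"

type backupType string

const (
	initialBackup    backupType = "--initial_backup"
	allowFirstBackup backupType = "--allow_first_backup"
	normalBackup     backupType = "normal"
)

// chooseBackupType sketches the decision described in the PR: the
// operator inspects the shard's state before creating the vtbackup pod.
func chooseBackupType(hasPrimary bool, completedBackups int) backupType {
	if !hasPrimary {
		// No primary in the shard yet: bootstrap with an initial backup.
		return initialBackup
	}
	if completedBackups == 0 {
		// A primary exists but no backup does: allow a first backup
		// without taking over the shard.
		return allowFirstBackup
	}
	// Shard is running and already has backups: normal scheduled backup.
	return normalBackup
}

func main() {
	fmt.Println(chooseBackupType(false, 0))
	fmt.Println(chooseBackupType(true, 0))
	fmt.Println(chooseBackupType(true, 3))
}
```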