
OOMKilled maintenance job pods are getting piled up even if --keep-latest-maintenance-jobs=0 #8593

Open
navilg opened this issue Jan 8, 2025 · 16 comments

Comments

@navilg

navilg commented Jan 8, 2025

What steps did you take and what happened:

We have Velero v1.14.0 deployed with --keep-latest-maintenance-jobs=0. When a maintenance job completes successfully, its pod is deleted as expected, but when the maintenance job pod is OOMKilled, the pod stays in the cluster.

[screenshot: OOMKilled maintenance job pods]

What did you expect to happen:

Maintenance job pods should have been cleaned up automatically because --keep-latest-maintenance-jobs=0 is set.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.14.0
  • Velero features (use velero client config get features): NA
  • Kubernetes version (use kubectl version): 1.30
  • Kubernetes installer & version: GKE
  • Cloud provider or hardware configuration: GCP
  • OS (e.g. from /etc/os-release): Ubuntu

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@sseago
Collaborator

sseago commented Jan 8, 2025

Looks like the velero install code only passes the flag in if the value isn't zero, so we treat 0 and (absent) as equivalent. The fix may be as simple as passing in the value whenever it is specified, and only leaving it off if the user left it off.
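The suspected bug pattern can be sketched as follows. This is an illustrative shell sketch, not Velero's actual install code (which is Go); the function names are hypothetical and only the flag name comes from the thread.

```shell
# Hypothetical sketch of the suspected arg-building bug: the flag is only
# emitted when its value is nonzero, so an explicit 0 is indistinguishable
# from "not specified" and the server falls back to its default.
build_server_args() {
  local keep="$1"   # value of --keep-latest-maintenance-jobs, or "" if unset
  if [ -n "$keep" ] && [ "$keep" -ne 0 ]; then
    echo "--keep-latest-maintenance-jobs=$keep"
  fi
}

# The fix suggested above: emit the flag whenever the user specified a
# value, even if that value is 0.
build_server_args_fixed() {
  local keep="$1"
  if [ -n "$keep" ]; then
    echo "--keep-latest-maintenance-jobs=$keep"
  fi
}
```

With the buggy version, `build_server_args 0` produces no flag at all, while the fixed version still emits `--keep-latest-maintenance-jobs=0`.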

@Lyndon-Li
Contributor

@navilg
Could you check if the maintenance job objects are also left by running:
kubectl -n velero get job

@navilg
Author

navilg commented Jan 9, 2025

@navilg
Could you check if the maintenance job objects are also left by running:
kubectl -n velero get job

@Lyndon-Li Yes. They are also left, in a "Failed" state.

@Lyndon-Li
Contributor

Could you describe the leftover job and share the output? Do you see a deletionTimestamp in the job?

@navilg
Author

navilg commented Jan 9, 2025

The job has now been deleted automatically along with its pod. Let me see if I can share the details once this issue happens again.

@navilg
Author

navilg commented Jan 9, 2025

Is there any timeout value for the maintenance job which may have cleared the job/pod? When I checked yesterday, the job pod had been there for the last 46h.

@Lyndon-Li
Contributor

There is no timeout from Velero.
Just test and share the info requested in #8593 (comment) when you reproduce it.

@navilg
Author

navilg commented Jan 9, 2025

describe-kopia-maintenance-job.txt

@Lyndon-Li I have uploaded the job description from another cluster where I see the same issue with the same configuration. I don't see a deletion timestamp anywhere in the job description. In this cluster I see 5 maintenance job pods in OOMKilled status that were not cleaned up.
[screenshot: 5 OOMKilled maintenance job pods]

@blackpiglet
Contributor

Could you please check whether there are any error logs like the following in the Velero pod?

Failed to delete maintenance job

@navilg
Author

navilg commented Jan 10, 2025

No, I don't see any log entries related to the maintenance job in the Velero pod. I only see the following maintenance-related entries:

```
time="2025-01-10T06:17:45Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/ausa-default-kopia-fftnf logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T06:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/ausa-oob-default-kopia-vchc5 logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T06:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/nginx-example-default-kopia-cgk2b logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T06:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/trident-default-kopia-tsrpp logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T06:32:47Z" level=warning msg="Found too many index blobs (1385), this may result in degraded performance.\n\nPlease ensure periodic repository maintenance is enabled or run 'kopia maintenance'." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[shared-manager]" sublevel=error
time="2025-01-10T06:32:48Z" level=warning msg="Found too many index blobs (1385), this may result in degraded performance.\n\nPlease ensure periodic repository maintenance is enabled or run 'kopia maintenance'." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[shared-manager]" sublevel=error
time="2025-01-10T07:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/nginx-example-default-kopia-cgk2b logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T07:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/trident-default-kopia-tsrpp logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T07:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/ausa-default-kopia-fftnf logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T07:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/ausa-oob-default-kopia-vchc5 logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T08:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/trident-default-kopia-tsrpp logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T08:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/ausa-default-kopia-fftnf logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T08:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/ausa-oob-default-kopia-vchc5 logSource="pkg/controller/backup_repository_controller.go:293"
time="2025-01-10T08:17:46Z" level=info msg="Running maintenance on backup repository" backupRepo=velero/nginx-example-default-kopia-cgk2b logSource="pkg/controller/backup_repository_controller.go:293"
```

@navilg
Author

navilg commented Jan 12, 2025

@Lyndon-Li @blackpiglet Were you able to find anything with the info provided?

@blackpiglet
Contributor

@navilg
Is there any error message in the corresponding BackupRepository CR?

@navilg
Author

navilg commented Jan 20, 2025

@blackpiglet Sorry, I was out for the week.
I do not see any error in the backup repository. At this time, the last maintenance ran 30 minutes ago, and the last maintenance job pod in OOMKilled status is from 7 days ago.
Currently, there is also one maintenance job which has been running for the last 24h.

@blackpiglet
Contributor

blackpiglet commented Jan 21, 2025

I couldn't reproduce this issue in my environment.
I changed the kept-job number from the default to 0 in the Velero deployment, and then all the jobs that failed with OOMKilled were deleted.

```bash
wget https://github.com/vmware-tanzu/velero/releases/download/v1.14.1/velero-v1.14.1-darwin-arm64.tar.gz
tar zxvf velero-v1.14.1-darwin-arm64.tar.gz

$HOME/Downloads/velero-v1.14.1-darwin-arm64/velero install \
    --default-repo-maintain-frequency=2m \
    --maintenance-job-cpu-limit=200m \
    --maintenance-job-cpu-request=100m \
    --maintenance-job-mem-limit=100Mi \
    --maintenance-job-mem-request=50Mi \
    --keep-latest-maintenance-jobs=0 \
    --provider gcp \
    --bucket jxun \
    --secret-file $HOME/Library/CloudStorage/Box-Box/Documents/credentials/credentials-velero-gcp \
    --features=EnableCSI \
    --image velero/velero:v1.14.1 \
    --plugins velero/velero-plugin-for-gcp:v1.10.1 \
    --use-node-agent

k create ns kibishii

k apply -n kibishii -k $HOME/go/src/github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/gcp/

kubectl exec \
 -n kibishii \
 jump-pad -- /usr/local/bin/generate.sh 2 10 10 1024 1024 0 2

$HOME/Downloads/velero-v1.14.1-darwin-arm64/velero backup create --include-namespaces=kibishii --snapshot-move-data 8593-01
```

```bash
jxun@DH7PKQMYXW:~/go/src/github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml (main*) $ k -n velero edit deploy velero
deployment.apps/velero edited
jxun@DH7PKQMYXW:~/go/src/github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml (main*) $ k -n velero get job
NAME                                                      STATUS    COMPLETIONS   DURATION   AGE
kibishii-default-kopia-tbsr2-maintain-job-1737451449977   Failed    0/1           5m42s      5m43s
kibishii-default-kopia-tbsr2-maintain-job-1737451459547   Failed    0/1           5m33s      5m33s
kibishii-default-kopia-tbsr2-maintain-job-1737451749770   Failed    0/1           43s        43s
kibishii-default-kopia-tbsr2-maintain-job-1737451791020   Running   0/1           1s         1s
jxun@DH7PKQMYXW:~/go/src/github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml (main*) $ k -n velero get pods
NAME                                                            READY   STATUS      RESTARTS   AGE
kibishii-default-kopia-tbsr2-maintain-job-1737451449977-q7rcg   0/1     OOMKilled   0          5m48s
kibishii-default-kopia-tbsr2-maintain-job-1737451459547-7tlwr   0/1     OOMKilled   0          5m39s
kibishii-default-kopia-tbsr2-maintain-job-1737451749770-v4lls   0/1     OOMKilled   0          49s
kibishii-default-kopia-tbsr2-maintain-job-1737451791020-qb4nc   0/1     OOMKilled   0          7s
node-agent-4v2kd                                                1/1     Running     0          95m
node-agent-d5sdk                                                1/1     Running     0          95m
velero-d568fc76c-q5tz9                                          1/1     Running     0          10s
jxun@DH7PKQMYXW:~/go/src/github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml (main*) $ k -n velero get pods
NAME                     READY   STATUS    RESTARTS   AGE
node-agent-4v2kd         1/1     Running   0          95m
node-agent-d5sdk         1/1     Running   0          95m
velero-d568fc76c-q5tz9   1/1     Running   0          15s
jxun@DH7PKQMYXW:~/go/src/github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml (main*) $ k -n velero get pods
NAME                     READY   STATUS    RESTARTS   AGE
node-agent-4v2kd         1/1     Running   0          95m
node-agent-d5sdk         1/1     Running   0          95m
velero-d568fc76c-q5tz9   1/1     Running   0          19s
```

@blackpiglet
Contributor

blackpiglet commented Jan 21, 2025

Is it OK if I give you a debug version of Velero to help test it?
Or you can enable the Velero server's debug log level.
Add this line to the Velero deployment's args list:
- --log-level=debug
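For reference, that arg goes into the velero Deployment spec roughly as shown below. This is an abbreviated, illustrative fragment; apart from --log-level=debug and the container image, the surrounding fields are a guess at a typical install and may differ in your cluster.

```yaml
# Fragment of the velero Deployment with debug logging enabled.
# Only the args list matters here; other fields are abbreviated.
spec:
  template:
    spec:
      containers:
        - name: velero
          image: velero/velero:v1.14.1
          args:
            - server
            - --keep-latest-maintenance-jobs=0
            - --log-level=debug   # <- added line
```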

@navilg
Author

navilg commented Jan 25, 2025

Is it OK if I give you a debug version of Velero to help test it?
Or you can enable the Velero server's debug log level.
Add this line to the Velero deployment's args list:
- --log-level=debug

Yes, I can enable the Velero server debug log and share the logs.
