OOMKilled maintenance job pods are getting piled up even if --keep-latest-maintenance-jobs=0 #8593
Comments
Looks like the velero install code only passes it in if the value isn't zero, so we treat 0 and (absent) as equivalent. The fix may be as simple as passing in the value if specified, and only leaving it off if the user left it off.
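A quick, hedged way to confirm this on an existing install is to check whether the flag actually made it onto the Velero server arguments (this assumes the default velero namespace and deployment name, and that the install flag is wired through as a server arg like the other maintenance flags):

```bash
# If `velero install` propagated --keep-latest-maintenance-jobs=0, the flag
# should appear in the server container's args; no output here suggests the
# value was dropped. Namespace and deployment name are assumed defaults.
kubectl -n velero get deployment velero -o yaml | grep keep-latest-maintenance-jobs
```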
@navilg
@Lyndon-Li Yes. They are also left in "Failed" state.
Could you describe the leftover job and share the output? Can you see a deletionTimestamp in the job?
The job is automatically deleted now along with its pod. Let me see if I can share the details once this issue happens again.
Is there any timeout value for the maintenance job which may have cleared the job/pod? When I checked yesterday, the job pod had still been there for the last 46h.
There is no timeout from Velero.
describe-kopia-maintenance-job.txt @Lyndon-Li I have uploaded the job description from another cluster where I see the same issue with the same configuration. I don't see a deletion timestamp anywhere in the job description. In this cluster I see 5 maintenance job pods in OOMKilled status that have not been cleaned up.
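For reference, a hedged way to check both things from the command line (assuming the default velero namespace; the job name below is a placeholder):

```bash
# List failed (e.g. OOMKilled) maintenance job pods left in the namespace.
kubectl -n velero get pods --field-selector=status.phase=Failed

# Print the job's deletionTimestamp; empty output means no deletion was requested.
kubectl -n velero get job <maintenance-job-name> -o jsonpath='{.metadata.deletionTimestamp}'
```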
Could you please check whether there were any error logs like the following in the Velero pod?
No. I don't see any log entries related to the maintenance job in the velero pod. I only see the below log entries related to maintenance in the velero pod:
@Lyndon-Li @blackpiglet Were you able to find anything with the info provided?
@navilg
@blackpiglet Sorry, I was out for the week.
I didn't reproduce this issue in my environment.
```bash
wget https://github.com/vmware-tanzu/velero/releases/download/v1.14.1/velero-v1.14.1-darwin-arm64.tar.gz
tar zxvf velero-v1.14.1-darwin-arm64.tar.gz

$HOME/Downloads/velero-v1.14.1-darwin-arm64/velero install \
  --default-repo-maintain-frequency=2m \
  --maintenance-job-cpu-limit=200m \
  --maintenance-job-cpu-request=100m \
  --maintenance-job-mem-limit=100Mi \
  --maintenance-job-mem-request=50Mi \
  --keep-latest-maintenance-jobs=0 \
  --provider gcp \
  --bucket jxun \
  --secret-file $HOME/Library/CloudStorage/Box-Box/Documents/credentials/credentials-velero-gcp \
  --features=EnableCSI \
  --image velero/velero:v1.14.1 \
  --plugins velero/velero-plugin-for-gcp:v1.10.1 \
  --use-node-agent

k create ns kibishii
k apply -n kibishii -k $HOME/go/src/github.com/vmware-tanzu-experiments/distributed-data-generator/kubernetes/yaml/gcp/

kubectl exec \
  -n kibishii \
  jump-pad -- /usr/local/bin/generate.sh 2 10 10 1024 1024 0 2

$HOME/Downloads/velero-v1.14.1-darwin-arm64/velero backup create --include-namespaces=kibishii --snapshot-move-data 8593-01
```
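To watch the maintenance jobs and their pods appear and get cleaned up once the 2-minute maintenance frequency kicks in, something like the following should work (assuming the default velero namespace):

```bash
# Watch repository maintenance jobs and pods in the velero namespace.
kubectl -n velero get jobs,pods -w
```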
Is it OK if I give you a debug version of Velero to help test it?
Yes, I can enable the velero server debug log and share the logs.
What steps did you take and what happened:
We have velero v1.14.0 deployed with --keep-latest-maintenance-jobs=0. When a maintenance job completes fine, its job pod is deleted, but when the maintenance job pod is OOMKilled, the pods stay in the cluster.
What did you expect to happen:
Maintenance job pods should have been cleaned up automatically because of --keep-latest-maintenance-jobs=0.
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help.
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
Environment:
- Velero version (use velero version): 1.14.0
- Velero features (use velero client config get features): NA
- Kubernetes version (use kubectl version): 1.30
- OS (e.g. from /etc/os-release): Ubuntu

Vote on this issue!
This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.