Added basic liveness- & readinessProbe for awx-web & awx-task containers #927
Conversation
Regarding the test failure: it does not look like a failure introduced by these changes, but rather like the test K8s cluster not being reachable.
@rooftopcellist I like this. I know you had expressed concern about not seeing the migration screen, but I don't care about that, and I don't think most people would. In OCP I don't even think the route will become available until the service is up, and the service won't be up until the pods are ready. So the route will become available when the application is ready and available, which is the desired state.
@moonrail I don't agree that the failure is the tests not reaching the k8s cluster. The test: awx-operator/molecule/default/tasks/awx_test.yml, lines 2 to 12 at 3ac0232
the output:
The key part being:
I wonder if we need to put some retry logic on that task.
Yes, @kdelee is right. These 3 conditions in the molecule test linked above (which CI runs) will not be met until the readiness probe is successful. The probe will not be successful until migrations are complete, which takes significantly longer than just waiting for the pod to be in the running state (which is what the test used to check).
I would suggest raising the timeout to 1500s (25 min). Retry logic isn't needed.
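For reference, a minimal sketch of what a wait with a longer timeout could look like in a molecule verify task. The task name, label selector, namespace variable, and retry/delay values below are illustrative assumptions, not the actual contents of awx_test.yml:

- name: Wait for the AWX web deployment to report ready replicas
  kubernetes.core.k8s_info:
    api_version: apps/v1
    kind: Deployment
    namespace: "{{ namespace }}"
    label_selectors:
      - app.kubernetes.io/name=awx-web   # assumed label; adjust to the real selector
  register: awx_web_deployment
  until: >-
    awx_web_deployment.resources | length > 0 and
    (awx_web_deployment.resources[0].status.readyReplicas | default(0)) > 0
  retries: 150
  delay: 10   # 150 retries x 10s = 1500s, i.e. the 25 min suggested above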
Ah, now I understand the molecule testing setup; I did not fully grasp it before.
Initial DB migrations take about 5 min on my local minikube setup and on our K8s test environment. The currently proposed livenessProbe would restart the container after about 95 seconds, so the deployment cannot succeed when migrations are required. The readinessProbe is not problematic as it is, but I think I had misunderstood the situation: the current migration execution & waiting on container startup cannot be kept as-is for AWX being run in K8s. With the current AWX startup scripts, a livenessProbe cannot be safely implemented without waiting an unreasonably long time before even probing (what is safe: 15 min, 25 min, 40 min? That cannot be the solution).

Startup scripts: both use https://github.com/ansible/awx/blob/21.0.0/tools/ansible/roles/dockerfile/files/wait-for-migrations

If migrations were only applied within an initContainer, liveness- & readinessProbes would only become relevant once migrations were applied successfully, as the actual container would not be started until the initContainer succeeded. An initContainer would have to be added to the deployment template; waiting for migration completion is already done on startup of the containers via the linked wait-for-migrations script.

What do you think?
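To make the idea concrete, a rough sketch of what such an initContainer could look like in the deployment template. The image reference and exact command are assumptions, not the final implementation:

initContainers:
  - name: database-migration
    image: "quay.io/ansible/awx:21.0.0"   # assumed image; would come from the operator's image variables
    command: ["awx-manage", "migrate", "--noinput"]
    # In practice this container would also need the same settings/secret volume
    # mounts as the web & task containers so it can reach the database.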
@moonrail Your suggestion to run migrations in an init container sounds like a good idea to me. @shanemcd and @Zokormazo may be interested in this suggestion. It sounds like you would be interested in implementing that in this PR so that the readiness/liveness probes can work sanely?
A really cool thing to do would be: while the migrations init container is running, have another container serving that "migrations are running" screen that the service points to. Then, once the actual app is up (readiness checks pass), we patch the service to point to the controller API. I know this contradicts what I said the other day about "not caring" if I see the migration screen... still trying to figure out the best compromise between giving human users feedback that the app is still "installing", i.e. not usable yet but ostensibly healthy, and giving automation the right feedback that the app is not yet ready.
@kdelee
@moonrail I think just doing the init container + the readiness/liveness checks is a great start and the best way to begin. If it is desirable to show some "awx is starting" screen before those pass, that seems like a distinct request.
@kdelee
@kdelee @rooftopcellist
@Zokormazo any way we can get a downstream test of this change?
@moonrail Hello! We are in the process of working on merging this PR. This branch will need to be rebased prior to that. Would you mind running that rebase on your end? If not, we can certainly work toward an alternative resolution. Thank you for your time!
…accept connections, Added basic liveness- & readinessProbe for web & task containers Fixes #926
Hi @djyasin
Thank you so much @moonrail!! I will try running it through tests on our end now that this has been rebased.
I'm a little worried about this change. What happens when scaling the deployment to more than 1? This is going to cause the migrations to run every time a new pod is created, right?
@shanemcd When you have e.g. 3 replicas and you're upgrading AWX, you'll get a 4th pod that will run migrations on the database.
Re: draining AWX before upgrade: that is problematic if jobs are running. To @shanemcd's concern, what about a fresh install with more than one replica?
@kdelee
@@ -77,6 +77,7 @@
    apply: yes
    definition: "{{ lookup('template', 'deployment.yaml.j2') }}"
    wait: yes
+   wait_timeout: 600
Please remove this. I feel like it's more likely to cause problems than help anything.
Default is 120s:
https://docs.ansible.com/ansible/latest/collections/kubernetes/core/k8s_module.html#parameter-wait_timeout
The initial migration of the AWX database takes longer than 120s, so subsequent tasks would run without the AWX pods being ready.
I cannot guess how long it takes in individual environments, as it depends on circumstances such as throughput, IO and so on.
Given it may be hard to guess, I wonder if we want to make this a variable we could set
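A minimal sketch of how that could look, assuming a hypothetical variable name such as deployment_wait_timeout with the current value as its default:

# defaults (hypothetical variable name)
deployment_wait_timeout: 600

# task applying the deployment template
- name: Apply the AWX deployment
  kubernetes.core.k8s:
    apply: yes
    definition: "{{ lookup('template', 'deployment.yaml.j2') }}"
    wait: yes
    wait_timeout: "{{ deployment_wait_timeout }}"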
httpGet:
  path: /api/v2/ping/
  port: 8052
timeoutSeconds: 2
shouldn't it be configured from outside with predefined default values?
#1188 - something like that. That way the user will have more control over the deployment if needed.
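Something along these lines in deployment.yaml.j2, for example; the variable names here are illustrative placeholders, not the actual spec fields proposed in #1188:

readinessProbe:
  httpGet:
    path: /api/v2/ping/
    port: 8052
  initialDelaySeconds: {{ web_readiness_initial_delay | default(20) }}
  periodSeconds: {{ web_readiness_period | default(10) }}
  timeoutSeconds: {{ web_readiness_timeout | default(2) }}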
@erz4
I agree; it just was not the aim of my PR.
I wanted this PR to be a basic minimum to add this functionality ASAP, as every change to the deployment causes downtimes & job failures.
But since this PR is still open after nearly 8 months now, that obviously did not go as desired, and the problem still persists.
If your PR is merged, this one can be closed. We just need these checks.
@moonrail
Did you test it locally with molecule?
My probes are failing in the docker-desktop env but working on a k8s cluster.
I've tested with molecule, but also in our test environment (k8s).
Both worked on the last try (several months ago).
Yup, mine is failing in the migration phase, as mentioned above.
So what is the status now? Why is this PR stuck?
We discussed this internally and think that a startupProbe would do the job for that case; something like this for web & task:

startupProbe:
  exec:
    command: ["!", "/usr/bin/awx-manage", "showmigrations", "|", "grep", "'[ ]'"]
  initialDelaySeconds: 5
  periodSeconds: 3
  failureThreshold: 900
  successThreshold: 1
  timeoutSeconds: 2
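One caveat with that proposal: exec probes do not run through a shell, so the "!" and "|" above would be passed to awx-manage as literal arguments rather than interpreted as shell operators. A shell-wrapped variant would be needed, roughly like the following sketch (not the final proposal):

startupProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # succeed only once no unapplied migrations ("[ ]") remain
      - "! /usr/bin/awx-manage showmigrations | grep -q '\\[ \\]'"
  periodSeconds: 3
  failureThreshold: 900
  timeoutSeconds: 2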
Hello!
As this is still an open issue: is there some foreseeable activity/progress on this?
Yeah, no, this is not working. There seems to be no real interest on the maintainers' side to implement this, so I won't bother anymore and will close this PR, as I am no longer willing to make any changes to it. Who knows why this basic feature is not desired, maybe to reserve it for other downstream channels, but whatever.
Adds basic livenessProbe & readinessProbe for the containers web & task. See issue #926 for background on this.

Removes the env var AWX_SKIP_MIGRATIONS, as it was removed in the AWX 18.0.0 version of the launch scripts and is not used anymore.

As discussed in this PR:
- Adds an initContainer database-migration that applies migrations, or waits if migrations are being applied. This way no launch script can interfere, and no AWX pod can come online without migrations already applied, only to sit there unavailable.
- Removes the awx-operator-based DB migration handling in the installer, as at that point the containers have already done the job.
- As a consequence, wait_timeout had to be increased in the installer for the AWX pods deployment, to accommodate the longer initial startup time of the AWX pods when many/all migrations have to be applied.
- I've copied the CA volume mounts from the web & task containers to this new initContainer as well, as these may be required depending on the SSL certificate used on an (external) database host.
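For readers skimming the thread, the web container probes under discussion have roughly this shape; the values below are illustrative placeholders (the discussion above mentions a restart after about 95 seconds), and the actual numbers are in the diff:

livenessProbe:
  httpGet:
    path: /api/v2/ping/
    port: 8052
  initialDelaySeconds: 35   # illustrative: 35 + 30 * 2 = 95s until a restart
  periodSeconds: 30
  failureThreshold: 2
  timeoutSeconds: 2
readinessProbe:
  httpGet:
    path: /api/v2/ping/
    port: 8052
  initialDelaySeconds: 20
  periodSeconds: 10
  timeoutSeconds: 2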