Add liveness/readiness probes to web/task - fixes #414 #1188

Closed
wants to merge 10 commits into from
80 changes: 80 additions & 0 deletions config/crd/bases/awx.ansible.com_awxs.yaml
@@ -579,6 +579,86 @@ spec:
image_pull_secret: # deprecated
description: (Deprecated) Image pull secret for app and database containers
type: string
web_liveness_initial_delay:
description: Number of seconds after the container has started before the liveness probe is initiated.
type: integer
default: 3
web_liveness_period:
description: How often (in seconds) to perform the probe.
type: integer
default: 3
web_liveness_failure_threshold:
description: Minimum consecutive failures for the probe to be considered failed.
type: integer
default: 3
web_liveness_success_threshold:
description: Minimum consecutive successes for the probe to be considered successful after having failed.
type: integer
default: 1
web_liveness_timeout:
description: Number of seconds after which the probe times out.
type: integer
default: 10
web_readiness_initial_delay:
description: Number of seconds after the container has started before the readiness probe is initiated.
type: integer
default: 3
web_readiness_period:
description: How often (in seconds) to perform the probe.
type: integer
default: 3
web_readiness_failure_threshold:
description: Minimum consecutive failures for the probe to be considered failed.
type: integer
default: 3
web_readiness_success_threshold:
description: Minimum consecutive successes for the probe to be considered successful after having failed.
type: integer
default: 1
web_readiness_timeout:
description: Number of seconds after which the probe times out.
type: integer
default: 5
task_liveness_initial_delay:
description: Number of seconds after the container has started before the liveness probe is initiated.
type: integer
default: 3
task_liveness_period:
description: How often (in seconds) to perform the probe.
type: integer
default: 3
task_liveness_failure_threshold:
description: Minimum consecutive failures for the probe to be considered failed.
type: integer
default: 3
task_liveness_success_threshold:
description: Minimum consecutive successes for the probe to be considered successful after having failed.
type: integer
default: 1
task_liveness_timeout:
description: Number of seconds after which the probe times out.
type: integer
default: 10
task_readiness_initial_delay:
description: Number of seconds after the container has started before the readiness probe is initiated.
type: integer
default: 3
task_readiness_period:
description: How often (in seconds) to perform the probe.
type: integer
default: 3
task_readiness_failure_threshold:
description: Minimum consecutive failures for the probe to be considered failed.
type: integer
default: 3
task_readiness_success_threshold:
description: Minimum consecutive successes for the probe to be considered successful after having failed.
type: integer
default: 1
task_readiness_timeout:
description: Number of seconds after which the probe times out.
type: integer
default: 10
task_resource_requirements:
description: Resource requirements for the task container
properties:
66 changes: 66 additions & 0 deletions roles/installer/templates/deployments/deployment.yaml.j2
@@ -224,6 +224,38 @@ spec:
{% if web_extra_env -%}
{{ web_extra_env | indent(width=12, first=True) }}
{% endif %}
livenessProbe:
exec:
command:
- /usr/bin/awx-manage
- check
initialDelaySeconds: {{ web_liveness_initial_delay }}
periodSeconds: {{ web_liveness_period }}
failureThreshold: {{ web_liveness_failure_threshold }}
successThreshold: {{ web_liveness_success_threshold }}
timeoutSeconds: {{ web_liveness_timeout }}
readinessProbe:
httpGet:
path: /api/v2/ping/
port: 8052
scheme: HTTP
initialDelaySeconds: {{ web_readiness_initial_delay }}
periodSeconds: {{ web_readiness_period }}
failureThreshold: {{ web_readiness_failure_threshold }}
successThreshold: {{ web_readiness_success_threshold }}
timeoutSeconds: {{ web_readiness_timeout }}
startupProbe:
exec:
command:
- /bin/bash
- -c
- |
! awx-manage showmigrations | grep '\[ \]'
initialDelaySeconds: 5
periodSeconds: 3
failureThreshold: 900
successThreshold: 1
timeoutSeconds: 5
resources: {{ web_resource_requirements }}
- image: '{{ _image }}'
name: '{{ ansible_operator_meta.name }}-task'
@@ -316,6 +348,40 @@ spec:
{% if task_extra_env -%}
{{ task_extra_env | indent(width=12, first=True) }}
{% endif %}
livenessProbe:
exec:
command:
- /bin/bash
- -c
- |
awx-manage run_dispatcher --running | grep '\[\]'

This will result in exit code 1 whenever jobs are running, as shown below:

bash-5.1# awx-manage run_dispatcher --running 
2023-02-01 11:40:24,449 WARNING  [-] awx.main.dispatch checking dispatcher running for awx-8495cbdf8d-ph8gp
['02b0dcb3-3d12-4f37-9ce8-28ffb61bee6e', '9abcb670-4140-4a4f-b7df-5c68017ecbd3', 'aa18e470-14f5-4735-9a62-4d7383b3946e']
bash-5.1# awx-manage run_dispatcher --running | grep '\[\]'
2023-02-01 11:40:29,592 WARNING  [-] awx.main.dispatch checking dispatcher running for awx-8495cbdf8d-ph8gp
bash-5.1# echo $?
1

So the probe will fail.

Maybe we should rely on the status command instead: awx-manage run_dispatcher --status
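
The exit code above can be reproduced without AWX, since it comes purely from how grep treats the pattern. The following is a minimal sketch using stand-in strings; the assumption that an idle dispatcher prints an empty list `[]` is inferred from the probe's pattern, not confirmed in this thread, and the variable and function names are illustrative only.

```shell
# Stand-ins for the output of `awx-manage run_dispatcher --running`.
# These are sample strings for illustration, not captured output.
idle_output="[]"
busy_output="['02b0dcb3-3d12-4f37-9ce8-28ffb61bee6e']"

# grep exits 0 only when the literal text "[]" appears in its input.
matches_empty_list() {
  printf '%s\n' "$1" | grep -q '\[\]'
}

matches_empty_list "$idle_output"; idle_status=$?
matches_empty_list "$busy_output"; busy_status=$?

echo "idle=$idle_status busy=$busy_status"
```

A non-empty list never contains the literal `[]`, so a liveness probe built on this pipeline would report failure on any container that is busy running jobs.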


I tested "awx-manage run_dispatcher --status" and it seems to work.

It returns exit status 1 when the connection to the database is refused.

Contributor

That does not look correct to me. I just tried simulating a failover.
This is from the awx-task log:
image

This is with the suggested command, before and after the failover:
image

This is before the failover, with the running command:
image

This is after the failover, with the running command:
image


@tanganellilore can you explain how you simulated a failover?

I did it by redirecting all database traffic on the pod to localhost using iptables.

Contributor

@tanganellilore, Feb 2, 2023

I have an external HA Postgres with a VIP address on the leader. I simply stop the leader and wait for a replica to be elected (5 to 10 seconds).
This means the connection is really dropped and awx-task has to open a new one.
I'm currently trying to add reconnection logic to AWX, but it is not very simple.

Contributor Author

So every time an election occurs, awx-task needs to reload?

Contributor Author

Anyway, it seems we are not really sure which command to use in the liveness probe.

Maybe we need to combine them?

Contributor

Yes, we can combine both commands with an OR condition: if either of them returns 1, we count the container as broken.
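
The OR combination proposed here could be sketched roughly as follows. This is only an illustration: `all_checks_pass` is a hypothetical helper, and it assumes every check signals failure through a non-zero exit code. The thread reports that behaviour for `--status` when the database connection is refused; whether `--running` also exits non-zero on a dropped connection is not established here.

```shell
# Hypothetical helper: run each argument as a command line and
# return 1 if any of them exits non-zero, 0 if all succeed.
all_checks_pass() {
  status=0
  for check in "$@"; do
    # Output is discarded; only the exit code matters to the probe.
    if ! sh -c "$check" >/dev/null 2>&1; then
      status=1
    fi
  done
  return "$status"
}

# Possible use as the probe command inside the task container
# (assumed, not tested against AWX):
# all_checks_pass "awx-manage run_dispatcher --status" \
#                 "awx-manage run_dispatcher --running"
```

Running every check rather than stopping at the first failure keeps the helper simple; a probe with a tight timeoutSeconds might prefer to return as soon as one check fails.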

initialDelaySeconds: {{ task_liveness_initial_delay }}
periodSeconds: {{ task_liveness_period }}
failureThreshold: {{ task_liveness_failure_threshold }}
successThreshold: {{ task_liveness_success_threshold }}
timeoutSeconds: {{ task_liveness_timeout }}
readinessProbe:
exec:
command:
- /usr/bin/awx-manage
- check
initialDelaySeconds: {{ task_readiness_initial_delay }}
periodSeconds: {{ task_readiness_period }}
failureThreshold: {{ task_readiness_failure_threshold }}
successThreshold: {{ task_readiness_success_threshold }}
timeoutSeconds: {{ task_readiness_timeout }}
startupProbe:
exec:
command:
- /bin/bash
- -c
- |
! awx-manage showmigrations | grep '\[ \]'
initialDelaySeconds: 5
periodSeconds: 3
failureThreshold: 900
successThreshold: 1
timeoutSeconds: 5
resources: {{ task_resource_requirements }}
- image: '{{ _control_plane_ee_image }}'
name: '{{ ansible_operator_meta.name }}-ee'