
feat: Add retries to requests sessions #14740

Merged — 1 commit merged into ansible:devel from bugfix/hashivault-retries on Jan 24, 2024

Conversation

@kwevers (Contributor) commented Dec 28, 2023

SUMMARY

Every so often we get connection timed out errors towards our HCP Vault endpoint, usually when a larger number of jobs is running simultaneously. Since requests for other jobs still succeed, this is probably load related, and adding a retry should help make this a bit more robust.

A structural fix for our specific environment would probably be better spreading of jobs or scaling of HCP Vault nodes, but this change should already improve things.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • Other
AWX VERSION
N/A

We're running Ansible Automation Platform Controller 4.4.8

ADDITIONAL INFORMATION
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 515, in run
    private_data_files, ssh_key_data = self.build_private_data_files(self.instance, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 249, in build_private_data_files
    private_data = self.build_private_data(instance, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 1151, in build_private_data
    private_data['credentials'][credential] = credential.get_input('ssh_key_data', default='')
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/credential/__init__.py", line 279, in get_input
    return self._get_dynamic_input(field_name)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/credential/__init__.py", line 313, in _get_dynamic_input
    return input_source.get_input_value()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/models/credential/__init__.py", line 1260, in get_input_value
    return backend(**backend_kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/credential_plugins/hashivault.py", line 209, in kv_backend
    token = handle_auth(**kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/credential_plugins/hashivault.py", line 164, in handle_auth
    token = method_auth(**kwargs, auth_param=approle_auth(**kwargs))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/credential_plugins/hashivault.py", line 202, in method_auth
    resp = sess.post(request_url, **request_kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/requests/sessions.py", line 635, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/requests/adapters.py", line 553, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='vault-cluster-$REDACTED.hashicorp.cloud', port=8200): Max retries exceeded with url: /v1/auth/approle/login (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f80b5594430>, 'Connection to vault-cluster-$REDACTED.hashicorp.cloud timed out. (connect timeout=30)'))

@kwevers kwevers marked this pull request as ready for review December 28, 2023 14:55
@@ -227,6 +227,7 @@ def method_auth(**kwargs):
cacert = kwargs.get('cacert', None)

sess = requests.Session()
sess.mount(url, requests.adapters.HTTPAdapter(max_retries=5))
Member
does HTTPAdapter have backoff between retries? or does it wait a fixed amount of time between posts/gets? is this configurable?

what http code triggers the retry?

Contributor Author

HTTPAdapter passes the max_retries argument through to urlopen on urllib3's PoolManager.
Based on the description at https://urllib3.readthedocs.io/en/stable/reference/urllib3.connectionpool.html#urllib3.HTTPConnectionPool.urlopen there is no built-in back-off behaviour unless we pass a Retry object, in which case all of this is configurable.

max_retries only covers things like DNS lookup errors and connection timeouts; as soon as the request reaches the server it no longer applies.
See https://docs.python-requests.org/en/latest/api/#requests.adapters.HTTPAdapter
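
For reference, a minimal sketch (not part of this PR) of how backoff could be made configurable by mounting the adapter with a urllib3 Retry object instead of a plain integer; the values below are only illustrative:

# retry_backoff_sketch.py (illustrative only, not the plugin's code)
import requests
from urllib3.util.retry import Retry

retries = Retry(
    total=5,                            # overall retry budget
    connect=5,                          # retries for connection-level errors
    backoff_factor=0.5,                 # exponential back-off between attempts
    status_forcelist=[502, 503, 504],   # optionally retry on these HTTP codes too
)

sess = requests.Session()
sess.mount('https://', requests.adapters.HTTPAdapter(max_retries=retries))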

@@ -263,6 +264,7 @@ def kv_backend(**kwargs):
}

sess = requests.Session()
sess.mount(url, requests.adapters.HTTPAdapter(max_retries=5))
Member

I see that the sess.post calls below are already in a retry loop, so I'm wondering if this mount is needed, or how it will interact with the retry logic below.

Contributor Author

The max_retries for HTTPAdapter only applies to connection-level errors, while the retry loop is only relevant in case of an HTTP 412 (returned when a resource in Vault isn't replicated yet), so the two don't overlap.
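
To illustrate how the two layers complement each other, here is a rough sketch (the loop shape, counts, and URL are assumptions, not the plugin's exact code): the mounted adapter retries connection failures, while the application-level loop re-issues the request on HTTP 412.

# two_retry_layers_sketch.py (illustrative only)
import time
import requests

sess = requests.Session()
# Connection-level retries: DNS failures, connect timeouts, etc.
sess.mount('https://', requests.adapters.HTTPAdapter(max_retries=5))

# Application-level retries: Vault can return HTTP 412 until a newly written
# secret has replicated to the node answering the request.
url = 'https://vault.example.com/v1/secret/data/app'   # placeholder URL
for attempt in range(5):
    resp = sess.get(url, timeout=30)
    if resp.status_code != 412:
        break
    time.sleep(1)
resp.raise_for_status()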

@fosterseth (Member) commented Jan 3, 2024

thanks for this contribution, I dropped a few comments

have you tested this out at all? does it seem to help reduce getting connection timeouts when using the plugin?

also how many jobs do you have starting at once (when you see the timeout errors)?

Do you think staggering job starts might be a more generalized solution to this type of problem? We suspect other plugins could have similar problems to this one

@kwevers (Contributor Author) commented Jan 8, 2024

@fosterseth Thanks for the feedback.

We're currently using these modifications on our PRD Ansible Automation Platform instance. Since making these changes we haven't run into this issue anymore, whereas before we hit it at least a few times per week. The number of jobs and their start times are pretty much the same from week to week, since most of our jobs are scheduled.

We usually have about 100 jobs starting at the same time when we encounter this.

Staggering jobs definitely sounds like a better solution. We're currently doing manual "load distribution" by offsetting start times for most jobs by some logical grouping of inventories (it seems we missed some, hence the 100 simultaneous jobs), but that's rather cumbersome. It would be great if we could schedule all jobs of a certain type at the same time and have AWX/AAP add them to some sort of queue and process them when possible.

@chrismeyersfsu (Member) commented

Hey @kwevers, we played around a bit with backoff. We found the retry behavior of your PR (without the use of a "custom" Retry) to be desirable. Just leaving this as a note, with the code we used to observe the behavior.

Simulate hitting a connection timeout and triggering the backoff timeout

  • Open 1 terminal
  • run python3 main.py

Simulate hitting the timeout= and triggering the backoff timeout

  • Open 2 terminals
  • run nc -l 4051 -k
  • run python3 main.py

Simulate a 404 and see that there is no retry

  • Open 2 terminals
  • run python3 -m http.server 4051
  • run python3 main.py
# main.py
import requests
from urllib3.util.retry import Retry

sess = requests.Session()
# Retry up to 3 times in total, with a short exponential back-off between attempts
r = Retry(backoff_factor=.1, total=3)
adapter = requests.adapters.HTTPAdapter(max_retries=r)
# Apply the retrying adapter to all http:// URLs on this session
sess.mount('http://', adapter)

# timeout=3 applies to both the connect and read phases of the request
res = sess.get('http://localhost:4051/hello', timeout=3)

@fosterseth fosterseth force-pushed the bugfix/hashivault-retries branch from 9cebe78 to 7d930d9 Compare January 24, 2024 20:27
@fosterseth fosterseth enabled auto-merge (rebase) January 24, 2024 20:28
@fosterseth fosterseth merged commit d4f7bfe into ansible:devel Jan 24, 2024
18 of 21 checks passed