feat: Add retries to requests sessions #14740
Conversation
@@ -227,6 +227,7 @@ def method_auth(**kwargs):
    cacert = kwargs.get('cacert', None)

    sess = requests.Session()
    sess.mount(url, requests.adapters.HTTPAdapter(max_retries=5))
does HTTPAdapter have backoff between retries? or does it wait a fixed amount of time between posts/gets? is this configurable?
what http code triggers the retry?
HTTPAdapter passes the max_retries argument to urlopen from urllib3.poolmanager.PoolManager. Based on the description at https://urllib3.readthedocs.io/en/stable/reference/urllib3.connectionpool.html#urllib3.HTTPConnectionPool.urlopen there is no built-in back-off behaviour, unless we use a Retry object, in which case all of this is configurable.

max_retries is only used for things like DNS lookup errors, connection timeouts, and so on; as soon as the request hits the server it no longer applies. See https://docs.python-requests.org/en/latest/api/#requests.adapters.HTTPAdapter
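For reference, a minimal sketch (not from this PR) of what a configurable back-off could look like when mounting the adapter with a Retry object; the URL value is a placeholder:

```python
# Sketch only, not part of this PR: passing a urllib3 Retry object instead of
# a bare integer makes back-off and retried status codes configurable.
import requests
from urllib3.util.retry import Retry

url = "https://vault.example.com"  # placeholder for the plugin's Vault URL

retries = Retry(
    total=5,                            # overall retry budget
    connect=5,                          # retries for connection-level failures
    backoff_factor=0.5,                 # sleep roughly 0.5s, 1s, 2s, ... between attempts
    status_forcelist=[502, 503, 504],   # also retry on these HTTP status codes
)

sess = requests.Session()
sess.mount(url, requests.adapters.HTTPAdapter(max_retries=retries))
```

Note that status-code based retries only apply to idempotent methods by default, so POST requests would need the allowed methods adjusted explicitly.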
@@ -263,6 +264,7 @@ def kv_backend(**kwargs):
    }

    sess = requests.Session()
    sess.mount(url, requests.adapters.HTTPAdapter(max_retries=5))
I see that the sess.post calls below are already in a retry loop, so I'm wondering whether this mount is needed, and how it will interact with the retry logic below.
The max_retries for HTTPAdapter are only applicable to connection-level errors, while the retry loop is only relevant in case of an HTTP 412 (returned when a resource in Vault isn't replicated yet).
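To illustrate the distinction, a hypothetical sketch of how the two layers coexist; the URL, timeout, and sleep values are placeholders and this is not the plugin's actual code:

```python
# Hypothetical sketch: adapter-level retries vs. an application-level 412 loop.
import time
import requests

url = "https://vault.example.com/v1/secret/data/app"  # placeholder
sess = requests.Session()

# Connection-level retries (DNS failures, connect timeouts, ...) are handled
# by the adapter before a response ever reaches the caller.
sess.mount(url, requests.adapters.HTTPAdapter(max_retries=5))

# Application-level retry loop: only relevant once the server has answered,
# e.g. with HTTP 412 while a Vault resource is not replicated yet.
for _ in range(5):
    response = sess.get(url, timeout=10)
    if response.status_code != 412:
        break
    time.sleep(1)
response.raise_for_status()
```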
Thanks for this contribution, I dropped a few comments. Have you tested this out at all? Does it seem to help reduce connection timeouts when using the plugin? Also, how many jobs do you have starting at once when you see the timeout errors? Do you think staggering job starts might be a more generalized solution to this type of problem? We suspect other plugins could have similar problems to this one.
@fosterseth Thanks for the feedback. We're currently using these modifications on our PRD Ansible Automation Platform instance. Since making these changes we haven't run into this issue anymore; before, we hit it at least a few times per week. The number and start times of jobs run are pretty much the same, as most of our jobs are scheduled. We usually have about 100 jobs starting at the same time when we encounter this.

Staggering jobs definitely sounds like a better solution. We're currently doing manual "load distribution" by offsetting start times for most jobs by some logical grouping of inventories (it seems we missed some, hence the 100 simultaneous jobs), but that's rather cumbersome. It would be great if we could schedule all jobs of a certain type at the same time and have AWX/AAP add them to some sort of queue and process them when possible.
Hey @kwevers, we played around a bit with backoff. We found the retry behavior of your PR (without the use of a "custom" Retry object)

Simulate hitting a connection timeout and triggering the backoff timeout

Simulate hitting the
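A rough stand-in for the first simulation, assuming a non-routable placeholder address; this is an assumption, not the reviewer's original snippet:

```python
# Assumed reproduction sketch: force connect timeouts against a non-routable
# address and watch the Retry back-off kick in.  Not the reviewer's snippet.
import requests
from urllib3.util.retry import Retry

url = "http://10.255.255.1/"  # non-routable placeholder, connects will time out
retries = Retry(connect=3, backoff_factor=1)  # increasing sleeps between attempts

sess = requests.Session()
sess.mount(url, requests.adapters.HTTPAdapter(max_retries=retries))

try:
    sess.get(url, timeout=2)
except requests.exceptions.ConnectionError as exc:
    print("gave up after retries:", exc)
```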
Force-pushed 9cebe78 to 7d930d9
SUMMARY
Every so often we get connection timed out errors towards our HCP Vault endpoint, usually when a larger number of jobs is running simultaneously. Since requests for other jobs still succeed, this is probably load related, and adding a retry should help make this a bit more robust.

A structural fix for our specific environment is probably better spreading of jobs / scaling of HCP Vault nodes, but this should already improve things.
ISSUE TYPE
COMPONENT NAME
AWX VERSION
We're running Ansible Automation Platform Controller 4.4.8
ADDITIONAL INFORMATION