
Retry and delay support #11


Open · wants to merge 6 commits into main

Conversation

robegan21

added retry loop to http client when error 429 is returned
added default delay to contracts and trades based on ledgerx specs

Changes include:

Rob Egan added 3 commits April 1, 2021 00:07
added default delay to contracts and trades based on ledgerx specs
@westonplatter
Owner

@robegan21 overall, I really like the code changes. thanks for contributing the work. I'll try to look through and use the changes locally this week.

@westonplatter
Owner

@robegan21 Looked at the additions. Overall, I think rate limiting logic is needed, but the client isn't the right place for it.

Halfway through the code, I started wondering whether adding sleeps here is really helpful long term.

The use case in my mind is a small crypto hedge fund that collects data within a background jobs framework (eg, Celery, Airflow, or Argo). I imagine that the master process has somewhere between 10 and 100 workers fetching data from Ledgerx, and there's likely a rate-limited job queue specific to each rate-limited endpoint.

A Celery-specific example:

```python
task_annotations = {
    'tasks.ledgerx_contracts_list': {'rate_limit': '10/m'},
    'tasks.ledgerx_trades_global': {'rate_limit': '10/m'},

    # implicitly, all other tasks not listed here have no rate_limit
}
```

In the current PR, the ledgerx client is responsible for deciding how/when to retry rate limited requests. This means that within a specific job run, the ledgerx client might hit a rate limit error, retry the request, and then (for this example) successfully complete the API request.

The downside of this is that the master process doesn't get the chance to govern how and when the 2nd API fetch happens, and thus the client bypasses the master's rate limiting configs. This then impacts all other jobs in the rate-limited queue. While this may work, it makes debugging harder, since a developer would need to consider retry logic in (1) the job execution framework, (2) the client, and (3) the combination of both 1 and 2.

My opinion is that while the rate limiting logic in this PR is helpful, it's better for it to live in job execution configs - not the client.

@robegan21
Author

@westonplatter Thanks for the response. I'd say that my use case is different from yours and does not include a larger workflow manager... I'm building a stand-alone server and intend to submit a few more pull requests, one to support the websocket from ledgerx and the other to track and model the entire market state, if you are interested in those.

For this PR, I took your advice and removed the default delays for the API handles that have rate limits, BUT included static class variables that can be overridden for my stand-alone use cases.

In my use case, I've found that the automatic retry when 429 errors are returned is essential for stability and correctness.

@robegan21
Author

And similarly, I added a static HttpClient.RETRY_429_ERRORS variable to control how that utility handles any 429 errors it encounters, so that the default behavior does not interfere with a workflow manager's decisions.
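A minimal sketch of the opt-in retry toggle described above. Only the names `HttpClient` and `RETRY_429_ERRORS` come from this thread; the default delay, retry cap, and `Retry-After` handling are illustrative assumptions, not the actual code on the branch.

```python
import time


class HttpClient:
    # Class-level knobs, overridable by a stand-alone caller.
    # Values are illustrative, not the branch's actual defaults.
    RETRY_429_ERRORS = False   # off by default, so a job framework can govern retries
    RETRY_DELAY_SECONDS = 1.0  # hypothetical backoff base
    MAX_RETRIES = 5            # hypothetical cap before giving up

    def __init__(self, session):
        self.session = session  # a requests.Session-like object

    def get(self, url, params=None):
        """GET with optional retry on HTTP 429 (rate limited)."""
        attempt = 0
        while True:
            response = self.session.get(url, params=params)
            if response.status_code != 429 or not self.RETRY_429_ERRORS:
                response.raise_for_status()
                return response
            attempt += 1
            if attempt > self.MAX_RETRIES:
                response.raise_for_status()
            # Honor a Retry-After header if present, else back off linearly.
            delay = float(response.headers.get(
                "Retry-After", self.RETRY_DELAY_SECONDS * attempt))
            time.sleep(delay)
```

With `RETRY_429_ERRORS` left at `False`, a 429 raises immediately and the job framework stays in control; a stand-alone server flips the class attribute once at startup.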

@westonplatter
Owner

@robegan21 sorry for the long delay.

I can appreciate the functionality you're working on; however, ledgerx-python isn't the right place for it. Adding variables and/or hooks in ledgerx-python which are not used creates more overhead and complexity for the codebase. I think the functionality you're describing would work really well in a library/module that uses ledgerx-python as a dependency.

I'm going to close this PR. If you think it's worth talking about more, let's do a Google Hangout.

@robegan21
Author

Hi Weston,

I'm happy to discuss out of band, instead of in this issue tracker, but consider the list_all method in the GenericResource class, which uses the HttpClient::get method.

IMO, the list_all method should either return all the results or fail entirely and throw an exception, which is the case in the master version. Unfortunately, in my case the list_all function only succeeded sometimes, and I discovered this was because of the 429 rate-limiting errors... if the requested data set is large enough that it requires more HTTP GET requests than the ledgerx server allows in the time window, list_all could never return successfully.

So, without this PR, where HttpClient::get has the (non-default) option to retry 429 errors, do you really want a 429 error to be raised as an exception to the calling program, which now has no handle on the internal progress of list_all (i.e. json_data['meta']['next']), nor the list of partial results it already collected? This seems wasteful, as the calling program has no a priori knowledge of the size of the returned data set and will have to request the first parts of the data again from the server in its own implementation of list_all that tracks the 'next' url.
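The pagination pattern described above can be sketched as follows. This is a minimal illustration, not the actual ledgerx-python code: only `GenericResource`, `list_all`, `HttpClient::get`, and the `json_data['meta']['next']` cursor come from this thread, and the assumed response shape is `{"data": [...], "meta": {"next": <url-or-None>}}`.

```python
class GenericResource:
    """Minimal sketch of the paginated list_all pattern under discussion."""

    def __init__(self, http_client, base_url):
        self.http_client = http_client
        self.base_url = base_url

    def list_all(self):
        results = []
        url = self.base_url
        while url:
            # If HttpClient.get raises on a 429 here, both the partial
            # `results` and the current meta['next'] cursor are lost to
            # the caller -- the failure mode described above.
            json_data = self.http_client.get(url).json()
            results.extend(json_data["data"])
            url = json_data["meta"]["next"]
        return results
```

The sketch makes the trade-off concrete: the retry either lives inside this loop (or in `get`), or every caller must re-implement the cursor tracking itself to recover from a mid-pagination 429.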

So... imo ledgerx-python should either not support the list_all methods at all or include the ability to actually return all the results of a long list successfully. There are decent arguments for moving the 429 retry logic into the GenericResource::list_all function itself instead of the HttpClient::get method, but I think the non-default argument to retry is easier to maintain in HttpClient.

This error pattern was my primary motivation for implementing the retry, because the list_all method was simply not working reliably without it.

IMO, the calling program should be able to rely on the client library to handle recoverable errors, and if the calling program has its own, better mechanisms to handle them, it can choose not to use that functionality. In this instance, that is especially true for these 429 errors from ledgerx, because the details of how to handle and recover from them are specific to ledgerx's JSON request/response API.

-Rob

@westonplatter
Owner

> So... imo ledgerx-python should either not support the list_all methods at all or include the ability to actually return all the results of a long list successfully. There are decent arguments for moving the 429 retry logic into the GenericResource::list_all function itself instead of the HttpClient::get method, but I think the non-default argument to retry is easier to maintain in HttpClient.

@robegan21 valid point. Thanks for highlighting this. My headspace is fully occupied with a time-sensitive project at work. I'm going to reopen this and then revisit when time allows.

@westonplatter westonplatter reopened this May 10, 2021
@a-re a-re deleted the AddDelaySupport branch January 1, 2022 18:37