Retry on HTTP 50x errors (#603)
This updates our remote IO HTTP handler to check the status code of the response. If we get a 50x error, we'll retry up to some limit.

Closes #601

Authors:
  - Tom Augspurger (https://github.com/TomAugspurger)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Bradley Dice (https://github.com/bdice)

URL: #603
TomAugspurger authored Feb 24, 2025
1 parent 637cac5 commit 25051e6
Showing 21 changed files with 617 additions and 34 deletions.
1 change: 1 addition & 0 deletions conda/environments/all_cuda-118_arch-aarch64.yaml
@@ -29,6 +29,7 @@ dependencies:
- pre-commit
- pytest
- pytest-cov
- pytest-timeout
- python>=3.10,<3.13
- rangehttpserver
- rapids-build-backend>=0.3.0,<0.4.0.dev0
1 change: 1 addition & 0 deletions conda/environments/all_cuda-118_arch-x86_64.yaml
@@ -31,6 +31,7 @@ dependencies:
- pre-commit
- pytest
- pytest-cov
- pytest-timeout
- python>=3.10,<3.13
- rangehttpserver
- rapids-build-backend>=0.3.0,<0.4.0.dev0
1 change: 1 addition & 0 deletions conda/environments/all_cuda-128_arch-aarch64.yaml
@@ -29,6 +29,7 @@ dependencies:
- pre-commit
- pytest
- pytest-cov
- pytest-timeout
- python>=3.10,<3.13
- rangehttpserver
- rapids-build-backend>=0.3.0,<0.4.0.dev0
1 change: 1 addition & 0 deletions conda/environments/all_cuda-128_arch-x86_64.yaml
@@ -29,6 +29,7 @@ dependencies:
- pre-commit
- pytest
- pytest-cov
- pytest-timeout
- python>=3.10,<3.13
- rangehttpserver
- rapids-build-backend>=0.3.0,<0.4.0.dev0
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -139,6 +139,7 @@ set(SOURCES
"src/bounce_buffer.cpp"
"src/buffer.cpp"
"src/compat_mode.cpp"
"src/http_status_codes.cpp"
"src/cufile/config.cpp"
"src/cufile/driver.cpp"
"src/defaults.cpp"
22 changes: 21 additions & 1 deletion cpp/doxygen/main_page.md
@@ -107,10 +107,30 @@ To improve performance of small IO requests, `.pread()` and `.pwrite()` implemen
This setting can also be controlled by `defaults::gds_threshold()` and `defaults::gds_threshold_reset()`.

#### Size of the Bounce Buffer (KVIKIO_BOUNCE_BUFFER_SIZE)
KvikIO might have to use intermediate host buffers (one per thread) when copying between files and device memory. Set the environment variable ``KVIKIO_BOUNCE_BUFFER_SIZE`` to the size (in bytes) of these "bounce" buffers. If not set, the default value is 16777216 (16 MiB).
KvikIO might have to use intermediate host buffers (one per thread) when copying between files and device memory. Set the environment variable `KVIKIO_BOUNCE_BUFFER_SIZE` to the size (in bytes) of these "bounce" buffers. If not set, the default value is 16777216 (16 MiB).

This setting can also be controlled by `defaults::bounce_buffer_size()` and `defaults::bounce_buffer_size_reset()`.

#### HTTP Retries

The behavior when a remote IO read returns an error can be controlled through the `KVIKIO_HTTP_STATUS_CODES` and `KVIKIO_HTTP_MAX_ATTEMPTS` environment variables.
`KVIKIO_HTTP_STATUS_CODES` controls the status codes to retry, and `KVIKIO_HTTP_MAX_ATTEMPTS` controls the maximum number of attempts to make before throwing an exception.

When a response with a status code in the list of retryable codes is received, KvikIO will wait for some period of time before retrying the request.
It will keep retrying until reaching the maximum number of attempts.

By default, KvikIO will retry responses with the following status codes:

- 429
- 500
- 502
- 503
- 504

KvikIO will, by default, make three attempts per read.
Note that if you're reading a large file that has been split into multiple reads through KvikIO's task size setting, then *each* task will be retried up to the maximum number of attempts.

These settings can also be controlled by `defaults::http_max_attempts()`, `defaults::http_max_attempts_reset()`, `defaults::http_status_codes()`, and `defaults::http_status_codes_reset()`.
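
The retry behavior described above can be sketched as a small loop. This is an illustration only, not KvikIO's implementation: `request_with_retries` is a hypothetical helper, the request is modeled as a callable that returns an HTTP status code, and the doubling delay is an assumption, since the documentation only says KvikIO waits "for some period of time".

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper (not a KvikIO API): call `request` until it returns a
// status code that is not in `retryable`, or until `max_attempts` calls were made.
int request_with_retries(std::function<int()> const& request,
                         std::vector<int> const& retryable,
                         std::size_t max_attempts,
                         std::chrono::milliseconds delay)
{
  if (max_attempts == 0) { throw std::invalid_argument("max_attempts must be positive"); }
  for (std::size_t attempt = 1; attempt <= max_attempts; ++attempt) {
    int const status = request();
    bool const retry =
      std::find(retryable.begin(), retryable.end(), status) != retryable.end();
    // Anything outside the retryable list (success or not) goes back to the caller.
    if (!retry) { return status; }
    if (attempt < max_attempts) {
      // Assumed policy: sleep, then double the delay for the next attempt.
      std::this_thread::sleep_for(delay);
      delay *= 2;
    }
  }
  throw std::runtime_error("giving up after " + std::to_string(max_attempts) + " attempts");
}
```

With the default settings from this section, a call would pass `{429, 500, 502, 503, 504}` as `retryable` and `3` as `max_attempts`. Because the loop runs once per request, a file split into several tasks would go through it once per task.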

## Example

49 changes: 48 additions & 1 deletion cpp/include/kvikio/defaults.hpp
@@ -23,6 +23,7 @@
#include <string>

#include <kvikio/compat_mode.hpp>
#include <kvikio/http_status_codes.hpp>
#include <kvikio/shim/cufile.hpp>
#include <kvikio/threadpool_wrapper.hpp>

@@ -53,6 +54,9 @@ bool getenv_or(std::string_view env_var_name, bool default_val);
template <>
CompatMode getenv_or(std::string_view env_var_name, CompatMode default_val);

template <>
std::vector<int> getenv_or(std::string_view env_var_name, std::vector<int> default_val);

/**
* @brief Singleton class of default values used throughout KvikIO.
*
@@ -64,6 +68,8 @@ class defaults {
std::size_t _task_size;
std::size_t _gds_threshold;
std::size_t _bounce_buffer_size;
std::size_t _http_max_attempts;
std::vector<int> _http_status_codes;

static unsigned int get_num_threads_from_env();

@@ -153,7 +159,7 @@ class defaults {
* always use the same thread pool however it is possible to change number of
* threads in the pool (see `kvikio::defaults::thread_pool_nthreads_reset()`).
*
* @return The the default thread pool instance.
* @return The default thread pool instance.
*/
[[nodiscard]] static BS_thread_pool& thread_pool();

@@ -230,6 +236,47 @@ class defaults {
* @param nbytes The bounce buffer size in bytes.
*/
static void bounce_buffer_size_reset(std::size_t nbytes);

/**
* @brief Get the maximum number of attempts per remote IO read.
*
* Set the value using `kvikio::defaults::http_max_attempts_reset()` or by setting
* the `KVIKIO_HTTP_MAX_ATTEMPTS` environment variable. If not set, the value is 3.
*
* @return The maximum number of remote IO reads to attempt before raising an
* error.
*/
[[nodiscard]] static std::size_t http_max_attempts();

/**
* @brief Reset the maximum number of attempts per remote IO read.
*
* @param attempts The maximum number of attempts to try before raising an error.
*/
static void http_max_attempts_reset(std::size_t attempts);

/**
* @brief The list of HTTP status codes to retry.
*
* Set the value using `kvikio::defaults::http_status_codes_reset()` or by setting the
* `KVIKIO_HTTP_STATUS_CODES` environment variable. If not set, the default value is
*
* - 429
* - 500
* - 502
* - 503
* - 504
*
* @return The list of HTTP status codes to retry.
*/
[[nodiscard]] static std::vector<int> const& http_status_codes();

/**
* @brief Reset the list of HTTP status codes to retry.
*
* @param status_codes The HTTP status codes to retry.
*/
static void http_status_codes_reset(std::vector<int> status_codes);
};

} // namespace kvikio
39 changes: 39 additions & 0 deletions cpp/include/kvikio/http_status_codes.hpp
@@ -0,0 +1,39 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

namespace kvikio {
namespace detail {
/**
* @brief Parse a comma-separated string of HTTP status codes.
*
* @param env_var_name The environment variable holding the string.
* Used to report errors.
* @param status_codes The comma-separated string of HTTP status
* codes. Each code should be a 3-digit integer.
*
* @return The vector with the parsed, integer HTTP status codes.
*/
std::vector<int> parse_http_status_codes(std::string_view env_var_name,
std::string const& status_codes);
} // namespace detail

} // namespace kvikio
42 changes: 41 additions & 1 deletion cpp/src/defaults.cpp
@@ -16,6 +16,7 @@

#include <cstddef>
#include <cstdlib>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
@@ -24,10 +25,10 @@

#include <kvikio/compat_mode.hpp>
#include <kvikio/defaults.hpp>
#include <kvikio/http_status_codes.hpp>
#include <kvikio/shim/cufile.hpp>

namespace kvikio {

template <>
bool getenv_or(std::string_view env_var_name, bool default_val)
{
@@ -68,6 +69,17 @@ CompatMode getenv_or(std::string_view env_var_name, CompatMode default_val)
return detail::parse_compat_mode_str(env_val);
}

template <>
std::vector<int> getenv_or(std::string_view env_var_name, std::vector<int> default_val)
{
auto* const env_val = std::getenv(env_var_name.data());
if (env_val == nullptr) { return default_val; }
std::string const int_str(env_val);
if (int_str.empty()) { return default_val; }

return detail::parse_http_status_codes(env_var_name, int_str);
}

unsigned int defaults::get_num_threads_from_env()
{
int const ret = getenv_or("KVIKIO_NTHREADS", 1);
@@ -109,6 +121,19 @@ defaults::defaults()
}
_bounce_buffer_size = env;
}
// Determine the default value of `http_max_attempts`
{
ssize_t const env = getenv_or("KVIKIO_HTTP_MAX_ATTEMPTS", 3);
if (env <= 0) {
throw std::invalid_argument("KVIKIO_HTTP_MAX_ATTEMPTS has to be a positive integer");
}
_http_max_attempts = env;
}
// Determine the default value of `http_status_codes`
{
_http_status_codes =
getenv_or("KVIKIO_HTTP_STATUS_CODES", std::vector<int>{429, 500, 502, 503, 504});
}
}

defaults* defaults::instance()
@@ -177,4 +202,19 @@ void defaults::bounce_buffer_size_reset(std::size_t nbytes)
instance()->_bounce_buffer_size = nbytes;
}

std::size_t defaults::http_max_attempts() { return instance()->_http_max_attempts; }

void defaults::http_max_attempts_reset(std::size_t attempts)
{
if (attempts == 0) { throw std::invalid_argument("attempts must be a positive integer"); }
instance()->_http_max_attempts = attempts;
}

std::vector<int> const& defaults::http_status_codes() { return instance()->_http_status_codes; }

void defaults::http_status_codes_reset(std::vector<int> status_codes)
{
instance()->_http_status_codes = std::move(status_codes);
}

} // namespace kvikio
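
The `KVIKIO_HTTP_MAX_ATTEMPTS` handling above reads an integer from the environment, falls back to a default, and rejects non-positive values. A standalone sketch of that pattern follows. `positive_env_or` is a hypothetical helper rather than KvikIO's `getenv_or`, and treating an empty variable as "unset" is an assumption borrowed from the status-code specialization.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper (not KvikIO's `getenv_or`): read a positive integer from
// the environment, returning `default_val` when the variable is unset or empty.
long long positive_env_or(char const* name, long long default_val)
{
  char const* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') { return default_val; }

  std::string const text(raw);
  std::size_t parsed = 0;
  long long value    = 0;
  try {
    value = std::stoll(text, &parsed);
  } catch (std::exception const&) {
    parsed = 0;  // Conversion failed; the check below turns this into an error.
  }
  // Reject trailing garbage ("3x"), non-numeric input, zero, and negatives.
  if (parsed != text.size() || value <= 0) {
    throw std::invalid_argument(std::string{name} + " has to be a positive integer");
  }
  return value;
}
```

Under these assumptions, `positive_env_or("KVIKIO_HTTP_MAX_ATTEMPTS", 3)` would yield 3 when the variable is not set and throw when it is set to `0`.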
52 changes: 52 additions & 0 deletions cpp/src/http_status_codes.cpp
@@ -0,0 +1,52 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iterator>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

namespace kvikio {

namespace detail {
std::vector<int> parse_http_status_codes(std::string_view env_var_name,
std::string const& status_codes)
{
// Ensure `status_codes` consists only of 3-digit integers separated by commas, allowing spaces.
std::regex const check_pattern(R"(^\s*\d{3}\s*(\s*,\s*\d{3}\s*)*$)");
if (!std::regex_match(status_codes, check_pattern)) {
throw std::invalid_argument(std::string{env_var_name} +
": invalid format, expected comma-separated integers.");
}

// Match every integer in `status_codes`.
std::regex const number_pattern(R"(\d+)");

// For each match, we push_back `std::stoi(match.str())` into `ret`.
std::vector<int> ret;
std::transform(std::sregex_iterator(status_codes.begin(), status_codes.end(), number_pattern),
std::sregex_iterator(),
std::back_inserter(ret),
[](std::smatch const& match) -> int { return std::stoi(match.str()); });
return ret;
}

} // namespace detail

} // namespace kvikio
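
To experiment with the parsing rules outside the library, the same validate-then-extract approach can be copied into a standalone function. The name `parse_codes` and the shortened error message are illustrative and not part of KvikIO.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative standalone copy of the approach above; `parse_codes` is a
// hypothetical name, not the KvikIO symbol.
std::vector<int> parse_codes(std::string const& text)
{
  // Step 1: the whole string must be 3-digit integers separated by commas,
  // with optional whitespace around each code.
  std::regex const check_pattern(R"(^\s*\d{3}\s*(\s*,\s*\d{3}\s*)*$)");
  if (!std::regex_match(text, check_pattern)) {
    throw std::invalid_argument("expected comma-separated 3-digit integers");
  }

  // Step 2: extract every run of digits and convert it to an int.
  std::regex const number_pattern(R"(\d+)");
  std::vector<int> codes;
  std::transform(std::sregex_iterator(text.begin(), text.end(), number_pattern),
                 std::sregex_iterator(),
                 std::back_inserter(codes),
                 [](std::smatch const& match) { return std::stoi(match.str()); });
  return codes;
}
```

An input such as `" 429, 500 ,503"` passes validation, while `"429,50"`, `"429,,500"`, and the empty string are rejected. The empty-string case is why the `getenv_or` specialization returns the default before calling the parser.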