Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry token bucket not set #1234

Closed
1 task done
aajtodd opened this issue Jan 7, 2025 · 2 comments
Closed
1 task done

retry token bucket not set #1234

aajtodd opened this issue Jan 7, 2025 · 2 comments
Assignees
Labels
bug This issue is a bug. p1 This is a high priority issue potential-regression Marking this issue as a potential regression to be checked by team member

Comments

@aajtodd
Copy link
Contributor

aajtodd commented Jan 7, 2025

Describe the bug

While investigating a cross SDK issue I discovered we are not setting a token bucket in the config bag. This causes retries to only fail when max attempts are exceeded not when the token bucket has exhausted it's retry quota.

This appears to have been unintentionally removed in smithy-lang/smithy-rs#3071.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

We should fail long before a large number of max attempts are hit if every request is failing as the retry quota should be exhausted

Current Behavior

Failure only happens when max attempts are exceeded.

Without a token bucket configured the standard retry strategy never acquires permits.

Reproduction Steps

Reproduction, logs show we only ever fail on max retries.

use std::{error::Error, sync::{atomic::{AtomicBool, Ordering}, Arc}, time::Duration};
use aws_config::retry::RetryConfig;
use aws_smithy_runtime::client::http::test_util::infallible_client_fn;
use aws_sdk_sts::error::DisplayErrorContext;

type BoxError = Box<dyn Error + Send + Sync>;

#[tokio::main]
async fn main() -> Result<(), BoxError> {
    tracing_subscriber::fmt::init();

    let error_xml = r#"<ErrorResponse>
    <Error>
        <Type>Server</Type>
        <Code>Unknown Error</Code>
        <Message>Test bucket behavior</Message>
    </Error>
    <RequestId>foo-id</RequestId>
</ErrorResponse>"#;

    let success_xml = r#"
<GetCallerIdentityResponse xmlns="https://sts.amazonaws.com/doc/2011-06-15/">
  <GetCallerIdentityResult>
  <Arn>arn:aws:iam::123456789012:user/Alice</Arn>
  <UserId>AIDACKCEVSQ6C2EXAMPLE</UserId>
  <Account>123456789012</Account>
  </GetCallerIdentityResult>
  <ResponseMetadata>
    <RequestId>01234567-89ab-cdef-0123-456789abcdef</RequestId>
  </ResponseMetadata>
</GetCallerIdentityResponse>"#;

    let allow_success = Arc::new(AtomicBool::new(false));

    let allowed = allow_success.clone();
    let http_client = infallible_client_fn(move |_req| {
        let allow_success = allowed.load(Ordering::SeqCst);
        if allow_success {
            http_02x::Response::builder().status(200).body(success_xml).unwrap()
        }else {
            http_02x::Response::builder().status(500).body(error_xml).unwrap()
        }
    });
    
    let aws_config = aws_config::defaults(aws_config::BehaviorVersion::latest())
        .retry_config(
            RetryConfig::standard().with_max_attempts(100).with_max_backoff(Duration::from_millis(10))
        )
        .http_client(http_client)
        .load().await;

    let sts = aws_sdk_sts::Client::new(&aws_config);
    let result = sts.get_caller_identity().send().await;
    let err = result.expect_err("error expected");
    println!("error: {:#?}", DisplayErrorContext(&err));

    println!("second attempt!!!");

    allow_success.store(true, Ordering::SeqCst);
    let result2 = sts.get_caller_identity().send().await;
    let _ = result2.expect("expect success");

    println!("third attempt!!!");

    allow_success.store(false, Ordering::SeqCst);
    let result3 = sts.get_caller_identity().send().await;
    let err3 = result3.expect_err("error expected");
    println!("error 3: {:#?}", DisplayErrorContext(&err3));


    println!("next client, first attempt!");
    let sts_2 = aws_sdk_sts::Client::new(&aws_config);
    let result4 = sts_2.get_caller_identity().send().await;
    let err4 = result4.expect_err("error expected");
    println!("error: {:#?}", DisplayErrorContext(&err4));
    
    Ok(())
}

Possible Solution

Token bucket should always be present. We probably want it tied to the retry partition at the very least although our default retry partitioning seems to only use the service name and doesn't account for e.g. region

Additional Information/Context

No response

Version

Tested with `aws-sdk-sts@1.54.0`, `aws-smithy-runtime@1.76.0`

Environment details (OS name and version, etc.)

macos

Logs

No response

@aajtodd aajtodd added bug This issue is a bug. p1 This is a high priority issue labels Jan 7, 2025
@github-actions github-actions bot added the potential-regression Marking this issue as a potential regression to be checked by team member label Jan 7, 2025
@aajtodd aajtodd self-assigned this Jan 7, 2025
github-merge-queue bot pushed a commit to smithy-lang/smithy-rs that referenced this issue Jan 14, 2025
…odes (#3964)

## Motivation and Context
<!--- Why is this change required? What problem does it solve? -->
<!--- If it fixes an open issue, please link to the issue here -->
awslabs/aws-sdk-rust#1234

## Description
<!--- Describe your changes in detail -->

PR adds a new interceptor registered as part of the default retry plugin
components that ensures a token bucket is _always_ present and available
to the retry strategy. The buckets are partitioned off the retry
partition (which defaults to the service name and is already set by the
default plugin). We use a `static` variable in the runtime for this
which means that token buckets can and will apply to every single client
that uses the same retry partition. The implementation tries to avoid
contention on this new global lock by only consulting it if the retry
partition is overridden after client creation.

For AWS SDK clients I've updated the default retry partition clients are
created with to include the region when set. Now the default partition
for a client will be `{service}-{region}` (e.g. `sts-us-west-2`) rather
than just the service name (e.g. `sts`). This partitioning is a little
more granular and closer to what we want/expect as failures in one
region should not cause throttling to another (and vice versa for
success in one should not increase available quota in another).

I also updated the implementation to follow the SEP a little more
literally/closely as far as structure which fixes some subtle bugs.
State is updated in one place and we ensure that the token bucket is
always consulted (before the token bucket could be skipped in the case
of adaptive retries returning a delay and the adaptive rate limit was
updated in multiple branches).


## Testing
<!--- Please describe in detail how you tested your changes -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->

## Checklist
<!--- If a checkbox below is not applicable, then please DELETE it
rather than leaving it unchecked -->
- [x ] For changes to the smithy-rs codegen or runtime crates, I have
created a changelog entry Markdown file in the `.changelog` directory,
specifying "client," "server," or both in the `applies_to` key.
- [ x] For changes to the AWS SDK, generated SDK code, or SDK runtime
crates, I have created a changelog entry Markdown file in the
`.changelog` directory, specifying "aws-sdk-rust" in the `applies_to`
key.

----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._
@aajtodd
Copy link
Contributor Author

aajtodd commented Jan 14, 2025

Fix will go out in next smithy-rs release.

@aajtodd aajtodd closed this as completed Jan 14, 2025
Copy link

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

landonxjames pushed a commit to smithy-lang/smithy-rs that referenced this issue Jan 14, 2025
…odes (#3964)

<!--- Why is this change required? What problem does it solve? -->
<!--- If it fixes an open issue, please link to the issue here -->
awslabs/aws-sdk-rust#1234

<!--- Describe your changes in detail -->

PR adds a new interceptor registered as part of the default retry plugin
components that ensures a token bucket is _always_ present and available
to the retry strategy. The buckets are partitioned off the retry
partition (which defaults to the service name and is already set by the
default plugin). We use a `static` variable in the runtime for this
which means that token buckets can and will apply to every single client
that uses the same retry partition. The implementation tries to avoid
contention on this new global lock by only consulting it if the retry
partition is overridden after client creation.

For AWS SDK clients I've updated the default retry partition clients are
created with to include the region when set. Now the default partition
for a client will be `{service}-{region}` (e.g. `sts-us-west-2`) rather
than just the service name (e.g. `sts`). This partitioning is a little
more granular and closer to what we want/expect as failures in one
region should not cause throttling to another (and vice versa for
success in one should not increase available quota in another).

I also updated the implementation to follow the SEP a little more
literally/closely as far as structure which fixes some subtle bugs.
State is updated in one place and we ensure that the token bucket is
always consulted (before the token bucket could be skipped in the case
of adaptive retries returning a delay and the adaptive rate limit was
updated in multiple branches).

<!--- Please describe in detail how you tested your changes -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->

<!--- If a checkbox below is not applicable, then please DELETE it
rather than leaving it unchecked -->
- [x ] For changes to the smithy-rs codegen or runtime crates, I have
created a changelog entry Markdown file in the `.changelog` directory,
specifying "client," "server," or both in the `applies_to` key.
- [ x] For changes to the AWS SDK, generated SDK code, or SDK runtime
crates, I have created a changelog entry Markdown file in the
`.changelog` directory, specifying "aws-sdk-rust" in the `applies_to`
key.

----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p1 This is a high priority issue potential-regression Marking this issue as a potential regression to be checked by team member
Projects
None yet
Development

No branches or pull requests

1 participant