-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Implementing Adaptive Retry Behavior #2171
Conversation
## Terminology | ||
|
||
- **Retry (Behavior / Strategy)**: How the SDK reacts to retryable errors. | ||
- **Standard Retry Behavior (SRB)**: The default behavior. When a retryable error is received, this behavior calculates an exponential backoff time based on how many retry attemps have already been made and then sleeps for that long before making another attempt. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard retry policy that we have implemented has a token bucket in it. So I am unclear on how the Adaptive retry policy differs from the standard one then?
Also, there is a TODO in the code for RetryPartition. Do we need to add something on what that is in the RFC?
- `503` errors with the `RequestLimitExceeded` or `SlowDown` code. | ||
- `509` errors with the `BandwidthLimitExceeded` code. | ||
- Errors modeled with the `@retryable(throttling: true)` trait. | ||
- **Transient Error**: An error that should be retried as soon as is convenient. [Currently][transient-error-classifier], the Rust SDK treats the following as transient errors: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RFC is clear on how a throttling error differs from a transient error, but it is not clear on how the retry behavior differs between the two?
When we say "as soon as is convenient" do we mean that in case of a Transient Error we do not look into the token bucket for an available token?
Currently, the Transient Error may have a different retry cost. Do we need to update the RFC to include the retry cost?
Can a user define their own RetryClassifier and define their own set of throttling and transient errors?
let retry_config = RetryConfigBuilder::standard() | ||
.max_attempts(3) | ||
.initial_backoff(Duration::from_secs(1)) | ||
.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RetryConfigBuilder::standard()
also has a method called with_retry_mode
. To choose adaptive retries, does one have to do the following:
let retry_config = RetryConfig::standard()
.with_retry_mode(aws_smithy_types::retry::RetryMode::Adaptive);
.load() | ||
.await; | ||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation also has ReconnectMode
. Can we please have a description on what that is.
- Is opt-in. The existing retry behavior will remain the default. | ||
- Will perform similarly to the existing retry behavior when comparing the rate of successful requests. | ||
- Will improve the experience of customers that use the SDK to send very large numbers of concurrent requests. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see that the RFC is a draft but I had a few comments so I thought I might as well add them now.
- I couldn't find anything on how does one turn off retries. Is it going to be:
let retry_config = RetryConfig::disabled();
Which I am assuming is the same as setting:
let retry_config = RetryConfig::standard().with_max_attempts(1);
-
Are we going to allow retry policy to be configured separately for specific operations? If yes, what does the API look for that?
-
The RFC doesn't mention the RetryClassifier that each operation can have. Can we please add a section on how does that play a role in the policy behavior? and, how does one specify a RetryClassifier on a particular operation or the overall client.
-
What metrics will the client emit by default? and:
a) Can a user disable metrics?
b) Can a custom metric be emitted on a particular lifecycle event?
- `LimitExceededException` | ||
- `PriorRequestNotComplete` | ||
- `403` errors with the `RequestThrottled` code. | ||
- `502` errors with the `EC2ThrottledException` code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, a 502 that does not have EC2ThrottledException
is still going to be a transient error?
Also, are the codes case sensitive?
- [**Retryable Trait**](https://awslabs.github.io/smithy/spec/core.html#retryable-trait): a smithy trait indicating that an error may be retried by the client. | ||
- **(HTTP / Retry) Request**: An HTTP request. One or more of these will be made during an **Operation Request**. | ||
- **Operation Request**: A request to a smithy service. Each operation request may contain several **HTTP requests**. | ||
- **Throttling Error**: An error that should trigger the client's rate-limiting implementation. Currently, the Rust SDK has no special behavior for these errors. Once this RFC is implemented, the Rust SDK will treat the following as throttling errors: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the current standard policy retries on Throttling / Transient errors?
.await; | ||
``` | ||
|
||
Because **ARB** works alongside **SRB** and isn't configurable, no new config fields will be added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the size of token buckets be configured?
|
||
### The Service | ||
|
||
Because `tower::Service`s already define an async `ready` method, we'll use that to block sending requests unless they can acquire a token from the token bucket. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming that in the orchestrator world, we are going to replace this section in the RFC?
RetryMode::Adaptive, | ||
]; | ||
|
||
impl FromStr for RetryMode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct in assuming that this is required so that the users can specify the retry mode in the environment
and AWS Profile
s?
Motivation and Context
#1785
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.