RFC: Implementing Adaptive Retry Behavior #2171

Velfi · 2023-01-05T19:28:26Z

Motivation and Context

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

drganjoo · 2023-05-23T18:17:18Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+## Terminology
+
+- **Retry (Behavior / Strategy)**: How the SDK reacts to retryable errors.
+- **Standard Retry Behavior (SRB)**: The default behavior. When a retryable error is received, this behavior calculates an exponential backoff time based on how many retry attempts have already been made and then sleeps for that long before making another attempt.


The standard retry policy that we have implemented already has a token bucket in it, so I am unclear on how the adaptive retry policy differs from the standard one.

Also, there is a TODO in the code for RetryPartition. Should the RFC explain what that is?
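For reference, the backoff calculation described in the quoted SRB definition can be sketched roughly as below. This is a minimal illustration, not the SDK's implementation: `backoff_delay` and its parameters are hypothetical names, the cap `max` is an assumption, and jitter (which real implementations commonly add) is left out.

```rust
use std::time::Duration;

/// Hypothetical helper: the delay to sleep before the next attempt.
/// `attempts_made` is the number of retry attempts already made.
/// The delay doubles with each attempt and is capped at `max` (an assumed limit).
fn backoff_delay(initial: Duration, max: Duration, attempts_made: u32) -> Duration {
    // 2^attempts_made, falling back to u32::MAX if the power overflows.
    let factor = 2u32.checked_pow(attempts_made).unwrap_or(u32::MAX);
    // Saturating multiplication avoids a panic on very large delays.
    initial.saturating_mul(factor).min(max)
}
```

With an initial backoff of one second and a cap of twenty seconds, the delays for successive attempts would be 1s, 2s, 4s, 8s, 16s, and then 20s from there on.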

drganjoo · 2023-05-23T18:26:30Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+    - `503` errors with the `RequestLimitExceeded` or `SlowDown` code.
+    - `509` errors with the `BandwidthLimitExceeded` code.
+    - Errors modeled with the `@retryable(throttling: true)` trait.
+- **Transient Error**: An error that should be retried as soon as is convenient. [Currently][transient-error-classifier], the Rust SDK treats the following as transient errors:


The RFC is clear on how a throttling error differs from a transient error, but it is not clear how the retry behavior differs between the two.

When we say "as soon as is convenient", do we mean that for a transient error we do not check the token bucket for an available token?

Currently, the Transient Error may have a different retry cost. Do we need to update the RFC to include the retry cost?

Can a user define their own RetryClassifier and define their own set of throttling and transient errors?

drganjoo · 2023-05-24T12:48:45Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+let retry_config = RetryConfigBuilder::standard()
+    .max_attempts(3)
+    .initial_backoff(Duration::from_secs(1))
+    .build();


RetryConfigBuilder::standard() also has a method called with_retry_mode. To choose adaptive retries, does one have to do the following:

let retry_config = RetryConfig::standard().with_retry_mode(aws_smithy_types::retry::RetryMode::Adaptive);

drganjoo · 2023-05-24T12:50:06Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+    .load()
+    .await;
+```
+


The current implementation also has ReconnectMode. Could we please add a description of what that is?

drganjoo · 2023-05-24T14:04:44Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+- Is opt-in. The existing retry behavior will remain the default.
+- Will perform similarly to the existing retry behavior when comparing the rate of successful requests.
+- Will improve the experience of customers that use the SDK to send very large numbers of concurrent requests.
+


I can see that the RFC is a draft but I had a few comments so I thought I might as well add them now.

I couldn't find anything on how one turns off retries. Is it going to be:

let retry_config = RetryConfig::disabled();

Which I am assuming is the same as setting:

let retry_config = RetryConfig::standard().with_max_attempts(1);

Are we going to allow the retry policy to be configured separately for specific operations? If yes, what does the API look like for that?

The RFC doesn't mention the RetryClassifier that each operation can have. Can we please add a section on the role it plays in the policy behavior, and on how one specifies a RetryClassifier for a particular operation or for the overall client?

What metrics will the client emit by default? Also:
a) Can a user disable metrics?
b) Can a custom metric be emitted on a particular lifecycle event?

drganjoo · 2023-05-24T14:09:01Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+        - `LimitExceededException`
+        - `PriorRequestNotComplete`
+    - `403` errors with the `RequestThrottled` code.
+    - `502` errors with the `EC2ThrottledException` code.


Just to clarify, is a 502 that does not have the EC2ThrottledException code still going to be treated as a transient error?

Also, are the codes case sensitive?
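To make both questions concrete, here is a rough sketch of a classifier built only from the status and code pairs quoted from the RFC in this review. `ErrorClass` and `classify` are hypothetical names, not SDK types. The sketch assumes exact, case-sensitive matching, which is precisely the point that needs confirming, and it does not decide whether a non-throttling 502 counts as transient.

```rust
/// Hypothetical classification result; not the SDK's actual type.
#[derive(Debug, PartialEq, Eq)]
enum ErrorClass {
    /// Should trigger the client's rate-limiting implementation.
    Throttling,
    /// Anything this sketch does not recognize as throttling.
    Other,
}

/// Classify a response by HTTP status and error code.
/// Matching is exact and case sensitive (an assumption of this sketch).
fn classify(status: u16, code: &str) -> ErrorClass {
    match (status, code) {
        (503, "RequestLimitExceeded") | (503, "SlowDown") => ErrorClass::Throttling,
        (509, "BandwidthLimitExceeded") => ErrorClass::Throttling,
        (403, "RequestThrottled") => ErrorClass::Throttling,
        (502, "EC2ThrottledException") => ErrorClass::Throttling,
        // A 502 with any other code lands here; the RFC would need to say
        // whether that is then handled as a transient error.
        _ => ErrorClass::Other,
    }
}
```

Under this sketch, `classify(502, "ec2throttledexception")` is not treated as throttling, so the answer on case sensitivity changes observable behavior.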

drganjoo · 2023-05-24T14:16:04Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+- [**Retryable Trait**](https://awslabs.github.io/smithy/spec/core.html#retryable-trait): a smithy trait indicating that an error may be retried by the client.
+- **(HTTP / Retry) Request**: An HTTP request. One or more of these will be made during an **Operation Request**.
+- **Operation Request**: A request to a smithy service. Each operation request may contain several **HTTP requests**.
+- **Throttling Error**: An error that should trigger the client's rate-limiting implementation. Currently, the Rust SDK has no special behavior for these errors. Once this RFC is implemented, the Rust SDK will treat the following as throttling errors:


I thought the current standard policy retries on Throttling / Transient errors?

drganjoo · 2023-05-24T14:23:35Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+    .await;
+```
+
+Because **ARB** works alongside **SRB** and isn't configurable, no new config fields will be added.


Can the size of token buckets be configured?
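As an illustration of what a configurable size could mean, the following is a minimal token bucket sketch with a caller-chosen capacity. `TokenBucket`, `try_acquire`, and `release` are hypothetical names for this example only; the sketch ignores time-based refill and concurrency, and makes no claim about how the SDK's bucket is built.

```rust
/// Hypothetical token bucket whose capacity is chosen by the caller.
#[derive(Debug)]
struct TokenBucket {
    capacity: u32,
    available: u32,
}

impl TokenBucket {
    /// Create a full bucket holding `capacity` tokens.
    fn new(capacity: u32) -> Self {
        Self { capacity, available: capacity }
    }

    /// Take `cost` tokens if enough are available.
    /// Returns `true` when the retry may proceed.
    fn try_acquire(&mut self, cost: u32) -> bool {
        if self.available >= cost {
            self.available -= cost;
            true
        } else {
            false
        }
    }

    /// Return tokens to the bucket, never exceeding its capacity.
    fn release(&mut self, amount: u32) {
        self.available = self.available.saturating_add(amount).min(self.capacity);
    }
}
```

If the capacity were exposed in the retry config, a smaller bucket would cut off retries sooner under sustained failure, while a larger one would tolerate longer bursts.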

drganjoo · 2023-05-24T14:30:16Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+
+### The Service
+
+Because `tower::Service`s already define an async `ready` method, we'll use that to block sending requests unless they can acquire a token from the token bucket.


I am assuming that in the orchestrator world, we are going to replace this section in the RFC?

drganjoo · 2023-05-24T14:37:27Z

design/src/rfcs/rfc0031_implementing_adaptive_retry_behavior.md

+    RetryMode::Adaptive,
+];
+
+impl FromStr for RetryMode {


Am I correct in assuming that this is required so that the users can specify the retry mode in the environment and AWS Profiles?
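If that is the intent, the parsing would presumably look something like the sketch below. It uses a local stand-in enum rather than the SDK's `RetryMode`; `ParseRetryModeError` is a hypothetical error type, and trimming whitespace and ignoring ASCII case are assumptions about how values from the environment or a profile file would be treated.

```rust
use std::str::FromStr;

/// Local stand-in for the retry mode enum shown in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RetryMode {
    Standard,
    Adaptive,
}

/// Hypothetical error type carrying the value that failed to parse.
#[derive(Debug, PartialEq, Eq)]
struct ParseRetryModeError(String);

impl FromStr for RetryMode {
    type Err = ParseRetryModeError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Tolerate surrounding whitespace and any ASCII casing (assumed behavior).
        let s = s.trim();
        if s.eq_ignore_ascii_case("standard") {
            Ok(RetryMode::Standard)
        } else if s.eq_ignore_ascii_case("adaptive") {
            Ok(RetryMode::Adaptive)
        } else {
            Err(ParseRetryModeError(s.to_owned()))
        }
    }
}
```

A string read from the environment or from a profile entry could then be converted with `value.parse::<RetryMode>()`, with unknown values surfacing as an error instead of silently falling back to a default.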

adaptive retry RFC initial commit

def7bd6

Velfi changed the title ~~adaptive retry RFC initial commit~~ RFC: Implementing Adaptive Retry Behavior Jan 5, 2023

drganjoo reviewed May 24, 2023


Velfi closed this Apr 1, 2024

landonxjames deleted the rfc/implementing-adaptive-retry-behavior branch January 13, 2025 17:16
