Upgrade localstack version and remove pinned dependencies #107
Merged
Conversation
* Extract the check for the max retry delay into its own method
* Use milliseconds throughout the class and only convert to seconds when returning the final visibility timeout
* Use guard clauses in the check_... methods
* Ensure limits are respected inside the timeout method
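For illustration, a minimal sketch of the shape this refactor describes; the class, method, and constant names are hypothetical, not the actual ones in `eventq`:

```ruby
# Hypothetical sketch: milliseconds-first arithmetic plus guard clauses.
class VisibilityTimeoutCalculator
  MAX_VISIBILITY_TIMEOUT_MS = 43_200_000 # 12 hours, the SQS maximum

  def initialize(base_delay_ms:, max_retry_delay_ms:)
    @base_delay_ms = base_delay_ms
    @max_retry_delay_ms = max_retry_delay_ms
  end

  # Work in milliseconds throughout; convert to seconds only when
  # returning the final visibility timeout.
  def timeout(retry_attempts:)
    timeout_ms = @base_delay_ms * (2**retry_attempts)
    timeout_ms = check_max_retry_delay(timeout_ms)
    timeout_ms = check_visibility_limit(timeout_ms)
    timeout_ms / 1000
  end

  private

  # Guard clause: return early when no capping is needed.
  def check_max_retry_delay(timeout_ms)
    return timeout_ms if timeout_ms <= @max_retry_delay_ms

    @max_retry_delay_ms
  end

  def check_visibility_limit(timeout_ms)
    return timeout_ms if timeout_ms <= MAX_VISIBILITY_TIMEOUT_MS

    MAX_VISIBILITY_TIMEOUT_MS
  end
end
```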
Jitter, or randomness added to timing, is a common technique for preventing calls that failed at the same time from retrying at the same time. It therefore helps avoid the thundering herd problem, where many instances of the same application all retry simultaneously and cause a spike in load on the system. For example, if a service goes down for some reason and comes back up, all queued events hit it at once and may cause it to fail for most of them; without jitter, all of those events then retry at the same time, recreating the problem. Adding some jitter (randomness) to the retry delays spreads the retries out over time. In our implementation, the configuration controls how much jitter is applied relative to the retry delay. For example, if `retry_jitter_ratio` is set to 0, no jitter is applied and all events retry after the usual retry delay. If it is set to 100, the retry delay becomes a random number between 0 and the unjittered retry delay. If it is set to 20, the retry delay becomes a random number between 80% and 100% of the unjittered retry delay.
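A minimal sketch of that calculation, assuming `retry_jitter_ratio` is an integer percentage and delays are in milliseconds; the method name is illustrative:

```ruby
# Hypothetical sketch of the jitter calculation described above:
# ratio 0   => no jitter, the full delay is used
# ratio 100 => a random delay between 0 and the full delay
# ratio 20  => a random delay between 80% and 100% of the full delay
def apply_jitter(delay_ms, retry_jitter_ratio)
  return delay_ms if retry_jitter_ratio.zero?

  min_delay_ms = delay_ms * (100 - retry_jitter_ratio) / 100
  rand(min_delay_ms..delay_ms)
end

apply_jitter(10_000, 0)  # => 10_000
apply_jitter(10_000, 20) # => somewhere between 8_000 and 10_000
```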
guille-sage commented Dec 19, 2023
As we now set `src` as the default folder, we can simply use the container loop from the mounted `src` volume. Also adjust paths in the README and the test script to work from the root, rather than from inside the `script` folder.
In the upgraded environment with Ruby 3, booting up localstack takes too long, so some tests relying on it fail before it becomes available. The solution is to add healthchecks and wait for the containers to be healthy before starting tests. We also add this to the RSpec test action on GitHub to avoid flaky test runs.
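For illustration, a docker-compose sketch of such a healthcheck; the service names, image, and probe endpoint are assumptions, not necessarily what this repo uses:

```yaml
# Hypothetical docker-compose excerpt: block the test container until
# localstack reports healthy instead of starting both at once.
services:
  localstack:
    image: localstack/localstack
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:4566/_localstack/health"]
      interval: 5s
      timeout: 5s
      retries: 10

  tests:
    build: .
    depends_on:
      localstack:
        condition: service_healthy
```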
In commit bdfb0d6 we pinned AWS dependencies, since we weren't sure whether `eventq` was ready for upgrading them, and because newer versions dropped Ruby 2.5 support. As we have since moved to avoid pinned dependencies wherever possible, each environment can choose the best and most up-to-date dependencies available: we keep Ruby 2.5 support while still ensuring the latest AWS libraries are used on Ruby 3. We also removed the fixed localstack version, as version 1.4.0 addressed multiregion support.
Upgrading our localstack version showed how brittle our integration tests are, because newer localstack versions react a bit slower (read: 100+ ms to pick up a message instead of 10+ ms). Since we used `sleep` statements a ton to wait "just the right amount of time" to exceed the TTL, specs became flaky or consistently failed because the worker was picking messages up later than intended.

Looking deeper into it showed that our approach was flawed from the get-go. While we name the variable in specs `retry_attempt_count`, it is actually a `call_count`. Therefore all `sleep`s were off by one iteration. In the past this created the following pattern in specs:

* 0.0s Boot worker
* 0.1s `sleep 1` to let worker pick up message
* 0.2s Worker processes message
* 1.1s Check message has been processed
* 1.1s `sleep 1` to wait for retry
* 1.2s Worker processes retry

Sometimes there was just 10 ms between the assertion and the worker processing the message. Since the worker runs on a thread, this could have failed with older localstack versions as well. With the newer and slower localstack version most specs looked like this:

* 0.0s Boot worker
* 0.1s `sleep 1` to let worker pick up message
* 0.8s Worker processes message
* 1.1s Check message has been processed
* 1.1s `sleep 1` to wait for retry
* 2.0s Worker processes retry
* 2.1s `sleep 1` to wait for 2nd retry
* 3.1s Check for 2nd retry fails as worker has not picked up 2nd retry yet

The flaw here is that `visibility_timeout` does not tell us when the message is picked up again, but for how long it is NOT picked up again. So while we need to wait _at least_ the `visibility_timeout` in specs, we cannot know when the worker actually comes around to pick the message up again. If we were using bigger numbers, like minutes for the `visibility_timeout` rather than seconds, the 0.1-1.9 seconds sometimes needed to pick up a message would not matter, but specs would also slow to a halt.

The second flaw is that we are basically testing libraries here. We should trust AWS to ensure messages are not picked up for the `visibility_timeout` we set. Instead we should only test whether the right `visibility_timeout` numbers are calculated for the libraries to use.

As a result we have adjusted the specs to use a `visibility_timeout` of zero whenever possible to speed up specs. However, we record the calculated number to ensure the right one would normally be passed on to AWS. This allows us to use the exponential backoff strategy without waiting exponentially long.

In order to remove the need for `sleep` in specs, as the operation under test runs asynchronously, we introduced a thread-based queue to wait for results on. This way, as soon as a result is pushed to the queue in another thread, the thread waiting for results immediately picks it up and continues running the spec and its assertions. By adding a decorator around a method that is invoked each time a message was processed, we can simply say `wait_for_message_processed` to ensure the spec waits just long enough to continue (see the sketch after this section). To test `visibility_timeout` we can now use `aws_visibility_timeout_queue`, which gets pushed the `visibility_timeout` calculated for each call, e.g.

```
expect(aws_visibility_timeout_queue.pop).to eq(call: 5, visibility_timeout: 5)
```

Using these techniques we sped up specs considerably, made them much more reliable, and still kept the integration part, as real AWS libraries are used. We can also still integrate with a real AWS account instead of using `localstack`.
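For illustration, a self-contained RSpec sketch of the queue-based waiting technique; the worker thread here is a stand-in for the real eventq worker, and the helper wiring is an assumption, not the actual implementation in this PR:

```ruby
# Hypothetical sketch: Ruby's thread-safe Queue replaces `sleep`.
# Queue#pop blocks the spec thread until the worker pushes a result,
# so the spec resumes the instant the work is done.
require 'rspec'

RSpec.describe 'queue-based waiting' do
  let(:processed_queue)              { Queue.new }
  let(:aws_visibility_timeout_queue) { Queue.new }

  # Blocks until the worker signals that a message was processed.
  def wait_for_message_processed
    processed_queue.pop
  end

  it 'asserts on calculated timeouts instead of waiting for them' do
    # Stand-in worker: records the timeout it calculated for each call,
    # then signals progress. The real worker would call AWS instead.
    Thread.new do
      1.upto(5) do |call|
        aws_visibility_timeout_queue.push(call: call, visibility_timeout: 2**(call - 1))
        processed_queue.push(:processed)
      end
    end

    wait_for_message_processed
    expect(aws_visibility_timeout_queue.pop).to eq(call: 1, visibility_timeout: 1)
  end
end
```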
Since the logger prints messages to STDOUT, all DEBUG, INFO and ERROR logs produced during tests end up in the test output. Because we rely heavily on failing to process a message on purpose, the output was littered with ERROR logs and stack traces as well. By setting the log level to FATAL during spec runs, we ensure that we only see the spec output during test runs.
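For illustration, a minimal sketch of that setting in a spec helper; `EventQ.logger` as the accessor is an assumption about the library's API:

```ruby
# spec_helper.rb (hypothetical): raise the log level so only FATAL
# messages reach STDOUT during spec runs; intentional processing
# failures no longer flood the test output.
require 'logger'

RSpec.configure do |config|
  config.before(:suite) do
    EventQ.logger.level = Logger::FATAL # assumed accessor name
  end
end
```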
mob1970 previously approved these changes Jan 10, 2024
mob1970 approved these changes Jan 10, 2024