Upgrade localstack version and remove pinned dependencies #107

Merged

merged 12 commits into master from localstack-upgrade on Jan 10, 2024

@guille-sage guille-sage commented Dec 19, 2023

Upgrade localstack version and remove pinned dependencies

In commit bdfb0d6 we pinned the AWS dependencies, since we weren't sure whether eventq was ready for upgrading them; the newer versions had also dropped Ruby 2.5 support. We have since moved towards avoiding pinned dependencies wherever possible, so each environment can choose the best and most up-to-date dependencies available. This lets us keep Ruby 2.5 support while still using the latest AWS libraries on Ruby 3.

We also removed the fixed localstack version, as version 1.4.0 added multi-region support.

Refactor AWS integration specs

Upgrading our localstack version showed how brittle our integration tests are, because newer localstack versions react a bit more slowly (100+ ms to pick up a message instead of 10+ ms). Since we leaned heavily on `sleep` statements to wait "just the right amount of time" to exceed the TTL, specs became flaky or failed consistently because the worker was picking up messages later than intended.

Looking deeper into it showed that our approach was flawed from the get-go. While the variable in the specs is named `retry_attempt_count`, it is actually a `call_count`, so all `sleep`s were off by one iteration. In the past this created the following pattern in specs:

  • 0.0s Boot worker
  • 0.1s `sleep 1` to let worker pick up message
  • 0.2s Worker processes message
  • 1.1s Check message has been processed
  • 1.1s `sleep 1` to wait for retry
  • 1.2s Worker processes retry

Sometimes there were just 10 ms between the assertion and the worker processing the message. Since the worker runs on its own thread, this could have failed with older localstack versions as well. With the newer, slower localstack version most specs looked like this:

  • 0.0s Boot worker
  • 0.1s `sleep 1` to let worker pick up message
  • 0.8s Worker processes message
  • 1.1s Check message has been processed
  • 1.1s `sleep 1` to wait for retry
  • 2.0s Worker processes retry
  • 2.1s `sleep 1` to wait for 2nd retry
  • 3.1s Check for 2nd retry fails as worker has not picked up 2nd retry yet

The flaw here is that `visibility_timeout` does not tell us when the message is picked up again, only for how long it is NOT picked up again. So while we need to wait at least for the `visibility_timeout` in specs, we cannot know when the worker actually comes around to pick the message up again. If we used bigger numbers, like minutes rather than seconds for the `visibility_timeout`, the 0.1 - 1.9 seconds sometimes needed to pick up a message would not matter, but that would also grind the specs to a halt.

The second flaw is that we are basically testing libraries here. We should trust AWS to ensure messages are not picked up within the `visibility_timeout` we set. Instead we should only test whether the right `visibility_timeout` values are calculated and handed to the libraries.

As a result we have adjusted the specs to use a `visibility_timeout` of zero whenever possible to speed them up. However, we still record the calculated value to ensure the right one would normally be passed on to AWS. This lets us keep the exponential backoff strategy without waiting exponentially long.

In order to remove the need for `sleep` in specs, since the operation under test runs asynchronously, we introduced a thread-based queue to wait for results on. As soon as a result is pushed onto the queue from another thread, the thread waiting on it immediately picks it up and continues running the spec and its assertions. By adding a decorator around a method that is invoked after each processed message, we can simply call `wait_for_message_processed` to make the spec wait just long enough before continuing.
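
A minimal sketch of the pattern, with illustrative names (the real spec helpers may hook in differently):

```
require "timeout"

# Thread::Queue#pop blocks until another thread pushes a value, so the spec
# resumes the moment the worker has processed a message.
processed_messages = Queue.new

# Hypothetical decorator body: whatever runs after each processed message
# additionally signals the queue the spec thread is waiting on.
after_message_processed = ->(message) { processed_messages.push(message) }

# Spec-side helper: wait just long enough, with a safety timeout instead of a sleep.
def wait_for_message_processed(queue, timeout: 5)
  Timeout.timeout(timeout) { queue.pop }
end
```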

To test `visibility_timeout` we can now use `aws_visibility_timeout_queue`, which gets pushed the `visibility_timeout` calculated for each call, e.g.

```
expect(aws_visibility_timeout_queue.pop).to eq(call: 5, visibility_timeout: 5)
```

Using these techniques we sped up the specs considerably, made them much more reliable, and still kept the integration aspect, since the real AWS libraries are used. We can also still integrate with a real AWS account instead of `localstack`.

Set log level to FATAL during spec runs

Because the logger prints messages to STDOUT, all DEBUG, INFO and ERROR logs produced during tests end up in the test output. Since we intentionally fail to process messages in many specs, the output was also littered with ERROR logs and stack traces. Setting the log level to FATAL during spec runs ensures that we only see the spec output during test runs.
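
As an illustration, something along these lines in the spec helper achieves that; the exact accessor for the library's logger is an assumption, hence the guard:

```
require "logger"

RSpec.configure do |config|
  config.before(:suite) do
    # Raise the threshold so DEBUG/INFO/ERROR noise from intentionally
    # failed messages no longer reaches STDOUT during spec runs.
    spec_logger = Logger.new($stdout)
    spec_logger.level = Logger::FATAL
    # Assumed accessor; skip silently if the library exposes its logger differently.
    EventQ.set_logger(spec_logger) if defined?(EventQ) && EventQ.respond_to?(:set_logger)
  end
end
```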

florianpilz and others added 4 commits December 14, 2023 15:35
* Extract the check for the max retry delay into its own method
* Use milliseconds throughout the class and only convert to seconds when
  returning the final visibility timeout
* Use guard clauses in the check_... methods
* Ensure the limits are adhered to inside the timeout method (see the sketch below)
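
An illustrative shape of that refactor; the method and constant names and the exponential formula are assumptions, not the actual implementation:

```
# Work in milliseconds internally; convert to integer seconds only at the end.
MAX_TIMEOUT_MS = 43_200_000 # SQS caps the visibility timeout at 12 hours

def visibility_timeout(retry_attempts:, base_delay_ms:)
  timeout_ms = base_delay_ms * (2**retry_attempts)
  timeout_ms = check_max_timeout(timeout_ms)
  timeout_ms / 1_000
end

def check_max_timeout(timeout_ms)
  return timeout_ms if timeout_ms <= MAX_TIMEOUT_MS # guard clause

  MAX_TIMEOUT_MS
end
```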
Jitter, i.e. adding randomness to timings, is a common technique for
preventing calls that failed at the same time from all retrying at the
same time. Jitter therefore helps avoid the thundering herd problem,
where many instances of the same application retry simultaneously,
causing a spike in load on the system. For example, if a service was
down for some reason and comes back up, all queued events will hit it at
once and could cause the service to fail for most events. Without
jitter, those events will then all retry at the same time, causing the
same problem again. Adding some jitter (or randomness) to the retry
delays results in the events retrying at different times.

In our implementation, the configuration controls how much jitter is
applied relative to the retry delay. For example, if `retry_jitter_ratio`
is set to 0, no jitter is applied and all events retry after the usual
retry delay. If it is set to 100, the retry delay becomes a random
number between 0 and the un-jittered retry delay. If it is set to 20,
the retry delay becomes a random number between 80% and 100% of the
un-jittered retry delay.
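
A minimal sketch of that calculation; the method name and call site are illustrative, only the `retry_jitter_ratio` semantics come from the description above:

```
# retry_jitter_ratio: 0   = no jitter,
#                     100 = delay is anywhere between 0 and the un-jittered delay,
#                     20  = delay is between 80% and 100% of the un-jittered delay.
def apply_jitter(delay_ms, retry_jitter_ratio)
  return delay_ms if retry_jitter_ratio.zero?

  min_delay_ms = delay_ms * (100 - retry_jitter_ratio) / 100
  rand(min_delay_ms..delay_ms)
end

apply_jitter(10_000, 20) # => somewhere between 8_000 and 10_000 ms
```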
@guille-sage guille-sage force-pushed the localstack-upgrade branch 3 times, most recently from fb9bbc4 to 9f4de5a Compare December 19, 2023 15:11
As we are now setting `src` as the default folder, we can just use the
container loop from the mounted `src` volume. Also adjust the paths in the
README and the test script to work from the root folder rather than from
inside the `script` folder.
In the upgraded environment with Ruby 3, booting up localstack takes too
long, so some tests relying on it fail before it becomes available. The
solution is to add healthchecks and wait for the containers to be healthy
before starting the tests. The same is added to the RSpec test action on
GitHub to avoid flaky test runs.
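
Purely as an illustration of the idea (the actual change lives in docker-compose and the GitHub Actions workflow, and the health endpoint path is an assumption for recent localstack versions), the same wait could be expressed in Ruby before running specs:

```
require "net/http"

# Poll localstack's edge port (default 4566) until its health endpoint responds,
# or give up after `timeout` seconds.
def wait_for_localstack(host: "localhost", port: 4566, timeout: 60)
  deadline = Time.now + timeout
  until Time.now > deadline
    begin
      response = Net::HTTP.get_response(URI("http://#{host}:#{port}/_localstack/health"))
      return true if response.is_a?(Net::HTTPSuccess)
    rescue SystemCallError
      # localstack is not accepting connections yet
    end
    sleep 1
  end
  raise "localstack did not become healthy within #{timeout} seconds"
end
```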
@florianpilz florianpilz self-assigned this Jan 9, 2024
@florianpilz florianpilz force-pushed the localstack-upgrade branch 2 times, most recently from 25b54f8 to b052a31 Compare January 9, 2024 14:40
@florianpilz florianpilz marked this pull request as ready for review January 9, 2024 14:43
guille-sage and others added 3 commits January 10, 2024 09:07
mob1970 previously approved these changes Jan 10, 2024
Base automatically changed from ruby3 to master January 10, 2024 14:36
@florianpilz florianpilz dismissed mob1970’s stale review January 10, 2024 14:36

The base branch was changed.

@florianpilz florianpilz merged commit 37bfbe3 into master Jan 10, 2024
6 checks passed
@florianpilz florianpilz deleted the localstack-upgrade branch January 10, 2024 17:22