Upgrade localstack version and remove pinned dependencies #107

Merged

merged 12 commits into master from localstack-upgrade on Jan 10, 2024

@guille-sage guille-sage commented Dec 19, 2023

Upgrade localstack version and remove pinned dependencies

In commit bdfb0d6 we pinned the AWS dependencies, since we weren't sure whether eventq was ready for upgrading them; the newer versions had also dropped Ruby 2.5 support. We have since moved towards avoiding pinned dependencies wherever possible, so each environment can choose the best and most up-to-date dependencies available. This lets us keep Ruby 2.5 support while still using the latest AWS libraries on Ruby 3.

We also removed the fixed localstack version, as version 1.4.0 added multi-region support.

Refactor AWS integration specs

Upgrading our localstack version showed how brittle our integration tests are, because newer localstack versions react a bit more slowly (100+ ms to pick up a message instead of 10+ ms). Since we leaned heavily on `sleep` statements to wait "just the right amount of time" to exceed the TTL, specs became flaky or failed consistently because the worker was picking up messages later than intended.

Looking deeper into it showed that our approach was flawed from the get-go. While the variable in the specs is named `retry_attempt_count`, it is actually a `call_count`, so all `sleep`s were off by one iteration. In the past this created the following pattern in specs:

  • 0.0s Boot worker
  • 0.1s `sleep 1` to let worker pick up message
  • 0.2s Worker processes message
  • 1.1s Check message has been processed
  • 1.1s `sleep 1` to wait for retry
  • 1.2s Worker processes retry

Sometimes there were just 10 ms between the assertion and the worker processing the message. Since the worker runs on its own thread, this could have failed with older localstack versions as well. With the newer, slower localstack version most specs looked like this:

  • 0.0s Boot worker
  • 0.1s `sleep 1` to let worker pick up message
  • 0.8s Worker processes message
  • 1.1s Check message has been processed
  • 1.1s `sleep 1` to wait for retry
  • 2.0s Worker processes retry
  • 2.1s `sleep 1` to wait for 2nd retry
  • 3.1s Check for 2nd retry fails as worker has not picked up 2nd retry yet

The flaw here is that `visibility_timeout` does not tell us when the message is picked up again, only for how long it is NOT picked up again. So while we need to wait at least for the `visibility_timeout` in specs, we cannot know when the worker actually comes around to pick the message up again. If we used bigger numbers, like minutes rather than seconds for the `visibility_timeout`, the 0.1 - 1.9 seconds sometimes needed to pick up a message would not matter, but that would also grind the specs to a halt.

The second flaw is that we are basically testing libraries here. We should trust AWS to ensure messages are not picked up within the `visibility_timeout` we set. Instead we should only test whether the right `visibility_timeout` values are calculated and handed to the libraries.

As a result we have adjusted the specs to use a `visibility_timeout` of zero whenever possible to speed them up. However, we still record the calculated value to ensure the right one would normally be passed on to AWS. This lets us keep the exponential backoff strategy without waiting exponentially long.

In order to remove the need for `sleep` in specs, since the operation under test runs asynchronously, we introduced a thread-based queue to wait for results on. As soon as a result is pushed onto the queue from another thread, the thread waiting on it immediately picks it up and continues running the spec and its assertions. By adding a decorator around a method that is invoked after each processed message, we can simply call `wait_for_message_processed` to make the spec wait just long enough before continuing.
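
A minimal sketch of the pattern, with illustrative names (the real spec helpers may hook in differently):

```
require "timeout"

# Thread::Queue#pop blocks until another thread pushes a value, so the spec
# resumes the moment the worker has processed a message.
processed_messages = Queue.new

# Hypothetical decorator body: whatever runs after each processed message
# additionally signals the queue the spec thread is waiting on.
after_message_processed = ->(message) { processed_messages.push(message) }

# Spec-side helper: wait just long enough, with a safety timeout instead of a sleep.
def wait_for_message_processed(queue, timeout: 5)
  Timeout.timeout(timeout) { queue.pop }
end
```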

To test `visibility_timeout` we can now use `aws_visibility_timeout_queue`, which gets pushed the `visibility_timeout` calculated for each call, e.g.

```
expect(aws_visibility_timeout_queue.pop).to eq(call: 5, visibility_timeout: 5)
```

Using these techniques we sped up the specs considerably, made them much more reliable, and still kept the integration aspect, since the real AWS libraries are used. We can also still integrate with a real AWS account instead of `localstack`.

Set log level to FATAL during spec runs

Because the logger prints messages to STDOUT, all DEBUG, INFO and ERROR logs produced during tests end up in the test output. Since we intentionally fail to process messages in many specs, the output was also littered with ERROR logs and stack traces. Setting the log level to FATAL during spec runs ensures that we only see the spec output during test runs.
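
As an illustration, something along these lines in the spec helper achieves that; the exact accessor for the library's logger is an assumption, hence the guard:

```
require "logger"

RSpec.configure do |config|
  config.before(:suite) do
    # Raise the threshold so DEBUG/INFO/ERROR noise from intentionally
    # failed messages no longer reaches STDOUT during spec runs.
    spec_logger = Logger.new($stdout)
    spec_logger.level = Logger::FATAL
    # Assumed accessor; skip silently if the library exposes its logger differently.
    EventQ.set_logger(spec_logger) if defined?(EventQ) && EventQ.respond_to?(:set_logger)
  end
end
```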

florianpilz and others added 4 commits December 14, 2023 15:35
* Extract the check for the max retry delay into its own method
* Use milliseconds throughout the class and only convert to seconds when
  returning the final visibility timeout
* Use guard clauses in the check_... methods
* Ensure the limits are adhered to inside the timeout method (see the sketch below)
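
An illustrative shape of that refactor; the method and constant names and the exponential formula are assumptions, not the actual implementation:

```
# Work in milliseconds internally; convert to integer seconds only at the end.
MAX_TIMEOUT_MS = 43_200_000 # SQS caps the visibility timeout at 12 hours

def visibility_timeout(retry_attempts:, base_delay_ms:)
  timeout_ms = base_delay_ms * (2**retry_attempts)
  timeout_ms = check_max_timeout(timeout_ms)
  timeout_ms / 1_000
end

def check_max_timeout(timeout_ms)
  return timeout_ms if timeout_ms <= MAX_TIMEOUT_MS # guard clause

  MAX_TIMEOUT_MS
end
```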
Jitter, i.e. adding randomness to timings, is a common technique for
preventing calls that failed at the same time from all retrying at the
same time. Jitter therefore helps avoid the thundering herd problem,
where many instances of the same application retry simultaneously,
causing a spike in load on the system. For example, if a service was
down for some reason and comes back up, all queued events will hit it at
once and could cause the service to fail for most events. Without
jitter, those events will then all retry at the same time, causing the
same problem again. Adding some jitter (or randomness) to the retry
delays results in the events retrying at different times.

In our implementation, the configuration controls how much jitter is
applied relative to the retry delay. For example, if `retry_jitter_ratio`
is set to 0, no jitter is applied and all events retry after the usual
retry delay. If it is set to 100, the retry delay becomes a random
number between 0 and the un-jittered retry delay. If it is set to 20,
the retry delay becomes a random number between 80% and 100% of the
un-jittered retry delay.
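
A minimal sketch of that calculation; the method name and call site are illustrative, only the `retry_jitter_ratio` semantics come from the description above:

```
# retry_jitter_ratio: 0   = no jitter,
#                     100 = delay is anywhere between 0 and the un-jittered delay,
#                     20  = delay is between 80% and 100% of the un-jittered delay.
def apply_jitter(delay_ms, retry_jitter_ratio)
  return delay_ms if retry_jitter_ratio.zero?

  min_delay_ms = delay_ms * (100 - retry_jitter_ratio) / 100
  rand(min_delay_ms..delay_ms)
end

apply_jitter(10_000, 20) # => somewhere between 8_000 and 10_000 ms
```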
@guille-sage guille-sage force-pushed the localstack-upgrade branch 3 times, most recently from fb9bbc4 to 9f4de5a Compare December 19, 2023 15:11
As we are now setting `src` as the default folder, we can just use the
container loop from the mounted `src` volume. Also adjust the paths in the
README and the test script to work from the root folder rather than from
inside the `script` folder.
In the upgraded environment with Ruby 3, booting up localstack takes too
long, so some tests relying on it fail before it becomes available. The
solution is to add healthchecks and wait for the containers to be healthy
before starting the tests. The same is added to the RSpec test action on
GitHub to avoid flaky test runs.
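
Purely as an illustration of the idea (the actual change lives in docker-compose and the GitHub Actions workflow, and the health endpoint path is an assumption for recent localstack versions), the same wait could be expressed in Ruby before running specs:

```
require "net/http"

# Poll localstack's edge port (default 4566) until its health endpoint responds,
# or give up after `timeout` seconds.
def wait_for_localstack(host: "localhost", port: 4566, timeout: 60)
  deadline = Time.now + timeout
  until Time.now > deadline
    begin
      response = Net::HTTP.get_response(URI("http://#{host}:#{port}/_localstack/health"))
      return true if response.is_a?(Net::HTTPSuccess)
    rescue SystemCallError
      # localstack is not accepting connections yet
    end
    sleep 1
  end
  raise "localstack did not become healthy within #{timeout} seconds"
end
```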
@florianpilz florianpilz self-assigned this Jan 9, 2024
@florianpilz florianpilz force-pushed the localstack-upgrade branch 2 times, most recently from 25b54f8 to b052a31 Compare January 9, 2024 14:40
@florianpilz florianpilz marked this pull request as ready for review January 9, 2024 14:43
guille-sage and others added 3 commits January 10, 2024 09:07
mob1970 previously approved these changes Jan 10, 2024
Base automatically changed from ruby3 to master January 10, 2024 14:36
@florianpilz florianpilz dismissed mob1970’s stale review January 10, 2024 14:36

The base branch was changed.

@florianpilz florianpilz merged commit 37bfbe3 into master Jan 10, 2024
6 checks passed
@florianpilz florianpilz deleted the localstack-upgrade branch January 10, 2024 17:22