
update and stabilize defrag tests #1762

Open · wants to merge 23 commits into base: unstable

Conversation

JimB123
Contributor

@JimB123 JimB123 commented Feb 21, 2025

Addressing: #1746
A number of tests related to defrag have had stability problems.

One reason for the stability issues is that the tests run sequentially on the same server, and it takes jemalloc some time to release freed pages. This is especially noticeable immediately after a flushall: over a period of 10 seconds, the "fragmentation bytes" metric can be observed to decrease by several MB. Another reason is that there's no standardization between tests. For each test, people have been independently hacking/tweaking the success criteria without addressing the underlying issues.
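The settling behavior described above can be sketched as a polling loop. This is a hedged Python illustration only; the function name, parameters, and thresholds are hypothetical, not from the test suite:

```python
import time

def wait_for_frag_to_settle(read_frag_bytes, window=1.0, timeout=10.0,
                            tolerance=512 * 1024):
    """Poll a fragmentation-bytes reading until it stops dropping by more
    than `tolerance` per `window`. Illustrates why back-to-back defrag
    tests on one server need settling time after a flushall, while
    jemalloc is still returning freed pages."""
    deadline = time.time() + timeout
    prev = read_frag_bytes()
    while time.time() < deadline:
        time.sleep(window)
        cur = read_frag_bytes()
        if prev - cur <= tolerance:
            return cur  # reading has stabilized
        prev = cur
    return prev  # timed out; return the last reading
```

Starting each test on a fresh server sidesteps this entirely, which is the approach the PR takes.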

This update revamps all of the defrag tests:

  • A fresh server is started for each test. Running each test in isolation improves stability.
  • A uniform function, log_frag, is now used for debug logging.
  • A uniform function, perform_defrag_test, ensures that each test is written and executed in a uniform fashion. Limits are imposed to ensure that the defrag results are consistent/reproducible. The intent is to eliminate failures due to ad-hoc tweaks to values in individual tests.
  • Latency is tested much more strictly for most tests, reflecting the recent improvements to defrag latency.
  • The test "defrag edge case" has been removed. This test attempted to create N pages with EXACTLY equal fragmentation in an attempt to confuse the defrag logic. It's unlikely that this test was performing correctly, and it had questionable value.
  • Tests for hash/list/set/zset/stream have been separated and standardized. It's unlikely that the old test was performing properly, as none of the actual data structures were fragmented!

It's noted that pubsub doesn't appear to be defragging correctly. The old test was based on deletion of strings only, which doesn't actually reflect what happens when a pubsub channel is removed. The test has been reduced to only check that pubsub is not damaged during defrag; it no longer tests defrag efficacy. This isn't likely a significant issue, as it would be unusual to create many thousands of pubsub channels and then have associated fragmentation issues. #1774

@JimB123 JimB123 marked this pull request as draft February 21, 2025 21:48

codecov bot commented Feb 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.97%. Comparing base (f85c933) to head (129afed).
Report is 9 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1762      +/-   ##
============================================
- Coverage     71.13%   70.97%   -0.17%     
============================================
  Files           123      123              
  Lines         65531    65552      +21     
============================================
- Hits          46616    46526      -90     
- Misses        18915    19026     +111     

see 22 files with indirect coverage changes

@JimB123 JimB123 force-pushed the defrag-tests branch 6 times, most recently from 22d2ce5 to fe36094 Compare February 24, 2025 23:06
@JimB123 JimB123 marked this pull request as ready for review February 25, 2025 01:25
@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Feb 25, 2025
Contributor

@zuiderkwast zuiderkwast left a comment


Very nice! I didn't follow what all the tests are doing, so I'll just have to trust that they're testing what they're supposed to test. I just added some minor comments about naming and such.

We can merge it soon and see if it solves the problems. There seems to be some temporary and commented-out code though that you may want to clean up.

Comment on lines +43 to +46
# make logging a bit more readable
proc to_mb {bytes} {
    return [format "%6.2f MB" [expr {$bytes / 1024.0 / 1024.0}]]
}
Contributor


Let's put these functions on top-level, i.e. outside of the run_solo block.

Possibly we can put some of these functions in some utility library, if you find a good place, but that's not necessary.

Contributor Author


I kind of consider the run_solo block to be the top-level block for these tests. In general, I like to limit things to the smallest possible scope. So unless we're moving this to some general place, I think it's located OK. But I guess I don't feel strongly. If we move this, though, maybe all of the other functions should move also? I'd rather just leave it.

Contributor


Limiting things to a small scope is usually a good idea. Less nesting is usually also a good idea. :)

The nesting gives an illusion that these functions are within a namespace or something, but they aren't. The run_solo block doesn't define a local scope. The functions defined here are still global. The only effect of defining them inside run_solo is that their definition is deferred until run_solo does eval on its block. I don't think this is a good feature of Tcl.

If we ever want to split this into two separate run_solo blocks or something, or skip run_solo then it's not a benefit that they're defined inside one of them.

# used to create keys which increase in size by 2x
# key 0 will have 2 fields (0,1), key 1 will have 4 (0,1,2,3), key 2 will have 8 ...
# incby allows incrementing by 1 or 2 (but no other values will work properly)
proc next_exp_kf {key field {incby 1}} {
Contributor


I don't really understand what this function returns and these comments don't seem to explain it in a way that I can understand. I can see it returns a list of two elements.

Can we name this function in a more generic way? It doesn't seem to be specific to keys and fields, but rather to a sequence of numbers.

Contributor Author

@JimB123 JimB123 Feb 25, 2025


In the old code, we were testing 1 big hash. However we have a different data structure for a small hash (which was untested).

The new code creates hashes (and sets, and sorted sets, and lists) in exponentially increasing size. Fragments them all, and then defrags.

As stated in the comment,
Hash 0 will contain 2 fields/values. Hash 1 will contain 4, Hash 2 will contain 8... Same with sets, lists, etc.

This function just helps mathematically by finding the next key/field numbered pair. It's a little mathematically simpler than passing in "n", but could be done that way also. (k would be the highest value where 2^(k+1) is <=n and f would be n - 2^(k+1))

It generates a sequence like this
(0 0), (0 1),
(1 0), (1 1), (1 2), (1 3),
(2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),
...

I'm open to naming suggestions, but it seems specific enough that I couldn't think of a good simple name. I figured it was better to just try to comment it and it would be obvious by usage.

Mostly this just helps make the loops simpler; we don't need a nested loop: for key = ... { for field = ...
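As a sketch of the sequence logic described above, here is a Python rendering of the Tcl helper's documented behavior (my reading of the comment, not the actual code from the PR):

```python
def next_exp_kf(key, field, incby=1):
    """Return the (key, field) pair that follows the given one in the
    sequence (0 0), (0 1), (1 0)...(1 3), (2 0)...(2 7), ...
    Key k holds 2^(k+1) fields, so the pair advances within a key until
    the field index wraps, then moves on to the next (larger) key."""
    field += incby
    if field >= 2 ** (key + 1):
        key, field = key + 1, 0
    return key, field
```

With incby=2 the walk visits every other field before wrapping, which matches the comment that only increments of 1 or 2 behave properly.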

Contributor


Your reply here is better than the code comment. I can understand this. :)

As stated in the comment,
Hash 0 will contain 2 fields/values. Hash 1 will contain 4, Hash 2 will contain 8... Same with sets, lists, etc.

Yes, but the comment doesn't say that the function returns just one field-value pair as a list.

Your reply here says it: "This function just helps mathematically by finding the next key/field numbered pair". That's a missing piece.

This function just helps mathematically by finding the next key/field numbered pair. It's a little mathematically simpler than passing in "n", but could be done that way also. (k would be the highest power of 2^k that's less than n and f would be n - 2^k)

It generates a sequence like this
(0 0), (0 1),
(1 0), (1 1), (1 2), (1 3),
(2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),

This is good. I still don't fully understand what you do pass in to the function. You just say what you don't pass in: simpler than passing in "n". 😄

So, do I understand correctly:

  • key is the row, the sequence number, the first element in the returned two-element list
  • field is column, the index within a sequence, the second element in the returned two-element list

Is this right? No, wait, it increments key sometimes. Honestly I don't understand what this function really does return. It's just very confusing.

Contributor


Eureka, it's just a single sequence! And you pass in an element in the sequence and it gives you the next one!

Now I got it. I needed to understand that it's just one sequence, not one per key.

I think this example can help:

Suggested change
proc next_exp_kf {key field {incby 1}} {
#
# Given a pair (key, field), returns the next pair in the following
# sequence:
#
# (0 0), (0 1),
# (1 0), (1 1), (1 2), (1 3),
# (2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),
proc next_exp_kf {key field {incby 1}} {

@@ -125,14 +125,17 @@ proc assert_refcount_morethan {key ref} {
# Wait for the specified condition to be true, with the specified number of
# max retries and delay between retries. Otherwise the 'elsescript' is
# executed.
proc wait_for_condition {maxtries delay e _else_ elsescript} {
proc wait_for_condition {maxtries delay e _else_ elsescript {_debug_ ""} {debugscript ""}} {
Contributor


The word debug describes what we use it for rather than what it does. It's evaluated between each retry and it's possible to do anything here, not only debug stuff. I think it should be called something that better describes what it does, like on-retry?

Or do you have a good explanation of why it should be called debug?

Contributor Author


Well, obviously we've never had a use case for this before. And debugging is the only use case I can think of for this.

We're waiting for something to become true. Usually we just wait - but if we're debugging, we might want to collect some information while we're waiting.

It could be called before_retry (or maybe before_delay_before_retry ! Doh!) Or after_false?

But none of those names sound much better than debug to me. If you feel strongly about one, let me know.

Contributor


No, I don't feel strongly. I just wanted a good motivation for picking debug. Debug is fine.

But please add a sentence in the comment above to document it.
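The retry-with-debug-hook shape being discussed can be sketched like this in Python (the real helper is Tcl, and the parameter names here are illustrative only):

```python
import time

def wait_for_condition(condition, max_tries=10, delay=0.1,
                       on_fail=None, debug=None):
    """Poll `condition` until it returns true. Between retries, run the
    optional `debug` hook (e.g. to log fragmentation stats while
    waiting). If the condition never becomes true, call `on_fail`."""
    for _ in range(max_tries):
        if condition():
            return True
        if debug is not None:
            debug()      # only use case so far: collect debug info
        time.sleep(delay)
    if on_fail is not None:
        on_fail()        # the 'elsescript' equivalent
    return False
```

The hook runs after each false result and before the delay, which is where "collect some information while we're waiting" naturally fits.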

Comment on lines +510 to +514
# $rd read
# }
# }
# --- BEGIN TEMPORARY CODE ---
log_frag "test start (pubsub)"
Contributor


Will you finish this?

Contributor Author


No. I created: #1774
I redesigned all of the defrag structure. I rewrote all of the tests. If I took responsibility for everything I found that doesn't work properly, I'd never stop. I'll let some pubsub expert deep-dive to determine why it's not defragging.

Contributor


Ok, makes sense. Thanks for opening the issue.

Shall we link to this issue in the comment above then? We usually don't refer to issues in code comments, but this is a TODO that we want to track. The issue can provide more information and discussion. If someone fixes it, we want to close the issue.

JimB123 and others added 3 commits February 25, 2025 13:03
Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
