
update and stabilize defrag tests #1762

Open · wants to merge 23 commits into base: unstable

Conversation

JimB123
Contributor

@JimB123 JimB123 commented Feb 21, 2025

Addressing: #1746
A number of tests related to defrag have had stability problems.

One reason for the stability issues is that the tests run sequentially on the same server, and it takes jemalloc some time to release freed pages. This is especially noticeable immediately after a flushall: over a period of 10 seconds, the "fragmentation bytes" metric can be observed to decrease by several MB. Another reason is that there's no standardization between tests. For each test, people have been independently hacking/tweaking the success criteria without addressing the underlying issues.
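The settling behavior described above can be sketched as a polling loop. This is a hedged Python illustration only; the function name, parameters, and thresholds are hypothetical, not from the test suite:

```python
import time

def wait_for_frag_to_settle(read_frag_bytes, window=1.0, timeout=10.0,
                            tolerance=512 * 1024):
    """Poll a fragmentation-bytes reading until it stops dropping by more
    than `tolerance` per `window`. Illustrates why back-to-back defrag
    tests on one server need settling time after a flushall, while
    jemalloc is still returning freed pages."""
    deadline = time.time() + timeout
    prev = read_frag_bytes()
    while time.time() < deadline:
        time.sleep(window)
        cur = read_frag_bytes()
        if prev - cur <= tolerance:
            return cur  # reading has stabilized
        prev = cur
    return prev  # timed out; return the last reading
```

Starting each test on a fresh server sidesteps this entirely, which is the approach the PR takes.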

This update revamps all of the defrag tests:

  • A fresh server is started for each test. Running each test in isolation improves stability.
  • A uniform function, log_frag, is now used for debug logging.
  • A uniform function, perform_defrag_test, ensures that each test is written and executed in a uniform fashion. Limits are imposed to ensure that the defrag results are consistent/reproducible. The intent is to eliminate failures due to ad-hoc tweaks to values in individual tests.
  • Latency is tested much more strictly for most tests, reflecting the recent improvements to defrag latency.
  • The test "defrag edge case" has been removed. This test attempted to create N pages with EXACTLY equal fragmentation in an attempt to confuse the defrag logic. It's unlikely that this test was performing correctly, and it had questionable value.
  • Tests for hash/list/set/zset/stream have been separated and standardized. It's unlikely that the old test was performing properly, as none of the actual data structures were fragmented!

It's noted that pubsub doesn't appear to be defragging correctly. The old test was based on deletion of strings only, which doesn't actually reflect what happens when a pubsub channel is removed. The test has been reduced to only check that pubsub is not damaged during defrag; it no longer tests defrag efficacy. This isn't likely a significant issue, as it would be unusual to create many thousands of pubsub channels and then have associated fragmentation issues. #1774

@JimB123 JimB123 marked this pull request as draft February 21, 2025 21:48

codecov bot commented Feb 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.97%. Comparing base (f85c933) to head (129afed).
Report is 9 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1762      +/-   ##
============================================
- Coverage     71.13%   70.97%   -0.17%     
============================================
  Files           123      123              
  Lines         65531    65552      +21     
============================================
- Hits          46616    46526      -90     
- Misses        18915    19026     +111     

see 22 files with indirect coverage changes

@JimB123 JimB123 force-pushed the defrag-tests branch 6 times, most recently from 22d2ce5 to fe36094 Compare February 24, 2025 23:06
@JimB123 JimB123 marked this pull request as ready for review February 25, 2025 01:25
@madolson madolson added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Feb 25, 2025
Contributor

@zuiderkwast zuiderkwast left a comment


Very nice! I didn't follow what all the tests are doing, so I'll just have to trust that they're testing what they're supposed to test. I just added some minor comments about naming and such.

We can merge it soon and see if it solves the problems. There seems to be some temporary and commented-out code though that you may want to clean up.

Comment on lines +43 to +46
# make logging a bit more readable
proc to_mb {bytes} {
    return [format "%6.2f MB" [expr {$bytes / 1024.0 / 1024.0}]]
}
Contributor


Let's put these functions on top-level, i.e. outside of the run_solo block.

Possibly we can put some of these functions in some utility library, if you find a good place, but that's not necessary.

Contributor Author


I kind of consider the run_solo block to be the top-level block for these tests. In general, I like to limit things to the smallest possible scope. So unless we're moving this to some general place, I think it's located OK. But I guess I don't feel strongly. If we move this, though, maybe all of the other functions should move also? I'd rather just leave it.

Contributor


Limiting things to a small scope is usually a good idea. Less nesting is usually also a good idea. :)

The nesting gives an illusion that these functions are within a namespace or something, but they aren't. The run_solo block doesn't define a local scope. The functions defined here are still global. The only effect of defining them inside run_solo is that their definition is deferred until run_solo does eval on its block. I don't think this is a good feature of Tcl.

If we ever want to split this into two separate run_solo blocks or something, or skip run_solo then it's not a benefit that they're defined inside one of them.

# used to create keys which increase in size by 2x
# key 0 will have 2 fields (0,1), key 1 will have 4 (0,1,2,3), key 2 will have 8 ...
# incby allows incrementing by 1 or 2 (but no other values will work properly)
proc next_exp_kf {key field {incby 1}} {
Contributor


I don't really understand what this function returns and these comments don't seem to explain it in a way that I can understand. I can see it returns a list of two elements.

Can we name this function in a more generic way? It doesn't seem to be specific to keys and fields, but rather to a sequence of numbers.

Contributor Author

@JimB123 JimB123 Feb 25, 2025


In the old code, we were testing 1 big hash. However we have a different data structure for a small hash (which was untested).

The new code creates hashes (and sets, and sorted sets, and lists) in exponentially increasing size. Fragments them all, and then defrags.

As stated in the comment,
Hash 0 will contain 2 fields/values. Hash 1 will contain 4, Hash 2 will contain 8... Same with sets, lists, etc.

This function just helps mathematically by finding the next key/field numbered pair. It's a little mathematically simpler than passing in "n", but could be done that way also. (k would be the highest value where 2^(k+1) is <=n and f would be n - 2^(k+1))

It generates a sequence like this
(0 0), (0 1),
(1 0), (1 1), (1 2), (1 3),
(2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),
...

I'm open to naming suggestions, but it seems specific enough that I couldn't think of a good simple name. I figured it was better to just try to comment it and it would be obvious by usage.

Mostly this just helps make the loops simpler; we don't need a nested loop: for key = ... { for field = ...
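As a sketch of the sequence logic described above, here is a Python rendering of the Tcl helper's documented behavior (my reading of the comment, not the actual code from the PR):

```python
def next_exp_kf(key, field, incby=1):
    """Return the (key, field) pair that follows the given one in the
    sequence (0 0), (0 1), (1 0)...(1 3), (2 0)...(2 7), ...
    Key k holds 2^(k+1) fields, so the pair advances within a key until
    the field index wraps, then moves on to the next (larger) key."""
    field += incby
    if field >= 2 ** (key + 1):
        key, field = key + 1, 0
    return key, field
```

With incby=2 the walk visits every other field before wrapping, which matches the comment that only increments of 1 or 2 behave properly.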

Contributor


Your reply here is better than the code comment. I can understand this. :)

As stated in the comment,
Hash 0 will contain 2 fields/values. Hash 1 will contain 4, Hash 2 will contain 8... Same with sets, lists, etc.

Yes, but the comment doesn't say that the function returns just one field-value pair as a list.

Your reply here says it: "This function just helps mathematically by finding the next key/field numbered pair". That's a missing piece.

This function just helps mathematically by finding the next key/field numbered pair. It's a little mathematically simpler than passing in "n", but could be done that way also. (k would be the highest power of 2^k that's less than n and f would be n - 2^k)

It generates a sequence like this
(0 0), (0 1),
(1 0), (1 1), (1 2), (1 3),
(2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),

This is good. I still don't fully understand what you do pass in to the function. You just say what you don't pass in: simpler than passing in "n". 😄

So, do I understand correctly:

  • key is the row, the sequence number, the first element in the returned two-element list
  • field is column, the index within a sequence, the second element in the returned two-element list

Is this right? No, wait, it increments key sometimes. Honestly I don't understand what this function really does return. It's just very confusing.

Contributor


Eureka, it's just a single sequence! And you pass in an element in the sequence and it gives you the next one!

Now I got it. I needed to understand that it's just one sequence, not one per key.

I think this example can help:

Suggested change
proc next_exp_kf {key field {incby 1}} {
#
# Given a pair (key, field), returns the next pair in the following
# sequence:
#
# (0 0), (0 1),
# (1 0), (1 1), (1 2), (1 3),
# (2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),
proc next_exp_kf {key field {incby 1}} {

@@ -125,14 +125,17 @@ proc assert_refcount_morethan {key ref} {
# Wait for the specified condition to be true, with the specified number of
# max retries and delay between retries. Otherwise the 'elsescript' is
# executed.
proc wait_for_condition {maxtries delay e _else_ elsescript} {
proc wait_for_condition {maxtries delay e _else_ elsescript {_debug_ ""} {debugscript ""}} {
Contributor


The word debug describes what we use it for rather than what it does. It's evaluated between each retry and it's possible to do anything here, not only debug stuff. I think it should be called something that better describes what it does, like on-retry?

Or do you have a good explanation of why it should be called debug?

Contributor Author


Well, obviously we've never had a use case for this before. And debugging is the only use case I can think of for this.

We're waiting for something to become true. Usually we just wait - but if we're debugging, we might want to collect some information while we're waiting.

It could be called before_retry (or maybe before_delay_before_retry ! Doh!) Or after_false?

But none of those names sound much better than debug to me. If you feel strongly about one, let me know.

Contributor


No, I don't feel strongly. I just wanted a good motivation for picking debug. Debug is fine.

But please add a sentence in the comment above to document it.
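The retry-with-debug-hook shape being discussed can be sketched like this in Python (the real helper is Tcl, and the parameter names here are illustrative only):

```python
import time

def wait_for_condition(condition, max_tries=10, delay=0.1,
                       on_fail=None, debug=None):
    """Poll `condition` until it returns true. Between retries, run the
    optional `debug` hook (e.g. to log fragmentation stats while
    waiting). If the condition never becomes true, call `on_fail`."""
    for _ in range(max_tries):
        if condition():
            return True
        if debug is not None:
            debug()      # only use case so far: collect debug info
        time.sleep(delay)
    if on_fail is not None:
        on_fail()        # the 'elsescript' equivalent
    return False
```

The hook runs after each false result and before the delay, which is where "collect some information while we're waiting" naturally fits.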

Comment on lines +510 to +514
# $rd read
# }
# }
# --- BEGIN TEMPORARY CODE ---
log_frag "test start (pubsub)"
Contributor


Will you finish this?

Contributor Author


No. I created: #1774
I redesigned all of the defrag structure. I rewrote all of the tests. If I took responsibility for everything I found that doesn't work properly, I'd never stop. I'll let some pubsub expert deep-dive to determine why it's not defragging.

Contributor


Ok, makes sense. Thanks for opening the issue.

Shall we link to this issue in the comment above then? We usually don't refer to issues in code comments, but this is a TODO that we want to track. The issue can provide more information and discussion. If someone fixes it, we want to close the issue.

JimB123 and others added 3 commits February 25, 2025 13:03
Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
