-
Notifications
You must be signed in to change notification settings - Fork 730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update and stabilize defrag tests #1762
base: unstable
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #1762 +/- ##
============================================
- Coverage 71.13% 70.97% -0.17%
============================================
Files 123 123
Lines 65531 65552 +21
============================================
- Hits 46616 46526 -90
- Misses 18915 19026 +111 |
22d2ce5
to
fe36094
Compare
Signed-off-by: Jim Brunner <[email protected]>
fe36094
to
129afed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! I didn't follow what all the tests are doing, so I'll just have to trust that they're testing what they're supposed to test. I just added some minor comments about naming and such.
We can merge it soon and see if it solves the problems. There seems to be some temporary and commented-out code though that you may want to clean up.
# make logging a bit more readable | ||
proc to_mb {bytes} { | ||
return [format "%6.2f MB" [expr $bytes / 1024.0 / 1024.0]] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put these functions on top-level, i.e. outside of the run_solo
block.
Possibly we can put some of these functions in some utility library, if you find a good place, but that's not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind-of consider the run_solo
block to be the top level block for these tests. In general, I like to limit things to the smallest possible scope. So unless we're moving this to some general place, I think it's located ok. But I guess I don't feel strongly. If we move this though, maybe all of the other functions should move also? I'd rather just leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limiting things to a small scope is usually a good idea. Less nesting is usually also a good idea. :)
The nesting gives an illusion that these functions are within a namespace or something, but they aren't. The run_solo
block doesn't define a local scope. The functions defined here are still global. The only effect of defining them inside run_solo
is that their definition is deferred until run_solo
does eval
on it's block. I don't think this is a good feature of TCL.
If we ever want to split this into two separate run_solo
blocks or something, or skip run_solo
then it's not a benefit that they're defined inside one of them.
# used to create keys which increase in size by 2x | ||
# key 0 will have 2 fields (0,1), key 1 will have 4 (0,1,2,3), key 2 will have 8 ... | ||
# incby allows incrementing by 1 or 2 (but no other values will work properly) | ||
proc next_exp_kf {key field {incby 1}} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand what this function returns and these comments don't seem to explain it in a way that I can understand. I can see it returns a list of two elements.
Can we name this function in a more generic way? It doesn't seem to be be specific to keys and fields, but rather to a sequence of numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old code, we were testing 1 big hash. However we have a different data structure for a small hash (which was untested).
The new code creates hashes (and sets, and sorted sets, and lists) in exponentially increasing size. Fragments them all, and then defrags.
As stated in the comment,
Hash 0 will contain 2 fields/values. Hash 1 will contain 4, Hash 2 will contain 8... Same with sets, lists, etc.
This function just helps mathematically by finding the next key/field numbered pair. It's a little mathematically simpler than passing in "n", but could be done that way also. (k
would be the highest value where 2^(k+1)
is <=n
and f
would be n - 2^(k+1)
)
It generates a sequence like this
(0 0), (0 1),
(1 0), (1 1), (1 2), (1 3),
(2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),
...
I'm open to naming suggestions, but it seems specific enough that I couldn't think of a good simple name. I figured it was better to just try to comment it and it would be obvious by usage.
Mostly this just helps make the loops simpler we don't need a nested loop: for key =... { for val = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your reply here is better than the code comment. I can understand this. :)
As stated in the comment,
Hash 0 will contain 2 fields/values. Hash 1 will contain 4, Hash 2 will contain 8... Same with sets, lists, etc.
Yes, but the comment doesn't say that the function returns just one field-value pair as a list.
Your reply here says it: "This function just helps mathematically by finding the next key/field numbered pair". That's a missing piece.
This function just helps mathematically by finding the next key/field numbered pair. It's a little mathematically simpler than passing in "n", but could be done that way also. (k would be the highest power of 2^k that's less than n and f would be n - 2^k)
It generates a sequence like this
(0 0), (0 1),
(1 0), (1 1), (1 2), (1 3),
(2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7),
This is good. I still don't fully understand what you do pass in to the function. You just say what you don't pass in: simpler than passing in "n". 😄
So, do I understand correctly:
key
is the row, the sequence number, the first element in the returned two-element listfield
is column, the index within a sequence, the second element in the returned two-element list
Is this right? No, wait, it increments key sometimes. Honestly I don't understand what this function really does return. It's just very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eureka, it's just a single sequence! And you pass in an element in the sequence and it gives you the next one!
Now I got it. I needed to understood that it's just one sequence, not one per key.
I think this example can help:
proc next_exp_kf {key field {incby 1}} { | |
# | |
# Given an pair (key, field), returns the next pair in the following | |
# sequence: | |
# | |
# (0 0), (0 1), | |
# (1 0), (1 1), (1 2), (1 3), | |
# (2 0), (2 1), (2 2), (2 3), (2 4), (2 5), (2 6), (2 7), | |
proc next_exp_kf {key field {incby 1}} { |
@@ -125,14 +125,17 @@ proc assert_refcount_morethan {key ref} { | |||
# Wait for the specified condition to be true, with the specified number of | |||
# max retries and delay between retries. Otherwise the 'elsescript' is | |||
# executed. | |||
proc wait_for_condition {maxtries delay e _else_ elsescript} { | |||
proc wait_for_condition {maxtries delay e _else_ elsescript {_debug_ ""} {debugscript ""}} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The word debug
describes what we use it for rather than what it does. It's evaluated between each retry and it's possible to do anything here, not only debug stuff. I think it should be called something that better describes what it does, like on-retry
?
Or do you have a good explanation of why it should be called debug
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, obviously we've never had a use case for this before. And debugging is the only use case I can think of for this.
We're waiting for something to become true. Usually we just wait - but if we're debugging, we might want to collect some information while we're waiting.
It could be called before_retry
(or maybe before_delay_before_retry
! Doh!) Or after_false
?
But none of those names sound much better than debug
to me. If you feel strongly about one, let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't feel strongly. I just wanted a good motivation for picking debug
. Debug is fine.
But please add some sentence in comment above to document it.
# $rd read | ||
# } | ||
# } | ||
# --- BEGIN TEMPORARY CODE --- | ||
log_frag "test start (pubsub)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will you finish this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I created: #1774
I redesigned all of the defrag structure. I rewrote all of the tests. If I took responsibility for everything I found that doesn't work properly, I'd never stop. I'll let some pubsub expert deep-dive to determine why it's not defragging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, makes sense. Thanks for opening the issue.
Shall we link to this issue in the comment above then? We usually don't refer to issues in code comments, but this is a TODO that we want to track. The issue can provide more information and discussion. If someone fixes it, we want to close the issue.
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Jim Brunner <[email protected]>
# $rd read | ||
# } | ||
# } | ||
# --- BEGIN TEMPORARY CODE --- | ||
log_frag "test start (pubsub)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I created: #1774
I redesigned all of the defrag structure. I rewrote all of the tests. If I took responsibility for everything I found that doesn't work properly, I'd never stop. I'll let some pubsub expert deep-dive to determine why it's not defragging.
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Signed-off-by: Jim Brunner <[email protected]>
Addressing: #1746
A number of tests related to defrag have had stability problems.
One reason for stability issues is that the tests are run sequentially on the same server, and it takes jemalloc some time to release freed pages. This is especially noticed immediately after a flushall. Over a period of 10 seconds, it is observable that the "fragmentation bytes" can decrease by several MB. Another reason is that there's no standardization between tests. For each test, people have been independently hacking/tweaking the success criteria, without addressing underlying issues.
This update revamps all of the defrag tests:
log_frag
is now used for debug loggingperform_defrag_test
ensures that each test is written and executed in a uniform fashion. Limits are imposed to ensure that the defrag results are consistent/reproducible. The intent is to eliminate failures do to various tweaks to values in individual tests.defrag edge case
has been removed. This test attempted to create N pages with EXACTLY equal fragmentation in an attempt to confuse the defrag logic. It's unlikely that this test was performing correctly, and had questionable value.It's noted that pubsub doesn't appear to be defragging correctly. The old test was based on deletion of strings (only) which doesn't actually reflect what happens when a pubsub channel is removed. The test has been reduced to only check that pubsub is not damaged during defrag - but doesn't test for defrag efficacy. This isn't likely a significant issue as it would be unlikely to create many thousands of pubsub channels and then have associated fragmentation issues. #1774