new scheduler from RFC 5 #746
Conversation
44850fa to 0046aef
@@ -18,6 +18,7 @@ categories = ["concurrency"]
[dependencies]
num_cpus = "1.2"
lazy_static = "1"
crossbeam-channel = "0.4.2"
It's slightly disappointing to add a dependency just for usually-disabled logging, but I guess it doesn't add much since we're already invested in crossbeam. I also see that we can't just use `std::sync::mpsc` for that, as it's `!Sync`.
My original intention was to collect the events in per-thread queues and then fire them off somewhere central only every once in a while, but I decided to "just try" crossbeam-channel and see how it performs. It turned out to perform pretty well and logging had no observable impact on performance. Later I found that in some use cases it performed less well and does have an observable impact. Maybe I should swap this out for `Mutex<Channel>` and we can revisit later?
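For illustration, here is a minimal sketch of the `Mutex`-wrapped channel alternative floated above, using `std::sync::mpsc`. The `Logger` and `Event` names are made up for the example; rayon's real logger is structured differently.

```rust
use std::sync::{mpsc, Mutex};

// Hypothetical event type for the example.
#[derive(Debug)]
enum Event {
    ThreadIdle(usize),
    JobPosted { worker: usize },
}

// `mpsc::Sender` is !Sync, so sharing it between worker threads
// requires a Mutex; crossbeam-channel's Sender is Sync and avoids this.
struct Logger {
    tx: Mutex<mpsc::Sender<Event>>,
}

impl Logger {
    fn log(&self, event: Event) {
        // Lock only long enough to enqueue the event.
        let _ = self.tx.lock().unwrap().send(event);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let logger = Logger { tx: Mutex::new(tx) };
    logger.log(Event::ThreadIdle(0));
    logger.log(Event::JobPosted { worker: 1 });
    drop(logger); // close the channel so the receiver loop ends
    for event in rx {
        println!("{:?}", event);
    }
}
```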
@nikomatsakis The `tracing` crate might help here.
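A rough idea of what using `tracing` for this could look like; the crate versions and event fields below are purely illustrative, and the default subscriber filter typically hides trace/debug-level events unless enabled.

```rust
// Cargo.toml (assumed versions):
//   tracing = "0.1"
//   tracing-subscriber = "0.2"

fn main() {
    // Install a simple formatting subscriber. Disabled levels are cheap,
    // which is the attraction for usually-off scheduler logging.
    tracing_subscriber::fmt::init();

    let worker = 3usize;
    tracing::trace!(worker, "thread went idle");
    tracing::debug!(worker, jobs = 2, "jobs posted");
    tracing::info!(worker, "worker started");
}
```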
/// reaches zero.
///
/// NB. We use a `CountLatch` here because it has no lifetimes and is
/// meant for async use, but the count never gets higher than one.
This seems rather kludgy, but I guess it's not anything critical.
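For readers unfamiliar with the latch in question, here is a simplified, stand-alone sketch of a counting latch in the spirit of the comment above; it is an illustration only, not rayon's actual `CountLatch`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The count starts at 1 and `set()` decrements it, so in the usage
// described above it behaves like a plain one-shot latch.
struct CountLatch {
    counter: AtomicUsize,
}

impl CountLatch {
    fn new() -> Self {
        CountLatch { counter: AtomicUsize::new(1) }
    }

    /// Decrement the count; returns true if this call brought it to zero.
    fn set(&self) -> bool {
        self.counter.fetch_sub(1, Ordering::SeqCst) == 1
    }

    /// Has the latch been fully set?
    fn probe(&self) -> bool {
        self.counter.load(Ordering::SeqCst) == 0
    }
}

fn main() {
    let latch = CountLatch::new();
    assert!(!latch.probe());
    assert!(latch.set());
    assert!(latch.probe());
}
```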
});
self.job_completed_latch.set();
self.job_completed_latch
    .set_and_tickle_one(&self.registry, self.owner_thread_index);
I think this is another area where we're potentially invalidating `&self`, when the last spawn sets and its scope races to return. It's not an immediate problem, because these `self.registry` and `self.owner_thread_index` reads are completed as arguments before the actual set. But generally, it's still akin to rust-lang/rust#55005.
Is the concern here the (more or less unavoidable, without refactoring to use `*`) UB that rust-lang/rust#55005 describes, or the possibility of accidentally reordering things? It does seem like a subtle invariant that isn't clear from the code.
(Around the latter, I'm wondering if we can refactor in some way to make that less likely... but I don't know how yet.)
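To make the invariant discussed above concrete, here is a self-contained sketch of the pattern of reading everything out of `self` before the set. The types are stand-ins, not rayon's real `Registry` or latch.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in types for illustration only.
struct Registry;
impl Registry {
    fn notify_worker(&self, worker: usize) {
        println!("tickling worker {}", worker);
    }
}

struct Latch(AtomicBool);
impl Latch {
    fn set_and_tickle_one(&self, registry: &Registry, worker: usize) {
        // Once this store lands, the scope owning the job may return,
        // so nothing reachable only through the job may be used afterwards.
        self.0.store(true, Ordering::SeqCst);
        registry.notify_worker(worker);
    }
}

struct Job {
    registry: Arc<Registry>,
    owner_thread_index: usize,
    job_completed_latch: Latch,
}

impl Job {
    fn complete(&self) {
        // Read everything we need out of `self` *before* the set; after
        // the latch is set, `&self` may be invalidated by a racing return.
        let registry = Arc::clone(&self.registry);
        let owner = self.owner_thread_index;
        self.job_completed_latch.set_and_tickle_one(&registry, owner);
    }
}

fn main() {
    let job = Job {
        registry: Arc::new(Registry),
        owner_thread_index: 0,
        job_completed_latch: Latch(AtomicBool::new(false)),
    };
    job.complete();
}
```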
rayon-core/src/sleep/counters.rs (Outdated)
/// - The bits 0x4444 are the number of **sleeping threads**.
///
/// See the struct `Counters` below.
value: AtomicU64,
As we noted in chat, `AtomicU64` is only available starting in Rust 1.34 (which is close to a year old for compat), but even then it's not necessarily available on all architectures. We should probably stick to `AtomicUsize`, though we need to consider what that means for 32-bit.
AIUI the jobs/sleepy counters don't really care about true values, just comparing relative change, right? Will it be a problem if that has only an 8-bit rollover on 32-bit targets? Then the idle/sleeping threads seem like hard limits on our thread count, which we should enforce somewhere.
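For context, here is a hedged sketch of the kind of `AtomicUsize` packing being discussed; the field order and widths are illustrative, not rayon's actual layout, and it shows why each field shrinks to 8 bits on 32-bit targets.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

#[cfg(target_pointer_width = "64")]
const FIELD_BITS: usize = 16;
#[cfg(target_pointer_width = "32")]
const FIELD_BITS: usize = 8;

const FIELD_MASK: usize = (1 << FIELD_BITS) - 1;

/// Four fields packed into one word (illustrative layout):
/// [ jobs counter | sleepy counter | idle threads | sleeping threads ]
struct Counters {
    value: AtomicUsize,
}

impl Counters {
    fn sleeping_threads(&self) -> usize {
        self.value.load(Ordering::SeqCst) & FIELD_MASK
    }

    fn idle_threads(&self) -> usize {
        (self.value.load(Ordering::SeqCst) >> FIELD_BITS) & FIELD_MASK
    }

    fn add_sleeping_thread(&self) {
        // This stays inside its field only if the thread count is capped
        // below FIELD_MASK, hence the "hard limit" concern above.
        self.value.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let c = Counters { value: AtomicUsize::new(0) };
    c.add_sleeping_thread();
    println!("sleeping={} idle={}", c.sleeping_threads(), c.idle_threads());
}
```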
OK, I started writing up documentation again in SLEEP.md. I expect to update the RFC with those docs eventually, but I figured we would want "living docs" in the repo regardless. In the process, I'm pretty sure I found one bug having to do with our rollover algorithms. I'm also wondering whether the rollover protection that I added is "good enough" -- it's clearly possible for there to be missed events, which could lead to deadlock. I think before I judged them "sufficiently unlikely", but I'm not sure I believe that anymore.

A few notes:

If a thread is announcing that it is sleepy, and it observes that incrementing the SEC would trigger rollover, it currently sets both the JEC and SEC to 0. But this seems wrong: the idea is that when a thread is in the sleepy state, it will ensure that JEC != SEC (and when another thread posts work, that thread will set JEC = SEC). But in this case, we've just gone sleepy, but we left the counters equal, so the other thread would assume it has nothing to do. I think we should set JEC = 0, SEC = 1 upon rollover instead.

But beyond that, I'm a bit confused by what I was thinking with the rollover scheme. In the code as written, when a thread is getting sleepy it increments the SEC and keeps the value (let's call it SEC_A). Then, when it wants to sleep, it checks that JEC < SEC_A (otherwise it goes back to idle). But we could certainly have these events:
I see that in my original comment on the RFC, I described that part of the "go to sleep" protocol would be to check whether SEC < SEC_A, which would thus allow us to detect conditions like the above. But I don't seem to have implemented that check. Similarly, we could save the value of the JEC when we announce we are sleepy, and check whether JEC == JEC_A (i.e., has it changed at all), although that seems slightly riskier (see below). However, both these schemes only give us "probabilistic" protection against rollover:
All of this is making me think this scheme isn't great and we should use a different/simpler one. For example, we could add an extra

If we did that, I was wondering if we truly need the SEC/JEC -- maybe we can get by with just a "sleepy bit" that we set to true? And then, when new work is posted, we clear the bit and increment the JEC -- and if there will be rollover, we first increment some atomic epochal counter. Now, to check if new work was posted, we read the JEC/epochal counters and check whether they have changed. (I have to think a bit about the ordering of these two reads, since they can't be done atomically, but I think it would work out ok.)

Or, maybe, to handle rollover, we have the epoch counter guarded by a special mutex. When threads go to sleep (presumably rare, and it's ok for it to be a bit costly), they also acquire this mutex before checking the JEC/SEC or whatever we're using. This way we can make adjustments to the epochal counter atomic.

Anyway, I have to go now. I'll tinker with this (or I'm happy to have others do so), but I'm fairly convinced we need a simpler and more "obviously correct" scheme here (esp. since the existing one is, well, just not correct, obviously or otherwise).
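A rough, illustrative sketch of the "sleepy bit" idea from the comment above; it ignores the read-ordering and epoch/rollover questions raised there, and the names are made up for the example.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

// Bare "sleepy bit" plus a jobs event counter (JEC); no epoch handling.
struct SleepState {
    sleepy: AtomicBool,
    jobs_event_counter: AtomicUsize,
}

impl SleepState {
    /// A worker announces it is getting sleepy and records the JEC.
    fn announce_sleepy(&self) -> usize {
        self.sleepy.store(true, Ordering::SeqCst);
        self.jobs_event_counter.load(Ordering::SeqCst)
    }

    /// New work posted: clear the sleepy bit and bump the JEC so the
    /// sleepy thread can tell something happened.
    fn new_work(&self) {
        if self.sleepy.swap(false, Ordering::SeqCst) {
            self.jobs_event_counter.fetch_add(1, Ordering::SeqCst);
        }
    }

    /// Just before blocking: only sleep if the JEC is unchanged.
    fn should_sleep(&self, jec_when_sleepy: usize) -> bool {
        self.jobs_event_counter.load(Ordering::SeqCst) == jec_when_sleepy
    }
}

fn main() {
    let state = SleepState {
        sleepy: AtomicBool::new(false),
        jobs_event_counter: AtomicUsize::new(0),
    };
    let jec = state.announce_sleepy();
    state.new_work();
    assert!(!state.should_sleep(jec)); // work arrived; go back to searching
}
```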
Thought about this in the shower. Here is what I think we should do.
This setup seems
So I implemented the scheme I proposed above of "ignore rollover", but I hit two roadblocks
To handle the "external thread", we simply convert a call to

To handle the cross-registry case, though, the waiting is being done by a busy-loop on some other worker thread. That worker thread might even go to sleep while waiting. So we'd have to propagate this logic pretty deep, and I'm not thrilled about that.

But I realize now that maybe there's a simpler answer. After a thread increments the "sleeping threads" counter -- or perhaps only if it is the last sleeping thread -- it can simply read the "injector" one last time and wake back up if it finds anything in there. After all, the danger is that
But if we just check for injected work after we increment the "sleeping threads" counter, then we would for sure catch this case. And the key point is that we must see the injected work, because it is that step which increments the JEC back to JEC_A. (Further, since we have incremented the "sleeping threads" counter, if there is work actively being injected as we fall to sleep, the injecting thread will either (a) first increment the JEC, in which case the sleepy thread will see the work in the final check, or (b) see that all threads are asleep.)

Now if only I could see the problem in action so that I could feel confident it was fixed. Also, I am thinking I should go find my book on different ways to implement mutexes and see if there are any obvious analogues to this; as I said, I feel like I'm re-creating the wheel here. Ah well.
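Sketched out, the "final injector check" idea looks roughly like this. The injector here is a stand-in mutexed deque rather than a lock-free queue, which sidesteps the atomic-ordering subtleties discussed in the next comment.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

struct SleepState {
    sleeping_threads: AtomicUsize,
    injector: Mutex<VecDeque<&'static str>>,
}

impl SleepState {
    /// Returns true if the caller may actually block (park).
    fn try_to_sleep(&self) -> bool {
        // Step 1: announce that we are going to sleep.
        self.sleeping_threads.fetch_add(1, Ordering::SeqCst);

        // Step 2: one last look at the injector. If a job slipped in
        // while we were falling asleep, back out rather than risk it
        // being missed while every worker sleeps.
        if !self.injector.lock().unwrap().is_empty() {
            self.sleeping_threads.fetch_sub(1, Ordering::SeqCst);
            return false;
        }
        true
    }
}

fn main() {
    let state = SleepState {
        sleeping_threads: AtomicUsize::new(0),
        injector: Mutex::new(VecDeque::new()),
    };
    state.injector.lock().unwrap().push_back("job");
    assert!(!state.try_to_sleep()); // injected work keeps us awake
}
```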
OK, so, I've been thinking about this and I concluded that the right approach here is:

First, the scheme I proposed is almost right but actually doesn't prevent deadlocks. The problem is that, so long as the person posting jobs just does a SeqCst read of the JEC, there is no requirement that the sleeping thread will see the job in the injection queue. This is because SeqCst reads don't synchronize with later writes, even though they must be strongly ordered. In order to guarantee that, the person posting the job must either make a write to the JEC, or else perhaps we can use fences.

This is the final straw that convinces me I am way overthinking this. I think we should pare back to the simpler scheme where each time new work appears, we increment the JEC, always. When threads get sleepy, they can read the atomic counters. When they go to sleep, they can (a) check that the JEC hasn't changed and (b) increment the number of sleepy threads. Presuming this succeeds, due to the potential for rollover, they must still do a check of the injection queue, but this way they are guaranteed to see any new work, because they will have observed the incremented JEC.

We should measure that scheme and try to land it. Once we do that, we can tinker with other schemes. I feel like it would be great to have the majority of this work landed and then just to be focusing on this one minor thing. In particular, I've been wanting to rethink some of the core things here. I can make those changes in the next day or two, I imagine.
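A minimal sketch of that pared-back scheme, with illustrative field names and the two counters kept separate rather than packed into one word.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct Counters {
    jobs_event_counter: AtomicUsize,
    sleeping_threads: AtomicUsize,
}

impl Counters {
    /// Hot path: called for every newly posted job.
    /// Returns true if a sleeping worker should be woken.
    fn new_job(&self) -> bool {
        self.jobs_event_counter.fetch_add(1, Ordering::SeqCst);
        self.sleeping_threads.load(Ordering::SeqCst) > 0
    }

    fn read_jec(&self) -> usize {
        self.jobs_event_counter.load(Ordering::SeqCst)
    }

    /// Returns true if the caller may block; false means the JEC moved
    /// since it got sleepy, so it should go look for work again. A final
    /// check of the injection queue still follows in the real protocol,
    /// to cover rollover.
    fn try_to_sleep(&self, jec_when_sleepy: usize) -> bool {
        if self.jobs_event_counter.load(Ordering::SeqCst) != jec_when_sleepy {
            return false;
        }
        self.sleeping_threads.fetch_add(1, Ordering::SeqCst);
        true
    }
}

fn main() {
    let c = Counters {
        jobs_event_counter: AtomicUsize::new(0),
        sleeping_threads: AtomicUsize::new(0),
    };
    let jec = c.read_jec();
    c.new_job();
    assert!(!c.try_to_sleep(jec)); // the JEC changed; keep searching
}
```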
I'll need more time to consider and provide any useful feedback on your proposed counter changes, but at a high level, I'm very much in favor of simplifying the counter scheme for now. If you think your new way is viable, please go ahead.
@cuviper OK -- so I pushed a series of commits that switch over to (and document) a very simple scheme wherein we increment a counter each time there is new work (and similarly when a thread goes to sleep). I also include an argument for why the scheme is deadlock free. Finally, I converted to a 32-bit (or 64-bit) counter using

I think the history is now getting fairly messy, so it'd probably be good for me to do some squashing. Regarding the RFC, I intend to edit it at some point to defer some of the grungy implementation details to this PR.

I was thinking it'd be good for us to talk also about what runtime measurements we want to be doing. For one thing, I should run some of our local demos and replot the results versus master, I suppose.
I ran the demo benchmarks on Fedora 32, Ryzen 7 3800X (8 cores, 16 threads):
I added

The biggest smoking gun BY FAR is
I think this is proving painful. BTW, what happens if the JEC increment overflows into the thread counters?
Nevermind, somehow I was thinking of it like a single overflow bit, but of course wrapping add with the large
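To illustrate why such an overflow is benign when the JEC sits in the top bits of the packed word: the layout, shift, and constant names below are assumptions for the example, not rayon's actual definitions.

```rust
// Illustrative layout: the JEC occupies bits 48..64 of a packed u64,
// so adding ONE_JEC with wrapping arithmetic can only carry out of the
// top of the word, never down into the lower (thread-count) fields.
const JEC_SHIFT: u32 = 48;
const THREADS_MASK: u64 = (1u64 << JEC_SHIFT) - 1;
const ONE_JEC: u64 = 1u64 << JEC_SHIFT;

fn increment_jec(value: u64) -> u64 {
    value.wrapping_add(ONE_JEC)
}

fn main() {
    // JEC at its maximum, with live thread counts in the low bits.
    let value = (0xFFFFu64 << JEC_SHIFT) | 0x0003_0002;
    let next = increment_jec(value);
    assert_eq!(next >> JEC_SHIFT, 0);                      // JEC wrapped to zero
    assert_eq!(next & THREADS_MASK, value & THREADS_MASK); // thread counts intact
    println!("ok");
}
```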
@cuviper well -- eliminating that "per job increment" was indeed the focus of a lot of the work I was doing. I just gave up on it because I wanted something simple, but there are definitely options here. One of them is to use a fence.
Maybe worth exploring them. I couldn't remember how much of a perf hit the increment was.
I think that would help, because it wouldn't force the same cache ping pongs. Mostly, I think we want the busy hot path to avoid writing any shared state.
The worst of my results was
@cuviper yeah that's no good. I will say I feel somewhat validated. I remember working very hard to avoid this increment and I was trying to decide if that was time well spent. =) Let me try to pop back to my last scheme.
OK, I got as far as documenting the fence-based algorithm, but I didn't implement it yet.
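For reference, the classic fence-based pattern alluded to above looks roughly like this: a generic "publish job; fence; check sleepers" / "register sleeper; fence; re-check queue" sketch, not necessarily the exact algorithm documented in SLEEP.md. The injector is again a stand-in mutexed deque.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Mutex;

struct Scheduler {
    injector: Mutex<VecDeque<&'static str>>,
    sleeping_threads: AtomicUsize,
}

impl Scheduler {
    fn post_job(&self, job: &'static str) {
        self.injector.lock().unwrap().push_back(job); // publish the job
        fence(Ordering::SeqCst);
        if self.sleeping_threads.load(Ordering::Relaxed) > 0 {
            // wake a sleeping worker here (condvar notify, etc.)
        }
    }

    /// Returns true if it is safe to actually block.
    fn try_to_sleep(&self) -> bool {
        self.sleeping_threads.fetch_add(1, Ordering::Relaxed);
        fence(Ordering::SeqCst);
        // The paired SeqCst fences guarantee that either the poster sees
        // our sleeping count, or we see the posted job here.
        if !self.injector.lock().unwrap().is_empty() {
            self.sleeping_threads.fetch_sub(1, Ordering::Relaxed);
            return false; // a job was posted concurrently; stay awake
        }
        true
    }
}

fn main() {
    let s = Scheduler {
        injector: Mutex::new(VecDeque::new()),
        sleeping_threads: AtomicUsize::new(0),
    };
    s.post_job("job");
    assert!(!s.try_to_sleep());
}
```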
6127534 to faefb77
@cuviper OK, new scheme is implemented.
faefb77 to 15d1fbf
One minor point is that I'm pretty sure all the other "counter" operations can be
New results:
c46b2f7 to 2cd05a0
This makes the "idle state" private to the sleep module.
(it is, after all, valid)
This avoids the need to increment the JEC on every new job while still, I believe, avoiding deadlock.
2cd05a0 to 96ba9ef
Here are some more measurements to see if we're fixing what we intended. We have two demos that report CPU time,
I've built with
So
New benchmark results, on the 3800X:
Here are results of the running benchmarks (i.e.
These are raw numbers, not normalized / averaged / etc., but it looks reasonable to me.
bors r+
Implementation of the scheduler described in rayon-rs/rfcs#5 -- modulo the fact that the RFC is mildly out of date. There is a walkthrough video available.
To Do List:
- AtomicU64