Livelock with backoff on Nqueens 11 #43
See: https://www.usenix.org/legacy/publications/compsystems/1990/sum_ruane.pdf
After modeling the Backoff mechanism in a formal language and running formal verification by model checking, I believe I've narrowed down the deadlock.
In weave/weave/channels/event_notifiers.nim, lines 113 to 120 at commit b4c5a37, the parent worker loads its child worker's state into a register and exits if it's Busy or ShouldWakeup. But while this happens, the child can send its intent to sleep (lines 88 to 95 in b4c5a37) and then follow up with actually sleeping (lines 97 to 106 in b4c5a37). The parent thread then sends its signal to shut down, but it's never received. A minimal sketch of this interleaving is shown below.
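To make the race concrete, here is a minimal sketch of the suspected lost wakeup. This is not Weave's actual Nim code: the C types, names and condition-variable plumbing are assumptions that mirror the structure of the TLA+ model further down.

/* Minimal sketch (assumed structure, not Weave's code) of the suspected
 * lost-wakeup interleaving, using C11 atomics and a pthread condition
 * variable. State names mirror the TLA+ model below. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

enum state { Busy, IntendToSleep, Parked, ShouldWakeup };

typedef struct {
    _Atomic enum state consumerState;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool signaled;                      /* guards against spurious wakeups */
} EventNotifier;

/* Consumer, prepare phase: announce the intent to sleep. */
void intendToSleep(EventNotifier *en) {
    atomic_store(&en->consumerState, IntendToSleep);
}

/* Consumer, commit phase: park unless a producer intervened in between. */
void waitForWakeup(EventNotifier *en) {
    enum state expected = IntendToSleep;
    if (atomic_compare_exchange_strong(&en->consumerState, &expected, Parked)) {
        pthread_mutex_lock(&en->lock);
        while (atomic_load(&en->consumerState) == Parked) {
            while (!en->signaled)
                pthread_cond_wait(&en->cond, &en->lock);
            en->signaled = false;
        }
        pthread_mutex_unlock(&en->lock);
    }
    atomic_store(&en->consumerState, Busy);
}

/* Parent/producer: wake the consumer if it is (about to go) asleep. */
void notify(EventNotifier *en) {
    enum state s = atomic_load(&en->consumerState);    /* (1) possibly stale */
    if (s == Busy || s == ShouldWakeup)
        return;                                         /* (2) early exit    */
    atomic_store(&en->consumerState, ShouldWakeup);
    pthread_mutex_lock(&en->lock);
    en->signaled = true;
    pthread_cond_signal(&en->cond);
    pthread_mutex_unlock(&en->lock);
}

/* Suspected interleaving:
 *   parent: notify() reads consumerState == Busy at (1), returns at (2)
 *   child : intendToSleep(); sends "Waiting" to its parent; waitForWakeup()
 *           parks on the condition variable
 *   result: the wakeup is lost and the child sleeps until some other
 *           producer happens to call notify() again.                       */

The hazard is the window between the stale read at (1) and the early return at (2); the prepare/commit protocol (intendToSleep followed by wait) is supposed to close it, which is exactly what the model below checks.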
Formal specification in TLA+:

------------------- MODULE event_notifiers -----------------------
(*
Formal specification of the event_notifiers datastructure.
It allows a single consumer to be put to sleep and woken up
by multiple producers, so that the consumer can consume the incoming messages (steal requests).
It combines a prepare phase "intendToSleep" with a commit phase "wait".
In between, the consumer sends a message to its parent to say that it is going to sleep.
The commit phase is aborted when any producer signals an incoming message.
There should be no deadlock, i.e. no incoming message being signaled while the consumer stays
sleeping, as in the runtime the main thread may be the only one awake and would not be able
to awaken its children in that case.
*)
EXTENDS Integers, TLC, Sequences, FiniteSets
CONSTANTS NumThreads, ConsumerTID
ASSUME NumThreads > 1
ASSUME ConsumerTID > 0
ASSUME ConsumerTID < NumThreads
MaxID == NumThreads-1
ParentTID == ConsumerTID \div 2
producers == (0..MaxID) \ {ConsumerTID, ParentTID}
(* PlusCal options (-termination) *)
(* --algorithm event_notifier
variables
consumerState = "Busy";
signaled = FALSE; \* Simulate a condition variable
msgToParent = "None"; \* Simulate a message to parent. I.e. an opportunity for thread interleaving
macro intendToSleep() begin
consumerState := "IntendToSleep"
end macro
\* Everything in a macro happens atomically
macro atomicCompareExchange(result, current, expected, newVal) begin
if current = expected then
current := newVal;
result := TRUE;
else
result := FALSE;
end if;
end macro;
\* The consumer waits until it is signaled to wake up
procedure wait()
variables casSuccess = FALSE;
begin
WCAS1: atomicCompareExchange(casSuccess, consumerState, "IntendToSleep", "Parked");
if casSuccess then
W2: while consumerState = "Parked" do
\* The while loop protects against spurious wakeups
WCV3: await signaled;
signaled := FALSE;
end while;
end if;
W4: assert consumerState \in {"Parked", "ShouldWakeup"};
W5: consumerState := "Busy";
W8: return;
end procedure;
\* Notify the potentially waiting consumer that it should wake up
procedure notify()
variables localConsumerState = "N/A"; \* local view of the consumer state
begin
N1: localConsumerState := consumerState;
N2: if localConsumerState \in {"Busy", "ShouldWakeup"} then
N3: return;
end if;
N4: consumerState := "ShouldWakeup";
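\* Keep signaling until the consumer has reset its state to Busy (label W5),
\* covering the window where it has not parked on the condition variable yet.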
N5: while TRUE do
NSIG6: signaled := TRUE;
N7: if consumerState /= "Busy" then
N8: skip;
else
N9: return;
end if;
end while;
end procedure;
procedure mayRequestWork()
begin MaySteal:
either
\* Sometimes you have enough work
NoSteal: skip;
or
\* Sometimes you don't and you steal
Steal: call notify();
end either;
ReqRET: return;
end procedure;
procedure mayShareWork()
\* A parent can also share work with
\* a child that sent it "Waiting"
begin MayShare:
either
\* sometimes the parent doesn't have work
NoWork: skip;
or
\* Sometimes it has some
Share0: if msgToParent = "Waiting" then
Share1: call notify(); \* wakeup the child
Share2: msgToParent := "None"; \* dequeue the child steal request
end if;
end either;
ShareRET: return;
end procedure;
\* Not fair because they might never steal
process producer \in producers
begin Coworkers:
call mayRequestWork();
end process;
\* A parent will always run at least the termination notification
fair process parent = ParentTID
\* The order of work sharing and work stealing is arbitrary
begin ParentWork:
either
PMayRW0: call mayRequestWork();
PMaySW0: call mayShareWork();
or
PMaySW1: call mayShareWork();
PMayRW1: call mayRequestWork();
end either;
\* But it will for sure tell the consumer to terminate at one point
Terminate: call notify();
end process;
process consumer = ConsumerTID
begin ConsumerWork:
either
\* if we have work we work on it
FoundWork: skip;
or
\* we signal our intent to sleep, tell our parent and then sleep
Sleeping0: intendToSleep();
Sleeping1: msgToParent := "Waiting";
Sleeping2: call wait();
end either;
end process;
end algorithm; *)

Error trace (keep an eye on consumerState and localConsumerState):
And it was not it, or it was hiding another one. Worker 1 parks. Note that, rarely, this does not lead to a deadlock. This, combined with the formal verification in #54, suggests that the deadlock is at the system level, probably some ordering issue between sending a steal request and receiving the waiting message, for example between trySteal and declineAll (declineAll processes the children's wait status): lines 121 to 129 in a064692. A rough sketch of that window follows below.
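The following is only a hypothetical illustration of that ordering hazard: the loop structure and mailbox are invented for the example, and only the names trySteal, declineAll and the "Waiting" message come from the code referenced above.

/* Hypothetical sketch of the suspected system-level ordering issue.
 * The mailbox and loop shape are assumptions; this is not Weave's code. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool childWaiting;      /* set when a child's "Waiting" message arrives */
} Mailbox;

static void trySteal(void) {
    /* (1) send our own steal request to other workers */
}

static void declineAll(Mailbox *m) {
    /* (2) decline incoming steal requests and process children that
     * reported "Waiting": a waiting child would be woken or terminated here */
    if (m->childWaiting) {
        printf("waking or terminating the waiting child\n");
        m->childWaiting = false;
    }
}

int main(void) {
    Mailbox m = { .childWaiting = false };

    trySteal();         /* step (1) */
    declineAll(&m);     /* step (2): the mailbox is still empty */

    /* The child's "Waiting" message only lands now, after declineAll has
     * already run, so it can only be handled on the next pass through the
     * loop. If every worker is itself out of work and keeps cycling through
     * (1) and (2) without new work appearing, nobody ever reaches the point
     * of waking the child: progress stalls as a livelock, not a deadlock. */
    m.childWaiting = true;
    return 0;
}

Again, this is only one possible shape of the window; the actual ordering in Weave's scheduling loop may differ.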
Okay, so first of all it's not a deadlock, it's a livelock. Beginning of investigation: here is the stacktrace sampling after a minute of livelock. Note: these are assumptions to get started with and to test various hypotheses:
Now we need to know when ... (lines 133 to 156 in a064692).
In summary, here is the livelock as I understand it:
Now questions:
Disclaimer: this analysis is not backed by anything but my VTune stacktraces, the code, and the unreliable screenshot from the previous post (unreliable because we can't rely on stdout prints from interleaved thread execution). I may have missed something. In short, I have no logs or timestamped call sequences.
Seems like with 2 threads on nqueens 11 we have a reliable deadlock :/