-
Notifications
You must be signed in to change notification settings - Fork 28
PB-398 Round resumption in Coordinator #285
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a general note: this mr currently contains quite some changes wich doesn't belong to the scope (like stuff for docker, logging, config) and which make the mr difficult to read. rebasing this feature branch on the latest dev branch should clean that up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank for your detailed explanations :)
As discussed it would be nice to add a few extra checks, but that looks good to me already.
1d7f46b
to
63869c0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During the testing of your PR I got the following error:
xain-fl-dev_1 | Traceback (most recent call last):
xain-fl-dev_1 | File "/usr/local/bin/coordinator", line 11, in <module>
xain-fl-dev_1 | load_entry_point('xain-fl', 'console_scripts', 'coordinator')()
xain-fl-dev_1 | File "/app/xain_fl/__main__.py", line 52, in main
xain-fl-dev_1 | ['ipv4:172.22.0.1:35554' 'ipv4:172.22.0.1:35544' 'ipv4:172.22.0.1:35548']
xain-fl-dev_1 | serve(coordinator=coordinator, server_config=config.server)
xain-fl-dev_1 | File "/app/xain_fl/serve.py", line 46, in serve
xain-fl-dev_1 | monitor_heartbeats(coordinator, terminate_event)
xain-fl-dev_1 | File "/app/xain_fl/coordinator/heartbeat.py", line 46, in monitor_heartbeats
xain-fl-dev_1 | coordinator.remove_participant(participant_id)
xain-fl-dev_1 | File "/app/xain_fl/coordinator/coordinator.py", line 233, in remove_participant
xain-fl-dev_1 | self.round.remove_selected(participant_id)
xain-fl-dev_1 | File "/app/xain_fl/coordinator/round.py", line 43, in remove_selected
xain-fl-dev_1 | self.participant_ids.remove(participant_id)
xain-fl-dev_1 | AttributeError: 'numpy.ndarray' object has no attribute 'remove'
What I did is:
- Set
min_participants = 3
- start 3 participants
- kill one participant after the first round
I will check why it happens.
The issues comes from In
Is called in
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works👍
* PB-398 add/remove selected in round * PB-398 order controller for selecting init segment of sorted * PB-398 split test_remove_participants; test_select_outstanding * PB-398 fix remove_participant; fix _handle_rendezvous * PB-398 preconditions; remove from list suggestion * PB-398 black format * PB-398 fix random controller non-list
References
PB-398
Summary
Adds proper support for round resumption in the Coordinator.
This merge request adds some resilience to the Coordinator in the face of Participant dropouts. In particular, when there is a dropout mid-round, the Coordinator pauses the round by going to standby. When enough Participants connect again, the round is resumed. While the idea of round resumption has been in the Coordinator for some time (and has recently been supported properly in the Participant too), up until now it has not been properly exercised, and this MR fixes the obvious faults, so that:
remove_participant
previously ignored the collection of selected Participants for a round; now it also takes that into account._handle_rendezvous
previously forced a complete re-selection each time a round resumes; the selection logic has been adjusted to only select from the connected pool while keeping the already selected survivors intact.In addition:
test_select_outstanding
tests the above functionality.test_remove_participant
has been split into 2 cases to properly test the above.is_finished
andget_weight_updates
of theRound
class.Are there any open tasks/blockers for the ticket?
This MR does not attempt to enable full fault-tolerance nor repair all the potential concurrency-related problems in the Coordinator e.g. ensuring various state collections are thread-safe, but these will be follow-up tasks going forward.
Reviewer checklist
Reviewer agreement:
Merge request checklist
XP-XXX <a description in imperative form>
.XP-XXX <a description in imperative form>
.Code checklist
XP-XXX-<a_small_stub>
.