Async primary-secondary stepping #468
Currently, there is a lot of back-and-forth traffic between the simulation primary and the simulation secondary instances in order to keep them in complete lockstep. The current mechanism works roughly as follows (assuming the network is configured and connected before the main simulation starts): on each iteration the primary broadcasts a SimulationStep message (simulation time plus performer affinities), each secondary runs the step for the performers assigned to it and sends its SerializedStateMap back, and the primary aggregates that state and redistributes it so every secondary knows the location of all entities.
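For illustration, here is a single-process sketch of that blocking cycle, with two worker threads standing in for the secondaries. All names are made up for this sketch; the real implementation lives in NetworkManagerPrimary / NetworkManagerSecondary and communicates over ignition transport.

```cpp
// Single-process sketch of the primary/secondary lockstep described above.
// Illustrative only: the real code exchanges SimulationStep and
// SerializedStateMap messages over ignition transport.
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

struct Lockstep
{
  std::mutex mtx;
  std::condition_variable cv;
  uint64_t iteration{0};  // latest iteration announced by the primary
  int acksPending{0};     // secondaries that haven't reported state yet
};

void Secondary(Lockstep &l)
{
  uint64_t seen = 0;
  for (int i = 0; i < 3; ++i)
  {
    std::unique_lock<std::mutex> lk(l.mtx);
    // Round trip 1: wait for the primary's SimulationStep announcement.
    l.cv.wait(lk, [&] { return l.iteration > seen; });
    seen = l.iteration;
    lk.unlock();

    // ... run the update phases for the performers this secondary owns ...

    lk.lock();
    // Round trip 2: report our serialized state back to the primary.
    if (--l.acksPending == 0)
      l.cv.notify_all();
  }
}

int main()
{
  Lockstep l;
  std::thread s1(Secondary, std::ref(l));
  std::thread s2(Secondary, std::ref(l));
  for (int i = 0; i < 3; ++i)
  {
    std::unique_lock<std::mutex> lk(l.mtx);
    l.acksPending = 2;
    ++l.iteration;       // "broadcast" the next SimulationStep
    l.cv.notify_all();
    // The primary blocks here until *every* secondary has answered,
    // so the whole simulation runs as slow as the slowest secondary.
    l.cv.wait(lk, [&] { return l.acksPending == 0; });
    std::cout << "iteration " << i + 1 << " complete\n";
  }
  s1.join();
  s2.join();
}
```

The point of the sketch is the final wait: the primary cannot advance until the slowest secondary answers, which is exactly the blocking this issue proposes to relax.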
With some of the other work that we are doing with performers, we should be able to relax this constraint. The idea is that if secondaries are simulating performers that are far apart in the world (so that they can't see or physically interact with each other), those secondaries should be free to "run ahead" without having to worry about perfect lockstep.
Is that new work that's planned or things that are already in?
I think we may need a mechanism to rewind the simulation to a point where two performers meet. If one performer is running much faster than the other, it may pass through a level as if the other robot weren't there, but later we find out that they should have been there at the same time.
New work for me, while Ivan works on this.
This is true. I'm thinking that we could also have an upper limit on how far it can "run ahead" (on the order of seconds).
I've been reading the docs/code and running the distributed simulation examples. I have a bunch of basic questions (they might be too basic, sorry, but I'm still trying to understand the issue):
Does all that happen in the NetworkManagerPrimary::Step method? I see there's also a SerializedStateMap, but that seems to be shared from secondaries to the primary (I didn't see it being shared between secondaries, or from the primary to secondaries). I also didn't get what happens when a performer changes levels: is its affinity changed?
👍
👍 to a limit
Great! In case you missed it, this tutorial has a lot of info: https://ignitionrobotics.org/api/gazebo/4.0/distributedsimulation.html .
Yeah that drives the whole step cycle.
That includes time and affinities, see the SimulationStep message.
Exactly, that goes from secondaries to the primary, and the primary uses that information to reassign affinities and display the simulation to the GUI.
Not necessarily. The affinities only need to be updated when performers from different secondaries end up in the same level. In this case, one of them will need to be transferred to the other's secondary, so they can be simulated at the same time. The logic deciding the affinities is in NetworkManagerPrimary::PopulateAffinities, but it's incomplete, see #93.
The primary is responsible for reassigning these affinities and communicating the performer's state to the new secondary. At least that was the plan, I don't remember how much of that is already working.
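A hedged sketch of that reassignment step, under the simplification that an affinity is just a performer-to-secondary mapping. All types and names below are illustrative; the real (still incomplete, see #93) logic is in NetworkManagerPrimary::PopulateAffinities.

```cpp
// Illustrative affinity reassignment: if two performers from different
// secondaries share a level, move one of them (and, per the discussion
// above, all levels it is in) to the other's secondary.
#include <map>
#include <set>
#include <string>

using Performer = std::string;
using Secondary = std::string;
using Level = int;

void ReassignAffinities(
    const std::map<Performer, std::set<Level>> &_levels,
    std::map<Performer, Secondary> &_affinity)
{
  for (const auto &[p1, levels1] : _levels)
  {
    for (const auto &[p2, levels2] : _levels)
    {
      if (p1 >= p2 || _affinity[p1] == _affinity[p2])
        continue;
      for (Level lvl : levels1)
      {
        if (levels2.count(lvl))
        {
          // Transfer p2 to p1's secondary; the primary would also have
          // to ship p2's serialized state to that secondary.
          _affinity[p2] = _affinity[p1];
          break;
        }
      }
    }
  }
}
```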
Yes, a performer may be, and often is, in several levels at once. I believe that if a performer's affinity needs to be changed, all levels that it is in will need to come with it.
Good question. If that's enough to transplant a performer across secondaries, it should be enough to rewind. We're still missing a generic rewind mechanism, see #203. Once that's tackled it may help here.
Yeah, and maybe rewound if it's too far ahead in time.
One thing that Ian and I have been discussing is that (at least for SubT) robot performers don't need to get their affinity changed, and that we should just collocate physics simulation for those performers. The reason for this is that many of the systems have internal state that isn't currently stored in the ECM or serializable, so transplanting them may be troublesome.
Thanks for the answer, @chapulina! I have some extra questions/comments:
- What is the overall goal? Is it to reduce network traffic or to avoid blocking secondaries? (or both?) Currently the primary seems to be reconstructing the world state based on the secondaries' states.
- Something I was thinking is that instead of the primary checking if affinities need to be updated, the secondaries could notify it (i.e. whether a performer is entering a new level or not). I guess it's desired that the primary has the complete map state step to step, so the…
- Out of curiosity, how does distributed simulation work when you have plugins bridging with ROS/etc.? E.g.: if a secondary is stepping those plugins, rewinding doesn't seem like an option (e.g. robot mapping ...).
- It might be possible to allow secondaries to "run ahead" without needing to rewind, if there are some suppositions about "maximum performer speed", "maximum performer acceleration", etc. (those could be global, or individual to the performer); see the sketch after this list.
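A minimal sketch of that last idea, assuming only a single global speed bound (Pos, SafeRunAheadSeconds, vMax, and interactionRadius are all hypothetical names, not anything in ign-gazebo): if two performers are some distance apart and each moves at most vMax, the earliest they could possibly interact is gap / (2 * vMax) seconds away, so their secondaries could safely run ahead by that horizon.

```cpp
// Hypothetical run-ahead budget from a global maximum performer speed.
#include <algorithm>
#include <cmath>

struct Pos { double x, y, z; };

double Distance(const Pos &a, const Pos &b)
{
  return std::sqrt((a.x - b.x) * (a.x - b.x) +
                   (a.y - b.y) * (a.y - b.y) +
                   (a.z - b.z) * (a.z - b.z));
}

// Both performers can close the gap at up to vMax each, so the earliest
// possible contact is gap / (2 * vMax) seconds from now.
double SafeRunAheadSeconds(const Pos &p1, const Pos &p2,
                           double vMax, double interactionRadius)
{
  const double gap = Distance(p1, p2) - interactionRadius;
  return std::max(0.0, gap / (2.0 * vMax));
}
```

For example, two performers 100 m apart with vMax = 2 m/s and a 10 m interaction radius could safely run ahead by (100 - 10) / 4 = 22.5 simulated seconds.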
I think this already answers one of my questions 😄
Both, but mostly the second. At least in my original testing, we weren't seeing much benefit to the distribution because we were spending so much time transacting state. Also, the blocking means that the whole simulation runs as slow as the slowest secondary. Some of this is unavoidable, due to lockstepping, but ideally we could "smooth out" some of the longer secondary steps?
Right, so it doesn't really work in that case. Additionally, we don't have "ros-bridge" plugins in Ignition; everything comes out on an ignition topic, and then that gets bridged over to ROS messages. I suppose one alternative would be to keep the simulated state a few seconds ahead of the state that actually comes out on ignition transport? It would require more bookkeeping, but it would allow for rewinding without any discontinuity in the output state.
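A rough sketch of that bookkeeping, assuming a simple delayed-publication buffer (State, DelayedPublisher, and the publish callback are invented for this example, not part of ign-gazebo):

```cpp
// Keep the internal simulation a few seconds ahead of what is published,
// so a rewind never touches state that already went out on transport.
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>

struct State { int64_t simTimeNs; std::string serialized; };

class DelayedPublisher
{
public:
  DelayedPublisher(int64_t _delayNs, std::function<void(const State &)> _pub)
      : delayNs(_delayNs), pub(std::move(_pub)) {}

  // Called every simulated step with the freshly computed state.
  void Push(const State &_s)
  {
    this->buffer.push_back(_s);
    // Only states older than the delay window actually leave on transport.
    while (!this->buffer.empty() &&
           _s.simTimeNs - this->buffer.front().simTimeNs >= this->delayNs)
    {
      this->pub(this->buffer.front());
      this->buffer.pop_front();
    }
  }

  // Discard unpublished states newer than the rewind target.
  void RewindTo(int64_t _simTimeNs)
  {
    while (!this->buffer.empty() &&
           this->buffer.back().simTimeNs > _simTimeNs)
      this->buffer.pop_back();
  }

private:
  int64_t delayNs;
  std::function<void(const State &)> pub;
  std::deque<State> buffer;
};
```

Since rewinds only ever drop states that were never published, consumers on ignition transport see a continuous stream, just a few seconds behind the internal simulation.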
Thanks @mjcarroll !
That sounds like a good idea. I did some basic profiling of the network manager primary step function; it seems that stepping all primary systems is what blocks the longest (it took two orders of magnitude longer than updating the map state). Is it possible to check whether a performer is going to interact with another one without stepping all primary systems, i.e. directly after updating the primary map state with the secondaries' states?
Yeah I think we may need to do some tricks to collocate parts of the simulation, but not all of it. Rendering would also need to be collocated for sensors, right? And while part of the simulation, say physics and rendering, is being simulated on one secondary, I think we can't really escape from lock-stepping with other secondaries that are running controllers that rely on that data, right?
Yeah, that sounds like it could really speed things up. The primary is currently responsible for consolidating all state and reporting it to the GUI, so we would need an alternative mechanism for that. I think the GUI could handle combining state received directly from all secondaries, for example.
@mjcarroll some extra questions about this: who is currently publishing the state through ignition transport, both the primary and the secondaries, or only the primary?
State sync is going both directions in this case. The primary aggregates the state from all of the secondaries, and then redistributes it. This is so that each secondary ultimately knows the location of all entities.
I'm starting with the assumption that each secondary is responsible for one performer (or potentially more performers, but stick to one for this). That secondary is responsible for all of the systems attached to that performer. In steady state, this would include running the three update states (PreUpdate, Update, PostUpdate) of those systems.

I think we could let each secondary "run ahead" (or maybe "pre-simulate") for a certain amount of time, by running all iterations that would only have an…

We could let it go further by extrapolating into the future, but I think this would end up with more "rewinds" than is desirable? If we choose to run further, I believe we would also have to disable any…
I have done two experiments, here are some notes about them:
(I didn't push the code of the experiments because they are quite a hack 😃, but I can clean them up and push if needed)

My next experiment is going to be: let secondaries "go ahead" and send the results to the primary immediately. The primary aggregates results and, after a number of aggregated iterations, it sends an "ack" to the secondaries, so they can continue moving forward and can also clean the history of cached iterations that the primary already acknowledged (a sketch follows below).

@mjcarroll does that idea sound reasonable?
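A minimal sketch of that ack/caching scheme on the secondary side (IterationResult, SecondaryHistory, and the maxUnacked bound are made-up names; the actual send over ignition transport is stubbed out):

```cpp
// Secondary-side bookkeeping for the proposed ack protocol: step freely
// while the unacknowledged backlog is small, cache each result, and drop
// cached iterations once the primary acknowledges them.
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

struct IterationResult
{
  uint64_t iteration;
  std::vector<uint8_t> state;  // serialized ECM state for this iteration
};

class SecondaryHistory
{
public:
  explicit SecondaryHistory(uint64_t _maxUnacked) : maxUnacked(_maxUnacked) {}

  // The secondary may only step while the backlog is small, which bounds
  // how far it can run ahead of the primary.
  bool CanStep() const { return this->history.size() < this->maxUnacked; }

  // After each step: cache the result and send it to the primary right away.
  void RecordAndSend(IterationResult _res)
  {
    this->history.push_back(std::move(_res));
    // SendToPrimary(this->history.back());  // e.g. over ignition transport
  }

  // The primary acked everything up to _acked: clean the cached history.
  void OnAck(uint64_t _acked)
  {
    while (!this->history.empty() &&
           this->history.front().iteration <= _acked)
      this->history.pop_front();
  }

private:
  uint64_t maxUnacked;
  std::deque<IterationResult> history;
};
```

The maxUnacked bound doubles as the "upper limit on how far it can run ahead" discussed earlier in the thread.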
I will assume for the moment that secondaries are running systems whose state is completely serializable (IIUC they should only have an…).

I'm not sure how systems that have non-serializable state can be distributed; I see two options:
I imagine that the second approach will slow down the simulation a lot, because it's basically using the current lockstep approach.
To summarize the progress made:
There was a big performance improvement compared with the previous distributed simulation implementation when running on a single machine, but it's still slower than the standalone simulation (which will probably always be the case when using a single machine), and I didn't test on multiple machines. The next step would be to test the scripts running secondaries on multiple machines, to see whether, for a reasonable number of robots, there's an advantage to distributing the simulation.
Support asynchronous stepping between primary and secondary simulation instances.