PVF worker version mismatch during development #626
Comments
Thanks! I can look into it after the big refactor is merged in. :) |
I believe I just hit this, too. When I run
Also worth mentioning is that the error doesn't show up as ERROR in the output even though it is supposed to? https://github.com/paritytech/polkadot/blob/master/node/core/pvf/worker/src/common.rs#L53 |
Yes, exactly.
That's because it comes not from the node but from the worker, which is a separate process with its own logging environment. Not good, but logging from workers is a corner case, so it's not something critical. |
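For context, a minimal sketch of what worker-side logger initialization could look like, assuming the plain log/env_logger crates purely for illustration (the actual workers use the node's own logging setup, so all names here are assumptions):

```rust
use log::error;

fn main() {
    // Initialize a logger in the worker's entry point so `error!` lines
    // actually reach stderr, honouring whatever RUST_LOG the spawning
    // node passed down to the child process. Falls back to `error` level
    // if RUST_LOG is unset in the child.
    env_logger::Builder::from_env(
        env_logger::Env::default().default_filter_or("error"),
    )
    .init();

    if let Err(e) = run_worker() {
        error!("worker failed: {e}");
        std::process::exit(1);
    }
}

fn run_worker() -> Result<(), String> {
    // ... actual worker loop elided ...
    Ok(())
}
```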
This looks like it would fix it: https://stackoverflow.com/a/49077267/6085242. Also docs
And for future-proofing in case of similar bugs, we should add to the error message: "If restarting doesn't work, do …". The rebuilds are maybe not ideal, but this version check is needed... the alternative is to manually bump the node and worker versions on breaking changes, which seems like more work and really error-prone. |
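As a rough illustration of the check being discussed, here is a hypothetical sketch of the worker refusing to start on a version mismatch, with the suggested hint in the error message. The version source and the exact wording are assumptions, not the actual implementation:

```rust
// The node passes its own version string on the worker's command line,
// and the worker refuses to start if it was built from a different
// version. `NODE_VERSION` stands in for whatever build-time version or
// commit string the real binaries embed.
const NODE_VERSION: &str = env!("CARGO_PKG_VERSION");

fn check_worker_version(node_version_arg: &str) {
    if node_version_arg != NODE_VERSION {
        eprintln!(
            "Node and worker version mismatch (node: {node_version_arg}, worker: {NODE_VERSION}). \
             If you are running a validator, please restart the node. \
             If restarting doesn't work, try a clean rebuild of the workspace."
        );
        std::process::exit(1);
    }
}
```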
It's needed in production, not in development. |
@mrcnski can this one be closed now? |
@s0me0ne-unkn0wn sure, I believe it's been fixed (it's a tricky one and hard to test, but I did about two weeks of testing). We can always re-open if there are issues. |
I got a report that someone just ran into this by doing … Error message for searchability:
I will monitor how often this comes up. Probably a … Also, adding an env var that does the same thing as … |
Looking at this mess a little wider, I'm not sure anymore that we did the right thing with this version checking. We got a single report from someone who clearly messed things up himself by not following the upgrade guidelines, and we started fixing it right away, breaking the developer experience. We should have stopped for a moment and asked whether that pays off. Anyway, now we have workers split out, and this check becomes crucial. One idea is to incorporate the release version into the worker binary name, but I'm not sure how to implement it with Cargo manifests. |
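One hedged sketch of the binary-name idea: rather than trying to rename the built binary via Cargo, the node could compute the expected worker filename from its own embedded version and refuse to spawn anything else. All paths and names below are illustrative, not what the repo does:

```rust
use std::path::{Path, PathBuf};
use std::process::{Child, Command};

// The node looks for a worker file whose name embeds the node's own
// version, and fails fast with a clear message if it is missing.
fn expected_worker_path(bin_dir: &Path) -> PathBuf {
    let version = env!("CARGO_PKG_VERSION");
    bin_dir.join(format!("polkadot-prepare-worker-{version}"))
}

fn spawn_prepare_worker(bin_dir: &Path) -> std::io::Result<Child> {
    let path = expected_worker_path(bin_dir);
    if !path.exists() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::NotFound,
            format!(
                "worker binary {} not found; was it built for this node version?",
                path.display()
            ),
        ));
    }
    Command::new(path).spawn()
}
```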
Well, say you merge
I'm not sure what you mean by this. What guidelines? |
I guess we can have some mitigations for this, too, with a descriptive error for each one:
Maybe we can use some structured data like protobufs or something, and have it versioned, instead of reinventing the wheel. |
The whole version-checking story began when some validator operator upgraded the node in place, not restarting it, and the old |
I see, well the check turned out to be useful for the binary separation. It just needs some tweaking. Perhaps we are thinking about it wrong: We should be communicating with workers using structured data following versioned schemas, and erroring if the schema versions don't match, instead of checking build versions. |
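A minimal sketch of such a versioned handshake, assuming serde_json purely for illustration (the thread mentions protobufs; any self-describing format would do). Only the schema version is compared, so unrelated commits no longer trip the check:

```rust
use serde::{Deserialize, Serialize};

// Bumped only on breaking protocol changes.
const PROTOCOL_VERSION: u32 = 3;

#[derive(Serialize, Deserialize)]
struct Handshake {
    protocol_version: u32,
    node_release: String,
}

fn check_handshake(raw: &[u8]) -> Result<Handshake, String> {
    let hs: Handshake =
        serde_json::from_slice(raw).map_err(|e| format!("malformed handshake: {e}"))?;
    if hs.protocol_version != PROTOCOL_VERSION {
        return Err(format!(
            "protocol version mismatch: worker speaks {}, node speaks {}",
            hs.protocol_version, PROTOCOL_VERSION
        ));
    }
    Ok(hs)
}
```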
Well we should still rebuild the worker as often as possible and check the release version for sure, as worker logic may have changed. But it shouldn't be necessary to enforce every commit version during development, as Rob has said. |
The communication format is not the only concern, and not even the main one. It's more about Wasmtime versioning. If the preparation worker is compiled with Wasmtime 1.0.1 and the execution worker with 1.0.2, chances are the artifact cannot be executed by the latter, even though the communication format didn't change. |
I wonder if we can programmatically check the wasmtime version. It might be available to us at compile-time. Some quick Kagi'ing is inconclusive so needs more research. |
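One possible way to get at it, sketched under the assumption that a build script may shell out to cargo metadata and take serde_json as a build-dependency; this is not what the repo does, just an illustration of exposing the resolved wasmtime version at compile time:

```rust
// build.rs sketch: look up the wasmtime version this workspace resolved
// to and expose it as a compile-time env var, so both workers can embed
// and compare it.
use std::process::Command;

fn main() {
    let out = Command::new(std::env::var("CARGO").unwrap_or_else(|_| "cargo".into()))
        .args(["metadata", "--format-version", "1"])
        .output()
        .expect("failed to run cargo metadata");

    let metadata: serde_json::Value =
        serde_json::from_slice(&out.stdout).expect("cargo metadata emitted invalid JSON");

    let version = metadata["packages"]
        .as_array()
        .into_iter()
        .flatten()
        .find(|pkg| pkg["name"] == "wasmtime")
        .map(|pkg| pkg["version"].as_str().unwrap_or("unknown").to_owned())
        .unwrap_or_else(|| "unknown".to_owned());

    // Later, in the crate: compare env!("PVF_WASMTIME_VERSION") between workers.
    println!("cargo:rustc-env=PVF_WASMTIME_VERSION={version}");
}
```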
One more idea, even more straightforward, is that Wasmtime should produce a rather clear error code in that case; I don't remember for sure, but it was something like "Cannot deserialize". If we could propagate that error from the Substrate executor, we'd know for sure it's either a versioning problem or an artifact that is otherwise unusable (not written completely to disk, fs problems, etc.), and we shouldn't raise a dispute in that case. Even simpler, not propagating the error itself but just returning some error that we interpret as non-deterministic should also work. |
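A sketch of that error split, using wasmtime's Module::deserialize and illustrative error variants (not the node's actual error types); the point is only that a deserialization failure maps to an internal error rather than a dispute:

```rust
use wasmtime::{Engine, Module};

/// Illustrative error split: a failure to deserialize a compiled artifact
/// is treated as a local problem (stale or corrupt artifact, wasmtime
/// version skew), not as grounds for a dispute against the candidate.
enum ExecError {
    /// Do not dispute; re-prepare the artifact and retry.
    InternalArtifact(String),
    /// Genuine execution failure attributable to the candidate.
    CandidateFault(String),
}

fn load_artifact(engine: &Engine, bytes: &[u8]) -> Result<Module, ExecError> {
    // SAFETY: in the real node the artifact comes from our own
    // preparation worker, i.e. it is a trusted input.
    unsafe { Module::deserialize(engine, bytes) }
        .map_err(|e| ExecError::InternalArtifact(format!("cannot deserialize artifact: {e}")))
}
```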
For now, maybe we could gate the checking code itself in both workers with … @rphmeier, what's your experience: do you do debug or release builds during development? |
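Assuming the elided suggestion above was a debug-build gate, the check could be relaxed roughly like this (function names are illustrative):

```rust
// Only enforce the strict version check in release builds; downgrade it
// to a warning in debug builds so local development isn't blocked.
fn enforce_version_check(node_version: &str, worker_version: &str) {
    if node_version == worker_version {
        return;
    }
    if cfg!(debug_assertions) {
        eprintln!(
            "warning: node/worker version mismatch ({node_version} vs {worker_version}); \
             ignoring in debug build"
        );
    } else {
        eprintln!("node/worker version mismatch ({node_version} vs {worker_version})");
        std::process::exit(1);
    }
}
```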
That's a good idea and worth doing too; it should be a separate issue. I don't remember if we treat deserialization errors as internal already, or if they are "RuntimeConstruction" errors (this enum in Substrate is a mess and would need to be cleaned up first; I have an issue about it). A couple of issues I see, though:
|
I still think that what we have right now should work. Re-building on every commit pretty much guarantees there are no issues. In practice, though, cargo may turn out to be unreliable, so we should wait for more user reports. In the meantime we can start hardening the interface more, maybe first trying to catch wasmtime mismatches programmatically.
I believe he's said he does release builds because otherwise the tests are prohibitively slow. |
The original incident report says the validator started to dispute every candidate, so it's not treated as internal.
Well, I hope it can't, or at least it shouldn't. If they make breaking changes, they make artifacts from the previous versions non-deserializable, that's how they handle it. Any kind of bug may be introduced, of course, but remember, this software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 🙂 |
Haven't hit a version mismatch yet, nor received more reports about it. :] And this PR makes the check more lenient:
Please re-open if there are more reports! |
During the development process, if the code of the polkadot-node-core-pvf-worker package is changed, it can lead to a worker version mismatch error on the next run. This is probably caused by the polkadot-node-core-pvf package not being recompiled and keeping the old version. It should be fixed as it impedes development. This does not affect production or CI builds, nor any other clean builds.
Some additional details are available here: paritytech/polkadot#7253 (comment)