Ignore the host in rustc.verbose_version for metadata #7873
Cross-compiling the same target from different hosts should still produce the same output from rustc, but cargo effects a difference by hashing the full `rustc.verbose_version`, including `host: <triple>`. We can filter that particular line to allow different hosts to produce the same target metadata after all.
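To illustrate the idea, here is a minimal sketch of hashing the verbose version text with and without the `host:` line. It is not the code in this PR or in Cargo: the function names, the sample version text, and the use of `DefaultHasher` are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Made-up `rustc -vV`-style text for illustration only; the version,
/// commit hash, and field values are placeholders, not real compiler output.
fn sample_verbose_version(host: &str) -> String {
    format!(
        "rustc 0.0.0-example\nbinary: rustc\ncommit-hash: 0000000\nhost: {}\nrelease: 0.0.0-example\n",
        host
    )
}

/// Hash the whole text, `host:` line included (hypothetical helper).
fn hash_full_version(verbose_version: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    verbose_version.hash(&mut hasher);
    hasher.finish()
}

/// Hash the text line by line, skipping the `host: <triple>` line
/// (hypothetical helper showing the filtering idea).
fn hash_version_without_host(verbose_version: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    for line in verbose_version.lines() {
        if line.starts_with("host: ") {
            continue; // the host triple must not influence the result
        }
        line.hash(&mut hasher);
    }
    hasher.finish()
}
```

With two sample texts that differ only in the host triple, `hash_full_version` gives different values while `hash_version_without_host` gives the same value for both.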
r? @ehuss (rust_highfive has picked a reviewer for you, use r? to override) |
For more context -- in Fedora, I have requests to add cross Also, I'm not sure how to add any test cases for this... |
To make sure I understand this correctly, your goal here is to use the same compiler (revision, etc) compiled for different architectures. Each compiler, when targeting, for example wasm, should always produce the same output. Cargo, however, doesn't produce the same output because Does that sound right? If so I think this change is fine and would agree that there's not really a great way we could add a test for this. |
I'm a little concerned with how this might affect our longer term efforts to share artifacts between projects. If someone switches between different host compilers, it would overwrite the files, busting the cache. I don't really know how common that is, so it's hard to say if that's a realistic concern. I was also trying to think if this would affect rust-lang/rust bootstrapping. But I'm pretty sure each host is segregated into a separate directory by I'd like to see this documented as a footnote here. |
I think you have it, but I'll give a fuller example. We start a build with a single source rpm, and then each architecture builds some native rpms and some
So they all produce the same When we want to add something like wasm, it doesn't naturally belong to a given host, so it should probably be a |
I have no idea if switching hosts with the same build/cache dir would be common. I'm a little worried myself that this won't just affect cross-compiled artifacts, but that was necessary because the target metadata hash also depends on the metadata of all its dependencies, including build-deps. I'm open to alternatives, if you have any ideas!
Yes, with rustbuild it's separated into |
Oh ok, so this is a bug related to rustbuild (sorta) where you're running the rustc build system to produce a wasm target. That produces @ehuss hm I'm not sure what you mean by overwrite? Wouldn't this cause two different host compilers to share a cache where previously they wouldn't share anything?
At least as a data point we've had folks use a shared folder between VMs for build caches sometimes, but honestly our story there is bad due to filesystem lock management and stuff anyways. |
The host is still in the fingerprint, so when you switch hosts it'll bust the fingerprint and Cargo will rebuild and overwrite the old files. Hm, now I'm thinking of the case where some people may switch host compilers as an alternative to using Maybe, instead of hashing |
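A toy model of the overwrite concern, under the assumption that an artifact is stored under a filename that no longer depends on the host while its fingerprint still does. The `Cache` type and its `build` method are hypothetical and much simpler than Cargo's real fingerprinting.

```rust
use std::collections::HashMap;

/// Hypothetical cache: artifact filename -> fingerprint of the last build.
struct Cache {
    entries: HashMap<String, String>,
}

impl Cache {
    fn new() -> Self {
        Cache { entries: HashMap::new() }
    }

    /// Returns true when a rebuild was needed. A rebuild overwrites
    /// whatever was previously stored under the same filename.
    fn build(&mut self, filename: &str, fingerprint: &str) -> bool {
        let up_to_date =
            self.entries.get(filename).map(|s| s.as_str()) == Some(fingerprint);
        if !up_to_date {
            self.entries
                .insert(filename.to_string(), fingerprint.to_string());
        }
        !up_to_date
    }
}
```

In this model, alternating between two hosts with the same filename forces a rebuild on every switch, because each build replaces the entry the other host left behind.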
Yes, exactly. I'm wondering now if I could get away with just building these |
@ehuss ah right that makes sense. I think though that I think we could perhaps fix this by removing the |
Note that this does do file digest validation, and if the content doesn't match, it'll fail the packages build. |
Right. My concern is that without I think that is all solved if it hashes |
I'm not sure what rpmdiff's

    # ignore differences in file size, md5sum, and mtime
    # (files may have been generated at build time and contain
    # embedded dates or other insignificant differences)
    d = koji.rpmdiff.Rpmdiff(joinpath(basepath, first_rpm),
                             joinpath(basepath, other_rpm), ignore='S5TN')

If size and checksum are allowed to differ, then it seems the |
Even within the same host, they're only the exact same if you assume perfect reproducible builds. For the purpose of caching, they just need to be close enough to be compatible. |
FWIW, I brought this up on Fedora devel, and while there's some concern, it seems there are other packages doing something similar already. Still, it feels like there should still be a way to resolve these host/target differences vs caching. What if we split the difference between |
@ehuss oh right, I see what you mean. I think my leaning towards this is:
I think that'll fix @cuviper's use case while still preserving the behavior in the case @ehuss is thinking, right? |
☔ The latest upstream changes (presumably #7820) made this pull request unmergeable. Please resolve the merge conflicts. |
@alexcrichton if the host triple is still in the metadata for build scripts, proc-macros, etc., then that will also affect the target metadata when dependencies are hashed in. Any thoughts on separating the |
I implemented this last week. Here's a WIP: ehuss@b496870 It's not quite finished. My motivation was to see if there might be some way to get RUSTFLAGS back into the filename hash. I need to add the It's certainly something I could finish off. But for the sake of what this PR is trying to do, I'm a bit confused how it would help. If the goal is, "same artifacts for different host compilers", wouldn't different filenames defeat that? |
The filename would be based on the target triple, regardless of host. The filenames of transient host pieces (build deps) would be different according to host compiler, but the end target files should be the same, and those are what we would ship. Right? |
Ended up here from the Tock reproducibility effort, and wanted to comment quickly on:
One of the Tock core developers, @brghena, keeps everything in Dropbox, including all git repos, and frequently switches between a Linux desktop and a Mac laptop. I would expect him to run into this approximately daily, and I suspect that there are many who may keep repositories in something that does cross-machine file system sync'ing. |
The answer for me is that it's rare that I'm developing on both Mac and Linux. I'm usually doing code on Linux (at work) and random documentation tasks on Mac (at home), so I wouldn't say I run into the problem frequently. However, the answer is that I've learned to run |
Chiming in from #8140.
I'm not really familiar with this cargo syntax, but what is the purpose of switching the host compiler? Forgive my naive understanding, but to give another example, to me Why would one want to switch the host compiler on the same machine? My understanding is that on a Linux machine the Now, my naive understanding is that without specifying a target, the compiler will target its own host, so that you can In that case, in your example:
Maybe there could be some level of sharing between some targets to cache more things (same CPU? same kernel? same OS?), but by default it seems to me reasonable to (1) treat different targets completely independently and (2) when no In any case, I don't see how the host compiler should play a role in the determination of the metadata. My naive view of this is that it should only affect the default value of |
It's an alternate way to implicitly switch targets. One example I hit yesterday, running Cargo's own testsuite doesn't work when using
Cargo doesn't do that unless you specify
I don't think we can change that behavior. |
I find it surprising that using Yet, I find it weird to have both an explicit
Indeed, if everything is put in the same root folder by default, it makes sense to include the "target" (be it implicit from the host) into the hash to separate unrelated build artifacts.
Indeed, if Taking a step back on the discussion (emphasis mine):
Is there a scenario where one provides an explicit
would solve both the Are there other scenarios to take into account? I'm also not sure I understand the difference between |
A crate's metadata hash includes the metadata of all its dependencies. Even for a target crate, the dependencies may include crates compiled for the host, like proc-macros. So if your target uses something like
Right now they're identical, but the new proposal was to split them. If the internal metadata is hashed without any target information, that would solve the target-host dependency above. But then putting the final host/target information in the filename hash will ensure we still have distinct build artifacts for caching, shared |
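A rough sketch of that split, assuming a simplified unit graph. The `Unit` struct, the `compiled_for` field, the example crate names, and both hash functions are hypothetical; Cargo's real metadata has many more inputs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical compilation unit: a crate plus the triple it is built for.
struct Unit {
    name: String,
    compiled_for: String, // target triple, or the host triple for proc-macros
    deps: Vec<Unit>,
}

/// Internal metadata: no triple at all, dependencies hashed recursively.
fn metadata_hash(unit: &Unit) -> u64 {
    let mut h = DefaultHasher::new();
    unit.name.hash(&mut h);
    for dep in &unit.deps {
        metadata_hash(dep).hash(&mut h);
    }
    h.finish()
}

/// Filename hash: the metadata plus the triple this unit is compiled for.
fn filename_hash(unit: &Unit) -> u64 {
    let mut h = DefaultHasher::new();
    metadata_hash(unit).hash(&mut h);
    unit.compiled_for.hash(&mut h);
    h.finish()
}

/// Example graph: a wasm crate depending on a proc-macro built for the host.
fn example_graph(host: &str) -> Unit {
    Unit {
        name: "app".to_string(),
        compiled_for: "wasm32-unknown-unknown".to_string(),
        deps: vec![Unit {
            name: "some-proc-macro".to_string(),
            compiled_for: host.to_string(),
            deps: vec![],
        }],
    }
}
```

Under these assumptions the wasm crate gets the same metadata and the same filename on every host, while the proc-macro keeps a host-specific filename, so host artifacts stay distinct in a shared cache.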
I'm not familiar with this matter, although conceptually I find it weird that:
Regardless, I think it's an acceptable trade-off. At least in the use case I'm considering (Tock), we keep the number of external dependencies to a minimum and don't use procedural macros. I think similar projects (embedded systems, firmware) are likely to have a similarly small number of dependencies (due to code size and memory constraints, the requirement that dependencies are no_std compatible, etc.). Other firmware projects are also likely to be interested in binary reproducibility at some point.
This seems like it would work. But if artifacts are separated by filename, why do we need a metadata parameter at all? Assuming a cryptographically-secure hash, there is no risk that a single file corresponds to multiple incompatible configuration options: e.g. changing features would change the filename, changing dependencies would change the filename, changing the CPU arch would change the filename, etc. What does a variable metadata parameter bring on top? |
I guess it also helps isolate mangled symbol names, if nothing else. In fact, that's exactly how it's described in the output of
|
In practice it doesn't affect only symbol names, but also code generation (shuffle functions around, maybe even affects allocation of registers in CPU instructions, etc.). Hence the reproducibility problems (in my use case the symbols are already stripped from the binary we want to reproduce). |
In theory, the inputs to the metadata are all things that could affect codegen already.
Even if they'll be stripped, they still need to be unique for correct linking, especially when crate duplication is involved (w/ multiple semver-incompatible versions in the dependency graph). |
I'm gonna close this due to inactivity. Feel free to reopen or create a new PR when you've got time to work on this again. Thanks! |