Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore the host in rustc.verbose_version for metadata #7873

Closed
wants to merge 1 commit into from

Conversation

cuviper
Copy link
Member

@cuviper cuviper commented Feb 6, 2020

Cross-compiling the same target from different hosts should still
produce the same output from rustc, but cargo effects a difference by
hashing the full rustc.verbose_version, including host: <triple>. We
can filter that particular line to allow different hosts to produce the
same target metadata after all.

Cross-compiling the same target from different hosts should still
produce the same output from rustc, but cargo effects a difference by
hashing the full `rustc.verbose_version`, including `host: <triple>`. We
can filter that particular line to allow different hosts to produce the
same target metadata after all.
@rust-highfive
Copy link

r? @ehuss

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 6, 2020
@cuviper
Copy link
Member Author

cuviper commented Feb 6, 2020

For more context -- in Fedora, I have requests to add cross rust-std subpackages for wasm and mingw. These would naturally be noarch rpms, not specific to a particular host architecture, but koji asserts that each separate arch build produces the same content for their noarch subpackages. Having the host in the metadata hash meant this was never the case.

Also, I'm not sure how to add any test cases for this...

@alexcrichton
Copy link
Member

To make sure I understand this correctly, your goal here is to use the same compiler (revision, etc) compiled for different architectures. Each compiler, when targeting, for example wasm, should always produce the same output. Cargo, however, doesn't produce the same output because -C metadata contains the compiler -Vv output, which notably includes the host: ... line which is the only real difference between the compilers.

Does that sound right?

If so I think this change is fine and would agree that there's not really a great way we could add a test for this.

@ehuss
Copy link
Contributor

ehuss commented Feb 6, 2020

I'm a little concerned with how this might affect our longer term efforts to share artifacts between projects. If someone switches between different host compilers, it would overwrite the files, busting the cache. I don't really know how common that is, so it's hard to say if that's a realistic concern.

I was also trying to think if this would affect rust-lang/rust bootstrapping. But I'm pretty sure each host is segregated into a separate directory by x.py, right?

I'd like to see this documented as a footnote here.

@cuviper
Copy link
Member Author

cuviper commented Feb 6, 2020

Does that sound right?

I think you have it, but I'll give a fuller example. We start a build with a single source rpm, and then each architecture builds some native rpms and some noarch rpms, like this:

  • buildArch (rust-1.41.0-1.fc32.src.rpm, x86_64)
    • cargo-1.41.0-1.fc32.x86_64.rpm
    • rust-1.41.0-1.fc32.x86_64.rpm
    • rust-std-static-1.41.0-1.fc32.x86_64.rpm
    • rust-gdb-1.41.0-1.fc32.noarch.rpm
  • buildArch (rust-1.41.0-1.fc32.src.rpm, aarch64)
    • cargo-1.41.0-1.fc32.aarch64.rpm
    • rust-1.41.0-1.fc32.aarch64.rpm
    • rust-std-static-1.41.0-1.fc32.aarch64.rpm
    • rust-gdb-1.41.0-1.fc32.noarch.rpm
  • etc

So they all produce the same noarch content (rust-gdb shown here), and after that there's a deduplication step that makes sure those are all the same, so we can just ship one instance of a given noarch.rpm. They don't have to be bit-for-bit identical, basically just the same file list.

When we want to add something like wasm, it doesn't naturally belong to a given host, so it should probably be a noarch package too. Then each of those separate builds would produce something like rust-std-static-wasm32-unknown-unknown-1.41.0-1.fc32.noarch.rpm. To pass the dedupe step, those need to have the same contents, which is where the current metadata trips us up.

@cuviper
Copy link
Member Author

cuviper commented Feb 6, 2020

I'm a little concerned with how this might affect our longer term efforts to share artifacts between projects. If someone switches between different host compilers, it would overwrite the files, busting the cache. I don't really know how common that is, so it's hard to say if that's a realistic concern.

I have no idea if switching hosts with the same build/cache dir would be common.

I'm a little worried myself that this won't just affect cross-compiled artifacts, but that was necessary because the target metadata hash also depends on the metadata of all its dependencies, including build-deps.

I'm open to alternatives, if you have any ideas!

I was also trying to think if this would affect rust-lang/rust bootstrapping. But I'm pretty sure each host is segregated into a separate directory by x.py, right?

Yes, with rustbuild it's separated into build/{build-triple}/...

@alexcrichton
Copy link
Member

Oh ok, so this is a bug related to rustbuild (sorta) where you're running the rustc build system to produce a wasm target. That produdes libstd-xxx.rlib for wasm, and you want xxx to be the same across all architectures whereas now it's different because we hash the host of the compiler into xxx?

@ehuss hm I'm not sure what you mean by overwrite? Wouldn't this cause to different host compilers to share a cache where previously they wouldn't share anything?

I have no idea if switching hosts with the same build/cache dir would be common.

At least as a data point we've had folks use a shared folder between VMs for build caches sometimes, but honestly our story there is bad due to filesystem lock management and stuff anyways.

@ehuss
Copy link
Contributor

ehuss commented Feb 7, 2020

@ehuss hm I'm not sure what you mean by overwrite? Wouldn't this cause to different host compilers to share a cache where previously they wouldn't share anything?

The host is still in the fingerprint, so when you switch hosts it'll bust the fingerprint and Cargo will rebuild and overwrite the old files.

Hm, now I'm thinking of the case where some people may switch host compilers as an alternative to using --target. For example, running cargo +stable-pc-windows-msvc build and cargo +stable-pc-windows-gnu build. That might be more common.

Maybe, instead of hashing unit.kind, the metadata could hash unit.kind.short_name()? Maybe if we do that if we should also consider changing the fingerprint hash?

@cuviper
Copy link
Member Author

cuviper commented Feb 7, 2020

Oh ok, so this is a bug related to rustbuild (sorta) where you're running the rustc build system to produce a wasm target. That produces libstd-xxx.rlib for wasm, and you want xxx to be the same across all architectures whereas now it's different because we hash the host of the compiler into xxx?

Yes, exactly.

I'm wondering now if I could get away with just building these noarch subpackages on one of the builders, so there's no difference to worry about. At first glance, I think koji will still collect that together the way we would want. I'll investigate further, especially making sure this wouldn't violate Fedora policy, but if that's okay I might not need any cargo change after all.

@alexcrichton
Copy link
Member

@ehuss ah right that makes sense. I think though that unit.kind vs unit.kind.short_name() only matters when you don't pass --target, right? In @cuviper's use case the target is wasm which I think means the host platform only shows up in version_verbose.

I think we could perhaps fix this by removing the host: line from version_verbose altogether? Or at least not hash it in both locations? (metadata and fingerprint)

@Conan-Kudo
Copy link

So they all produce the same noarch content (rust-gdb shown here), and after that there's a deduplication step that makes sure those are all the same, so we can just ship one instance of a given noarch.rpm. They don't have to be bit-for-bit identical, basically just the same file list.

Note that this does do file digest validation, and if the content doesn't match, it'll fail the packages build.

@ehuss
Copy link
Contributor

ehuss commented Feb 12, 2020

I think though that unit.kind vs unit.kind.short_name() only matters when you don't pass --target, right

Right. My concern is that without --target when you swap between host compilers (like the example of gnu vs msvc, or gnu vs musl), it will cause a full rebuild if the host is not in the metadata. If it is not in the fingerprint, then it won't recompile at all which I think is the wrong behavior.

I think that is all solved if it hashes unit.kind.short_name() for the target. (Assuming different host compilers produce the exact same output for the same target.)

@cuviper
Copy link
Member Author

cuviper commented Feb 12, 2020

@Conan-Kudo

Note that this does do file digest validation, and if the content doesn't match, it'll fail the packages build.

I'm not sure what rpmdiff's F (digest) actually represents. Koji says this:

        # ignore differences in file size, md5sum, and mtime
        # (files may have been generated at build time and contain
        #  embedded dates or other insignificant differences)
        d = koji.rpmdiff.Rpmdiff(joinpath(basepath, first_rpm),
            joinpath(basepath, other_rpm), ignore='S5TN')

If size and checksum are allowed to differ, then it seems the F must be something else, no? I think that refers to just the header digest, whereas file digests are ignored by flag 5.

@cuviper
Copy link
Member Author

cuviper commented Feb 12, 2020

(Assuming different host compilers produce the exact same output for the same target.)

Even within the same host, they're only the exact same if you assume perfect reproducible builds. For the purpose of caching, they just need to be close enough to be compatible.

@cuviper
Copy link
Member Author

cuviper commented Feb 13, 2020

I'm wondering now if I could get away with just building these noarch subpackages on one of the builders, so there's no difference to worry about.

FWIW, I brought this up on Fedora devel, and while there's some concern, it seems there are other packages doing something similar already.

Still, it feels like there should still be a way to resolve these host/target differences vs caching. What if we split the difference between -C metadata and -C extra-filename so only the latter includes any triple? So the internal metadata hash would include all of the target-independent information -- which is important for dependency metadata too so stuff like proc-macros don't infect host info into the target. Then the extra-filename can be generated from that and add just the target triple, or the host triple if there's no --target.

@alexcrichton
Copy link
Member

@ehuss oh right, I see what you mean. CompileKind::Host doesn't have any name information so there's otherwise no filename differences between cargo +gnu build and cargo +msvc build.

I think my leaning towards this is:

  • We should strip the host: ... information sooner to avoid forgetting to strip it at some point later
  • We should then also hash unit.kind.short_name() into the metadata

I think that'll fix @cuviper's use case while still preserving the behavior in the case @ehuss is thinking, right?

@bors
Copy link
Contributor

bors commented Feb 20, 2020

☔ The latest upstream changes (presumably #7820) made this pull request unmergeable. Please resolve the merge conflicts.

@cuviper
Copy link
Member Author

cuviper commented Feb 25, 2020

@alexcrichton if the host triple is still in the metadata for build scripts, proc-macros, etc., then that will also affect the target metadata when dependencies are hashed in.

Any thoughts on separating the metadata and extra-filename? Basically, unit.kind.short_name() would only need to go into the file name, leaving metadata "pure".

@ehuss
Copy link
Contributor

ehuss commented Feb 25, 2020

Any thoughts on separating the metadata and extra-filename?

I implemented this last week. Here's a WIP: ehuss@b496870

It's not quite finished. My motivation was to see if there might be some way to get RUSTFLAGS back into the filename hash. I need to add the remap-path-prefix filter back in. I was also wondering if Cargo should have first-class support for reproducible builds, and got sidetracked. 😉

It's certainly something I could finish off. But for the sake of what this PR is trying to do, I'm a bit confused how it would help. If the goal is, "same artifacts for different host compilers", wouldn't different filenames defeat that?

@cuviper
Copy link
Member Author

cuviper commented Feb 25, 2020

If the goal is, "same artifacts for different host compilers", wouldn't different filenames defeat that?

The filename would be based on the target triple, regardless of host. The filenames of transient host pieces (build deps) would be different according to host compiler, but the end target files should be the same, and those are what we would ship. Right?

@ppannuto
Copy link

Ended up here from the Tock reproducibility effort, and wanted to comment quickly on:

I'm a little concerned with how this might affect our longer term efforts to share artifacts between projects. If someone switches between different host compilers, it would overwrite the files, busting the cache. I don't really know how common that is, so it's hard to say if that's a realistic concern.

I have no idea if switching hosts with the same build/cache dir would be common.

One of the Tock core developers, @brghena, keeps everything in Dropbox, including all git repos, and frequently switches between a Linux desktop and a Mac laptop. I would expect him to run into this approximately daily, and I suspect that there are many who may keep repositories in something that does cross-machine file system sync'ing.

@brghena
Copy link

brghena commented Apr 21, 2020

The answer for me is that it's rare that I'm developing on both Mac and Linux. I'm usually doing code on Linux (at work) and random documentation tasks on Mac (at home), so I wouldn't say I run into the problem frequently. However, the answer is that I've learned to run make clean (or the equivalent depending on what project I'm in) when switching between host computers, otherwise the build system fails, usually complaining about an inability to link something or another.

@gendx
Copy link

gendx commented Apr 22, 2020

Chiming in from #8140.

Hm, now I'm thinking of the case where some people may switch host compilers as an alternative to using --target. For example, running cargo +stable-pc-windows-msvc build and cargo +stable-pc-windows-gnu build. That might be more common.

I'm not really familiar with this cargo syntax, but what is the purpose of switching the host compiler? Forgive my naive understanding, but to give another example, to me nightly-2020-02-03-x86_64-unknown-linux-gnu and nightly-2020-02-03-x86_64-apple-darwin are the same compiler (nightly-2020-02-03) from a logical point-of-view (they should produce the same binaries) but one runs on a compiling machine with the x86_64-unknown-linux-gnu host vs. x86_64-apple-darwin for the other one.

Why would one want to switch the host compiler on the same machine? My understanding is that on a Linux machine the x86_64-unknown-linux-gnu compiler works but the x86_64-apple-darwin compiler doesn't work.

Now, my naive understanding is that without specifying a target, the compiler will target its own host, so that you can cargo run or cargo test. So some of the build artifacts will be host-specific in that case. But shouldn't Cargo put them in target/<host_target>/ then? So that if someone copies a target directory from Linux to OSX the artifacts are not reused. Or from an x86 machine to an ARM machine the artifacts are not re-used (I don't see much what could be re-used between different CPU architectures)?

In that case, in your example:

  • cargo +stable-pc-windows-msvc build would build into target/pc-windows-msvc/,
  • cargo +stable-pc-windows-gnu build would build into target/pc-windows-gnu/.

Maybe there could be some level of sharing between some targets to cache more things (same CPU? same kernel? same OS?), but by default it seems to me reasonable to (1) treat different targets completely independently and (2) when no --target parameter is provided, be equivalent to --target <host_target>.

In any case, I don't see how the host compiler should play a role in the determination of the metadata. My naive view of this is that it should only affect the default value of --target when no target is explicitly provided - the rest of the compiler shouldn't have to care about the host. Is there any other reason for the compiler to care about the host?

@ehuss
Copy link
Contributor

ehuss commented Apr 22, 2020

what is the purpose of switching the host compiler?

It's an alternate way to implicitly switch targets. One example I hit yesterday, running Cargo's own testsuite doesn't work when using --target, so you have to switch hosts. Some hosts can run multiple architectures (i686 or x86_64, or gnu vs musl, or gnu vs msvc).

But shouldn't Cargo put them in target/<host_target>/ then?

Cargo doesn't do that unless you specify --target.

when no --target parameter is provided, be equivalent to --target <host_target>.

I don't think we can change that behavior.

@gendx
Copy link

gendx commented Apr 23, 2020

what is the purpose of switching the host compiler?

It's an alternate way to implicitly switch targets. One example I hit yesterday, running Cargo's own testsuite doesn't work when using --target, so you have to switch hosts. Some hosts can run multiple architectures (i686 or x86_64, or gnu vs musl, or gnu vs msvc).

I find it surprising that using --target doesn't work, but I'm far from familiar with the code. In that scenario switching the host makes sense.

Yet, I find it weird to have both an explicit --target and an implicit one via the host, and that they don't yield the same results.

But shouldn't Cargo put them in target/<host_target>/ then?

Cargo doesn't do that unless you specify --target.

Indeed, if everything is put in the same root folder by default, it makes sense to include the "target" (be it implicit from the host) into the hash to separate unrelated build artifacts.

when no --target parameter is provided, be equivalent to --target <host_target>.

I don't think we can change that behavior.

Indeed, if cargo +stable-pc-windows-msvc build behaves differently than cargo +stable --target pc-windows-msvc build we can't unify the implicit host target with the explicit one.


Taking a step back on the discussion (emphasis mine):

I'm wondering now if I could get away with just building these noarch subpackages on one of the builders, so there's no difference to worry about.

FWIW, I brought this up on Fedora devel, and while there's some concern, it seems there are other packages doing something similar already.

Still, it feels like there should still be a way to resolve these host/target differences vs caching. What if we split the difference between -C metadata and -C extra-filename so only the latter includes any triple? So the internal metadata hash would include all of the target-independent information -- which is important for dependency metadata too so stuff like proc-macros don't infect host info into the target. Then the extra-filename can be generated from that and add just the target triple, or the host triple if there's no --target.

Is there a scenario where one provides an explicit --target yet also expects the host to be taken into account? It seems to me that:

  • hashing the host (as is done now) if no --target is provided,
  • hashing the target instead if an explicit --target is provided,

would solve both the cargo +stable-pc-windows-msvc build use case (switching the host compiler on the same machine) and the cargo --target=thumbv7em-none-eabi build use case (fixed target on various hosts & various machines).

Are there other scenarios to take into account?


I'm also not sure I understand the difference between metadata and extra-filename, i.e. in which case should they be different?

@cuviper
Copy link
Member Author

cuviper commented Apr 23, 2020

A crate's metadata hash includes the metadata of all its dependencies. Even for a target crate, the dependencies may include crates compiled for the host, like proc-macros. So if your target uses something like serde_derive, then that host metadata affects the target metadata too.

I'm also not sure I understand the difference between metadata and extra-filename, i.e. in which case should they be different?

Right now they're identical, but the new proposal was to split them. If the internal metadata is hashed without any target information, that would solve the target-host dependency above. But then putting the final host/target information in the filename hash will ensure we still have distinct build artifacts for caching, shared target/, etc.

@gendx
Copy link

gendx commented Apr 24, 2020

A crate's metadata hash includes the metadata of all its dependencies. Even for a target crate, the dependencies may include crates compiled for the host, like proc-macros. So if your target uses something like serde_derive, then that host metadata affects the target metadata too.

I'm not familiar with this matter, although conceptually I find it weird that:

  • Macros can be a host-dependent library. Shouldn't they be host-agnostic?
  • Things processing the source code can be a library at all, rather than part of the compiler.

Regardless, I think it's an acceptable trade-off. At least in the use case I'm considering (Tock), we limit to the maximum the number of external dependencies, and don't use procedural macros.

I think similar projects (embedded systems, firmware) likely have a similar limited number of dependencies (due to code size & memory constraints, the requirement that dependencies are no_std compatible, etc.). And other firmware projects are also likely to be interested in binary reproducibility at some point.

I'm also not sure I understand the difference between metadata and extra-filename, i.e. in which case should they be different?

Right now they're identical, but the new proposal was to split them. If the internal metadata is hashed without any target information, that would solve the target-host dependency above. But then putting the final host/target information in the filename hash will ensure we still have distinct build artifacts for caching, shared target/, etc.

This seems like it would work.

But if artifacts are separated by filename, why do we need a metadata parameter at all? Assuming a cryptographically-secure hash, there is no risk that a single file corresponds to multiple incompatible configuration options: e.g. changing features would change the filename, changing dependencies would change the filename, changing the CPU arch would change the filename, etc.

What does a variable metadata parameter bring on top?

@cuviper
Copy link
Member Author

cuviper commented Apr 24, 2020

What does a variable metadata parameter bring on top?

I guess it also helps isolate mangled symbol names, if nothing else. In fact, that's exactly how it's described in the output of rustc -Chelp:

    -C                 metadata=val -- metadata to mangle symbol names with

@gendx
Copy link

gendx commented Apr 27, 2020

What does a variable metadata parameter bring on top?

I guess it also helps isolate mangled symbol names, if nothing else. In fact, that's exactly how it's described in the output of rustc -Chelp:

    -C                 metadata=val -- metadata to mangle symbol names with

In practice it doesn't affect only symbol names, but also code generation (shuffle functions around, maybe even affects allocation of registers in CPU instructions, etc.). Hence the reproducibility problems (in my use case the symbols are already stripped from the binary we want to reproduce).

@cuviper
Copy link
Member Author

cuviper commented Apr 27, 2020

In practice it doesn't affect only symbol names, but also code generation (shuffle functions around, maybe even affects allocation of registers in CPU instructions, etc.).

In theory, the inputs to the metadata are all things that could affect codegen already.

(in my use case the symbols are already stripped from the binary we want to reproduce).

Even if they'll be stripped, they still need to be unique for correct linking, especially when crate duplication is involved (w/ multiple semver-incompatible versions in the dependency graph).

@ehuss
Copy link
Contributor

ehuss commented Aug 6, 2021

I'm gonna close this due to inactivity. Feel free to reopen or create a new PR when you've got time to work on this again. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants