Detect data section string literal hash collisions #77061

Merged
merged 8 commits into from
Feb 11, 2025

Member

@jjonescz jjonescz commented Feb 5, 2025

Motivated by this discussion: #76139 (comment)

TODO:

  • Update the spec.

@jjonescz jjonescz force-pushed the DataSectionStringLiterals-02-Collision branch from dd41ab3 to f4c6064 Compare February 6, 2025 10:18
@jjonescz jjonescz marked this pull request as ready for review February 6, 2025 15:01
@jjonescz jjonescz requested review from a team as code owners February 6, 2025 15:01
@jjonescz jjonescz requested review from AlekseyTs and cston February 6, 2025 15:01
// If there is a hash collision, we cannot fallback to normal string literal emit strategy
// because the selection of which literal would get which emit strategy would not be deterministic.
var messageProvider = @this.ModuleBuilder.CommonCompilation.MessageProvider;
diagnostics.Add(messageProvider.CreateDiagnostic(messageProvider.ERR_DataSectionStringLiteralHashCollision, syntaxNode.GetLocation(), previousText));
Contributor

previousText

The string could be quite long. Would it make sense to truncate it?

Contributor

@AlekseyTs AlekseyTs left a comment

LGTM (commit 4)

@jjonescz
Member Author

jjonescz commented Feb 8, 2025

Thinking more about this, I wonder if it's better to not emit an error.

  1. There are actually two potential collisions - one for the generated <S> type names and one for the generated data field. Currently this PR only handles the former one.
  2. It seems that those collisions don't matter much - the code can be emitted and run without problems since the IL references types and fields by tokens, not names.
  3. Emitting an error for a collision reintroduces the problem that the feature aims to solve - if a user hits the error, they cannot compile anymore.
  4. Perhaps we could emit a warning? I guess some metadata inspection tools could have problems with the binary if there are duplicate names. But that doesn't seem to be such a big issue and hence could be left as future work.
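
To illustrate the first kind of collision in the list above, here is a minimal C# sketch, not the compiler's actual code. `NameFor`, `FindCollisions`, the `S_` prefix, and the `hexChars` truncation are all hypothetical; the hash is shortened only so that collisions are easy to provoke in a small sample.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical naming scheme: a prefix plus the first hexChars hex digits
// of the SHA-256 hash of the literal. The real compiler's scheme may differ.
string NameFor(string text, int hexChars)
{
    byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(text));
    return "S_" + Convert.ToHexString(hash)[..hexChars];
}

// Reports every literal whose synthesized name is already taken by a
// different literal. Identical literals share a name and are not reported.
List<(string Text, string PreviousText)> FindCollisions(IEnumerable<string> literals, int hexChars)
{
    var nameToText = new Dictionary<string, string>();
    var collisions = new List<(string Text, string PreviousText)>();
    foreach (string text in literals)
    {
        string name = NameFor(text, hexChars);
        if (nameToText.TryGetValue(name, out var previousText))
        {
            if (previousText != text)
            {
                collisions.Add((text, previousText));
            }
        }
        else
        {
            nameToText.Add(name, text);
        }
    }
    return collisions;
}

// With one hex digit there are only 16 possible names, so 17 distinct
// literals must produce at least one collision.
var sampleLiterals = new List<string>();
for (int n = 0; n < 17; n++)
{
    sampleLiterals.Add("literal" + n);
}
var sampleCollisions = FindCollisions(sampleLiterals, hexChars: 1);
Console.WriteLine($"{sampleCollisions.Count} collision(s) among {sampleLiterals.Count} literals");
```

With the full 64-digit hash a collision is extremely unlikely, which is why the question here is how to react to one rather than how often it happens.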

if (previousText != text)
{
    // If there is a hash collision, we cannot fallback to normal string literal emit strategy
    // because the selection of which literal would get which emit strategy would not be deterministic.
Member

Would it be possible to delay the assignment of the type name that gets emitted into the binary until after the typedef token is assigned? Assuming that typedef tokens are deterministic, the name generated from the typedef token would be deterministic as well. Also, the unique names generated from the typedef tokens can be a lot shorter than the hash.

Member Author

It might be possible, but I'm not sure the compiler's internal object model is ready for that; the Name of the symbol is likely used in other places before the typedef token is assigned.

Chuck has suggested another alternative where we would collect the string literals during binding, sort them by length and content, and then assign indices to them (and names based on that) just before emit.

Also, as I mentioned above, we currently share the machinery for synthesizing data fields with array initializers and u8 literals, and these fields are named using SHA-256 hashes. So ideally we would change those too, so that they also get names based on indices instead of hashes.

I can add that to the spec as future work and go with an error for now.
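
The index-based alternative described above (collect the literals, sort them by length and content, then assign indices) could be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: `AssignIndexNames` and the `S` + index name format are hypothetical, and ordinal comparison is assumed for the "content" ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: gives every distinct literal a name derived from its
// position in a deterministic ordering (length first, then ordinal content).
Dictionary<string, string> AssignIndexNames(IEnumerable<string> collected)
{
    return collected
        .Distinct()
        .OrderBy(s => s.Length)
        .ThenBy(s => s, StringComparer.Ordinal)
        .Select((s, index) => (Text: s, Name: "S" + index))
        .ToDictionary(p => p.Text, p => p.Name);
}

// The same set of literals yields the same names regardless of the order
// in which they were encountered.
var firstRun = AssignIndexNames(new[] { "bb", "a", "ab" });
var secondRun = AssignIndexNames(new[] { "ab", "bb", "a", "a" });
Console.WriteLine(firstRun["bb"] == secondRun["bb"]);
```

Because names come from indices rather than hashes, two different literals can never receive the same name, and the names are much shorter than a hash.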

@jkotas
Member

jkotas commented Feb 8, 2025

It is an error to have duplicate type names in an assembly. From ECMA-335: "There shall be no duplicate rows in the TypeDef table, based on TypeNamespace+TypeName (unless this is a nested type - see below) [ERROR]". The runtime behavior for malformed binaries is undefined. I understand why it happens to work fine in the current runtime.

I do not think it is a good idea for the compiler to generate malformed binaries silently, even if it happens to work at the moment. It is better to produce an error. We have a number of similar corner-case situations where the user needs to alter their code to work around internal compiler limitations. For example, a very complex expression may fail to compile, and users need to alter their code to make it work.

I have checked the behavior of a few tools on duplicate type names: the ildasm/ilasm round trip fails, while Native AOT compilation happens to handle it gracefully. I would not be surprised if we find a tool with silent bad codegen for malformed input with duplicate type names.
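
As a rough illustration of the ECMA-335 rule quoted above, the sketch below checks a simplified model of TypeDef rows for duplicates. Everything in it is an assumption made for the example: `FindDuplicateTopLevelTypes`, the tuple shape, and the sample type names are hypothetical, and the separate rule for nested types is not modeled (nested rows are simply skipped).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified check: among non-nested rows, report every
// (TypeNamespace, TypeName) pair that occurs more than once.
List<(string Namespace, string Name)> FindDuplicateTopLevelTypes(
    IEnumerable<(string Namespace, string Name, bool IsNested)> typeDefRows)
{
    return typeDefRows
        .Where(row => !row.IsNested)
        .GroupBy(row => (row.Namespace, row.Name))
        .Where(g => g.Count() > 1)
        .Select(g => g.Key)
        .ToList();
}

// Hypothetical rows: two top-level types ended up with the same name.
var sampleRows = new[]
{
    ("", "<S>hash1", false),
    ("", "<S>hash1", false),
    ("", "Program", false),
    ("", "Helper", true),
    ("", "Helper", true),
};
var duplicateTypes = FindDuplicateTopLevelTypes(sampleRows);
Console.WriteLine($"{duplicateTypes.Count} duplicate top-level type name(s)");
```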

@jjonescz jjonescz enabled auto-merge (squash) February 11, 2025 08:11
@jjonescz jjonescz merged commit 19c9b9e into dotnet:main Feb 11, 2025
28 checks passed
@jjonescz jjonescz deleted the DataSectionStringLiterals-02-Collision branch February 11, 2025 09:35
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Feb 11, 2025
@akhera99 akhera99 modified the milestones: Next, 17.14 P2 Feb 25, 2025
6 participants