Detect data section string literal hash collisions #77061

jjonescz · 2025-02-05T18:22:01Z

Motivated by this discussion: #76139 (comment)

TODO:

Update the spec.

src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs

src/Compilers/CSharp/Test/Emit/Emit/EmitMetadataTests.cs

src/Compilers/VisualBasic/Portable/Errors/MessageProvider.vb

src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs

src/Compilers/Core/Portable/Emit/EmitOptions.cs

AlekseyTs · 2025-02-07T17:40:04Z

src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs

+                    // If there is a hash collision, we cannot fallback to normal string literal emit strategy
+                    // because the selection of which literal would get which emit strategy would not be deterministic.
+                    var messageProvider = @this.ModuleBuilder.CommonCompilation.MessageProvider;
+                    diagnostics.Add(messageProvider.CreateDiagnostic(messageProvider.ERR_DataSectionStringLiteralHashCollision, syntaxNode.GetLocation(), previousText));


previousText: The string could be quite long. Would it make sense to truncate it?
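For readers following the thread, the check under review and the truncation suggested above can be sketched as follows. This is a minimal illustration, not the Roslyn implementation: `GetOrAddLiteral`, `Truncate`, `Sha256Key`, the 20-character limit, and the message text are all hypothetical, and the hash function is passed in only so that a collision can be forced.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-in for the map from hash-derived key to literal text.
var literalsByKey = new Dictionary<string, string>();
var errors = new List<string>();

// Assumed key scheme for this sketch: hex of SHA-256 over the UTF-8 bytes.
string Sha256Key(string text) =>
    Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

// Shorten long literals before they are placed into a message.
string Truncate(string text, int maxLength = 20) =>
    text.Length <= maxLength ? text : text.Substring(0, maxLength) + "...";

// Returns the key; records an error when a different literal already owns it.
string GetOrAddLiteral(string text, Func<string, string> hash)
{
    string key = hash(text);
    if (literalsByKey.TryGetValue(key, out var previousText))
    {
        if (previousText != text)
        {
            errors.Add($"Hash collision between \"{Truncate(previousText)}\" and \"{Truncate(text)}\"");
        }
    }
    else
    {
        literalsByKey.Add(key, text);
    }
    return key;
}

GetOrAddLiteral("hello", Sha256Key);      // real hash: no collision
GetOrAddLiteral("hello", _ => "forced");  // new key "forced", stores "hello"
GetOrAddLiteral("world", _ => "forced");  // same key, different text: error
Console.WriteLine(errors.Count);          // prints 1
```

Reporting the collision instead of silently picking one literal keeps the outcome independent of the order in which literals are encountered.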

AlekseyTs

LGTM (commit 4)

jjonescz · 2025-02-08T14:45:26Z

Thinking more about this, I wonder if it's better to not emit an error.

There are actually two potential collisions - one for the generated <S> type names and one for the generated data field. Currently this PR only handles the former one.
It seems that those collisions don't matter much - the code can be emitted and run without problems since the IL references types and fields by tokens, not names.
Emitting an error for a collision reintroduces the problem that the feature aims to solve - if a user hits the error, they cannot compile anymore.
Perhaps we could emit a warning instead? Some metadata inspection tools might have problems with the binary if it contains duplicate names, but that doesn't seem to be a big issue and could be left as future work.

jkotas · 2025-02-08T17:47:32Z

src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs

+                if (previousText != text)
+                {
+                    // If there is a hash collision, we cannot fallback to normal string literal emit strategy
+                    // because the selection of which literal would get which emit strategy would not be deterministic.


Would it be possible to delay the assignment of the type name that gets emitted into the binary until after the typedef token is assigned? Assuming that typedef tokens are deterministic, the name generated from the typedef token would be deterministic as well. Also, the unique names generated from the typedef tokens can be a lot shorter than the hash.

jjonescz · 2025-02-10

It might be possible, but I'm not sure the compiler's internal object model is ready for that: the Name of the symbol is likely used in other places before the typedef token is assigned.

Chuck has suggested another alternative where we would collect the string literals during binding, sort them by length and content, and then assign indices to them (and names based on that) just before emit.

Also, as I mentioned above, currently we share the machinery for synthesizing data fields with array initializers and u8 literals, and these fields are named using sha256. So ideally we would change those too so they also get names based on indices instead of hashes.

I can add that to the spec as future work and go with an error for now.
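The sort-then-index alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not compiler code: `AssignNames` and the `S{index}` naming scheme are hypothetical, and ordinal comparison is assumed for "content" ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assigns index-based names to the distinct literals. The result depends only
// on the set of literals, not on the order in which they were discovered.
Dictionary<string, string> AssignNames(IEnumerable<string> literals) =>
    literals
        .Distinct()
        .OrderBy(s => s.Length)                   // shorter literals first
        .ThenBy(s => s, StringComparer.Ordinal)   // then ordinal content order
        .Select((text, index) => new KeyValuePair<string, string>(text, $"S{index}"))
        .ToDictionary(p => p.Key, p => p.Value);

var firstOrder = AssignNames(new[] { "pear", "fig", "apple", "fig" });
var secondOrder = AssignNames(new[] { "apple", "fig", "pear" });
Console.WriteLine(firstOrder["fig"]);    // prints S0
Console.WriteLine(secondOrder["fig"]);   // prints S0
```

Because the names come from positions in a total order over distinct strings, two different literals can never share a name, and the names are typically much shorter than a hash.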

jkotas · 2025-02-08T17:56:52Z

It is an error to have duplicate type names in an assembly. From ECMA-335: "There shall be no duplicate rows in the TypeDef table, based on TypeNamespace+TypeName (unless this is a nested type - see below) [ERROR]". The runtime behavior for malformed binaries is undefined. I understand why it happens to work fine in the current runtime.

I do not think it is a good idea for the compiler to silently generate malformed binaries, even if that happens to work at the moment. It is better to produce an error. We have a number of similar corner-case situations where users need to alter their code to work around internal compiler limitations. For example, a very complex expression may fail to compile, and users need to alter their code to make it work.

I have checked the behavior of a few tools on duplicate type names: the ildasm/ilasm roundtrip fails, while Native AOT compilation happens to handle it gracefully. I would not be surprised if we find a tool that silently generates bad code for malformed input with duplicate type names.
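As an illustration of the ECMA-335 rule quoted above, a duplicate check over top-level type names could look like the sketch below. It works on a hypothetical in-memory list of rows rather than real metadata, `FindDuplicateTopLevelTypes` and the sample names are made up, and nested types are simply skipped here, whereas the real rule scopes them to their enclosing type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Returns every (namespace, name) pair that occurs more than once among
// non-nested rows. Nested rows are ignored in this simplified sketch.
List<(string Namespace, string Name)> FindDuplicateTopLevelTypes(
    IEnumerable<(string Namespace, string Name, bool IsNested)> rows) =>
    rows.Where(r => !r.IsNested)
        .GroupBy(r => (r.Namespace, r.Name))
        .Where(g => g.Count() > 1)
        .Select(g => g.Key)
        .ToList();

var sampleRows = new List<(string Namespace, string Name, bool IsNested)>
{
    ("", "<S>AAAA", false),
    ("", "<S>AAAA", false),   // same namespace and name: malformed
    ("", "<S>BBBB", false),
    ("", "Inner", true),
    ("", "Inner", true),      // nested rows are not checked here
};
var duplicates = FindDuplicateTopLevelTypes(sampleRows);
Console.WriteLine(duplicates.Count);   // prints 1
```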

jjonescz added the Area-Compilers and Feature - String Literals in Data Section as UTF8 labels Feb 5, 2025

dotnet-issue-labeler bot added the untriaged label Feb 5, 2025

jjonescz mentioned this pull request Feb 5, 2025

Add spec for the data section string literals feature #76139

Merged

dotnet-policy-service bot added the VSCode label Feb 5, 2025

Detect data section string literal hash collisions

f4c6064

jjonescz force-pushed the DataSectionStringLiterals-02-Collision branch from dd41ab3 to f4c6064 Compare February 6, 2025 10:18

jjonescz marked this pull request as ready for review February 6, 2025 15:01

jjonescz requested review from a team as code owners February 6, 2025 15:01

jjonescz requested review from AlekseyTs and cston February 6, 2025 15:01

ToddGrun reviewed Feb 6, 2025

src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs (outdated)

Move check into GetOrAdd

2e2c070

cston reviewed Feb 6, 2025

src/Compilers/CSharp/Test/Emit/Emit/EmitMetadataTests.cs

cston reviewed Feb 6, 2025

src/Compilers/VisualBasic/Portable/Errors/MessageProvider.vb

cston reviewed Feb 6, 2025

src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs

cston reviewed Feb 6, 2025

src/Compilers/Core/Portable/Emit/EmitOptions.cs (outdated)

jjonescz added 2 commits February 7, 2025 13:26

Remove VB error

92370bb

Remove test only option from equality

361a87b

AlekseyTs reviewed Feb 7, 2025

cston approved these changes Feb 7, 2025

AlekseyTs approved these changes Feb 7, 2025

jkotas reviewed Feb 8, 2025

jjonescz added 3 commits February 10, 2025 12:03

Truncate the string in the error message

0b01c31

Merge branch 'main' into DataSectionStringLiterals-02-Collision

a981d3f

Update the spec

8fba961

cston approved these changes Feb 10, 2025

Update the spec

595c3b4

jjonescz enabled auto-merge (squash) February 11, 2025 08:11

jjonescz merged commit 19c9b9e into dotnet:main Feb 11, 2025
28 checks passed

jjonescz deleted the DataSectionStringLiterals-02-Collision branch February 11, 2025 09:35

dotnet-policy-service bot added this to the Next milestone Feb 11, 2025

This was referenced Feb 14, 2025

[Automated] PRs inserted in VS build main-35813.71 #77217

Closed

[Automated] PRs inserted in VS build feature.debugger.main-35813.224 #77225

Closed

333fred mentioned this pull request Feb 18, 2025

Merge main to runtime async branch #77265

Merged

dotnet-bot mentioned this pull request Feb 20, 2025

[Automated] PRs inserted in VS build feature.idex.dev18roaming-35819.94 #77285

Closed

akhera99 modified the milestones: Next, 17.14 P2 Feb 25, 2025

dotnet-bot mentioned this pull request Feb 26, 2025

[Automated] PRs inserted in VS build feature.d18initial-10325.04 #77339

Closed
