-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect data section string literal hash collisions #77061
Detect data section string literal hash collisions #77061
Conversation
dd41ab3
to
f4c6064
Compare
src/Compilers/Core/Portable/CodeGen/PrivateImplementationDetails.cs
Outdated
Show resolved
Hide resolved
// If there is a hash collision, we cannot fallback to normal string literal emit strategy | ||
// because the selection of which literal would get which emit strategy would not be deterministic. | ||
var messageProvider = @this.ModuleBuilder.CommonCompilation.MessageProvider; | ||
diagnostics.Add(messageProvider.CreateDiagnostic(messageProvider.ERR_DataSectionStringLiteralHashCollision, syntaxNode.GetLocation(), previousText)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (commit 4)
Thinking more about this, I wonder if it's better to not emit an error.
|
if (previousText != text) | ||
{ | ||
// If there is a hash collision, we cannot fallback to normal string literal emit strategy | ||
// because the selection of which literal would get which emit strategy would not be deterministic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to delay the assignment of the type name that gets emitted into the binary until after the typedef token is assigned? Assuming that typedef tokens are deterministic, the name generated from the typedef token would be deterministic as well. Also, the unique names generated from the typedef tokens can be a lot shorter than the hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be possible, but I'm not sure the compiler's internal object model is ready for that, the Name of the symbol is likely used in other places before the typedef token is assigned.
Chuck has suggested another alternative where we would collect the string literals during binding, sort them by length and content, and then assign indices to them (and names based on that) just before emit.
Also, as I mentioned above, currently we share the machinery for synthesizing data fields with array initializers and u8 literals, and these fields are named using sha256. So ideally we would change those too so they also get names based on indices instead of hashes.
I can add that to the spec as future work and go with an error for now.
It is an error to have duplicate type names in an assembly. From ECMA-335: "There shall be no duplicate rows in the TypeDef table, based on TypeNamespace+TypeName (unless this is a nested type - see below) [ERROR]". The runtime behavior for malformed binaries is undefined. I understand why it happens to work fine in the current runtime. I do not think it is a good idea for the compiler to generate malformed binaries silently even if it happens to work at the moment. It is better to produce an error. We have number of similar corner-case situations where the user needs to alter their code to workaround the internal compiler limitations. For example, very complex expression may fail to compile and users need to alter their code to make it work. I have checked the behavior of a few tools on duplicate type names: ildasm/ilasm roundtrip fails, native aot compilation happens to handle it gracefully. I would not be surprised if we find a tool with silent bad codegen for malformed input with duplicate type names. |
Motivated by this discussion: #76139 (comment)
TODO: