Figure out if anything is needed for better HTML integration #128

Closed
annevk opened this issue Dec 12, 2017 · 2 comments · Fixed by #203

Comments

@annevk
Member

annevk commented Dec 12, 2017

@hsivonen in whatwg/html#1077 (comment) raised a number of issues with the current integration points. They are insufficient for CSS, HTML, and presumably XML.

This might require some substantive changes to the hooks and perhaps other parts of the Encoding Standard, as well as standards that depend on the Encoding Standard (of which there are quite a few, so tread carefully).

Belatedly filing this to keep better track of it.

@annevk
Member Author

annevk commented Jan 13, 2018

I think I'd personally be okay if the standard just said that you have to wait for 1024 bytes before decoding; if implementations can optimize around that, that would be okay too. The difference should only be observable performance-wise, which seems acceptable, and we can encourage implementations to do the fast thing.

I think that remains true if we add encoding sniffing.
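For illustration only, here is a rough sketch of the "wait for 1024 bytes" idea: buffer incoming chunks until at least that many bytes are available (or the stream ends), then hand the buffered prefix to whatever picks the encoding. The name `split_prefix`, the constant `PRESCAN_LIMIT`, and the chunk-iterator interface are assumptions made for this sketch, not anything the Encoding Standard defines.

```python
from typing import Iterable, Iterator, Tuple

# Byte count taken from the comment above; the constant name is hypothetical.
PRESCAN_LIMIT = 1024


def split_prefix(
    chunks: Iterable[bytes], limit: int = PRESCAN_LIMIT
) -> Tuple[bytes, Iterator[bytes]]:
    """Buffer chunks until at least `limit` bytes are held or input ends.

    Returns the buffered prefix and an iterator over the chunks that
    have not been read yet.
    """
    it = iter(chunks)
    buffered = bytearray()
    for chunk in it:
        buffered += chunk
        if len(buffered) >= limit:
            break  # enough bytes to decide on an encoding
    return bytes(buffered), it


prefix, rest = split_prefix([b"a" * 600, b"b" * 600, b"c" * 10])
print(len(prefix))  # the first two chunks were buffered
```

An implementation that can decide earlier (for example because it already saw a BOM) could skip the wait; under the reasoning above, the only observable difference should be performance.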

Rewriting the specifications to have the proper abstractions would be somewhat nicer obviously, but seems like a lot more effort.

Note that we still have to change "decode" to also return the chosen encoding to the caller (and adjust any callers as appropriate).
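As a sketch of what "also return the chosen encoding" could look like, the function below returns both the decoded text and the encoding that was actually used, with a BOM taking precedence over the caller's fallback. The function name, the return shape, and the use of Python codec names are assumptions for illustration; this is not the standard's algorithm text.

```python
from typing import Tuple

# Byte order marks and the encodings they select (Python codec names).
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def decode_with_encoding(data: bytes, fallback: str) -> Tuple[str, str]:
    """Decode `data`, letting a BOM override `fallback`.

    Returns (text, encoding) so the caller learns which encoding won.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            # Strip the BOM and report the encoding it selected.
            return data[len(bom):].decode(name, errors="replace"), name
    return data.decode(fallback, errors="replace"), fallback


text, encoding = decode_with_encoding(b"\xef\xbb\xbfhi", "cp1252")
print(text, encoding)
```

A caller such as an HTML parser could then record `encoding` as the document's character encoding instead of assuming the fallback was used.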

@andreubotella
Member

I'm reopening the discussion about this feature in whatwg/html#1077 (comment)

annevk pushed a commit that referenced this issue Mar 24, 2020
This change moves the BOM splitting part of the decode hook into a separate hook which does not consume any bytes of the token stream.

This will allow fixing a long-standing issue in the HTML encoding sniffing algorithm with the document's character encoding being set to the wrong result when there is a BOM: whatwg/html#1077.

Closes #128.
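
To make the idea in the commit message concrete, here is a minimal sketch of a BOM check that reports the encoding without consuming any bytes. It assumes a seekable binary stream and restores the read position after looking at the first three bytes; the name `bom_sniff` and this stream interface are assumptions for the sketch and may not match how the hook is specified.

```python
import io
from typing import BinaryIO, Optional


def bom_sniff(stream: BinaryIO) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None.

    The stream position is restored, so no bytes are consumed.
    """
    start = stream.tell()
    head = stream.read(3)
    stream.seek(start)  # put back everything that was read
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    return None


stream = io.BytesIO(b"\xef\xbb\xbf<p>hello</p>")
print(bom_sniff(stream), stream.tell())
```

Because the check leaves the stream untouched, a caller can use the result to set the document's encoding first and still pass the full byte stream, BOM included, to the decoder afterwards.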