Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML Files with a Byte Order Mark should work #165

Closed
wants to merge 2 commits into from
Closed

XML Files with a Byte Order Mark should work #165

wants to merge 2 commits into from

Conversation

bcorrigan
Copy link

Hello,

I too was affected by bug #155 - this bug means that when xml-rs attempts to parse any XML with a Byte Order Mark at the start, it fails and errors out, even though this is valid UTF-8 XML.

This PR is my attempt to fix the issue. If it is not any good, let me know how to improve the implementation.

I fixed it by simply having it check for the BOM bytes in the edge case where it finds non-whitespace characters before the root tag, in outside_tag.rs.

I have also added a .gitattributes file because I found tests fail on windows machines, as they expect unix line endings.

Finally I changed the fourth sample XML so that it has a UTF-8 bom mark at start, this constitutes the test.

I did not consider any other BOM marks (eg UTF-16 ones etc) because xml-rs only cares about utf-8.

Thank you!

@netvl
Copy link
Owner

netvl commented Jan 8, 2018

Sorry for the delay, my last month was a bit hectic :) And thank you for the pull request, I really appreciate it!

I see that you decided to handle this inside the parser itself, on the event level. I don't really like this approach, because it mixes handling of encoding and parsing. As I said in #10, ideally this should be handled completely transparently for the parser, on the level of the underlying Read instance, or maybe as a thin layer between the parser/lexer and the Read instance.

I actually started work on the new parser implementation in this branch, and I intend to incorporate the proper handling of encoding there, using the encoding_rs crate, including BOM. What I suggest to do at this point is to create a thin wrapper over Read or BufRead which strips the BOM from the underlying stream, if it exists. I believe that this should be very easy to do, and it is entirely possible that there is such a library on crates.io already.

@lovasoa
Copy link

lovasoa commented May 19, 2020

I am ready to write the Read wrapper. Would you merge such a pull request ?

@bcorrigan bcorrigan closed this Feb 17, 2021
@lovasoa
Copy link

lovasoa commented Feb 17, 2021

Why did you close that ?

@bcorrigan
Copy link
Author

Because a) owner of repo made it clear it is a non-preferred approach and b) it had conflicts which I am unwilling to spend any time on now after 3 years as we just have workaround to inspect XML files for various bom markers and remove prior to using this library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants