XML Files with a Byte Order Mark should work #165

bcorrigan · 2017-12-11T23:19:49Z

Hello,

I too was affected by bug #155: when xml-rs attempts to parse any XML with a Byte Order Mark at the start, it fails with an error, even though this is valid UTF-8 XML.

This PR is my attempt to fix the issue. If it is not any good, let me know how to improve the implementation.

I fixed it by simply having it check for the BOM bytes in the edge case where it finds non-whitespace characters before the root tag, in outside_tag.rs.

I have also added a .gitattributes file because I found that the tests fail on Windows machines, as they expect Unix line endings.

Finally, I changed the fourth sample XML so that it has a UTF-8 BOM at the start; this constitutes the test.

I did not consider any other BOMs (e.g. the UTF-16 ones) because xml-rs only cares about UTF-8.

Thank you!


netvl · 2018-01-08T19:12:43Z

Sorry for the delay, my last month was a bit hectic :) And thank you for the pull request, I really appreciate it!

I see that you decided to handle this inside the parser itself, on the event level. I don't really like this approach, because it mixes handling of encoding and parsing. As I said in #10, ideally this should be handled completely transparently for the parser, on the level of the underlying Read instance, or maybe as a thin layer between the parser/lexer and the Read instance.

I actually started work on the new parser implementation in this branch, and I intend to incorporate the proper handling of encoding there, including the BOM, using the encoding_rs crate. What I suggest doing at this point is to create a thin wrapper over Read or BufRead which strips the BOM from the underlying stream, if it exists. I believe that this should be very easy to do, and it is entirely possible that there is such a library on crates.io already.
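For illustration, a wrapper of the kind suggested above might look roughly like the following. This is only a sketch using the standard library; `BomStripper` is a hypothetical name, not something that exists in xml-rs, and it handles only the UTF-8 BOM (`EF BB BF`).

```rust
use std::io::{self, Cursor, Read};

/// The UTF-8 encoding of U+FEFF, i.e. the UTF-8 byte order mark.
const UTF8_BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];

/// Hypothetical wrapper (name is an assumption): yields the bytes of `inner`
/// with a leading UTF-8 BOM removed, and passes everything else through.
pub struct BomStripper<R: Read> {
    inner: R,
    /// Bytes consumed while probing that were not a BOM and must be replayed.
    pending: Cursor<Vec<u8>>,
    /// Whether the start of the stream has already been inspected.
    probed: bool,
}

impl<R: Read> BomStripper<R> {
    pub fn new(inner: R) -> Self {
        BomStripper {
            inner,
            pending: Cursor::new(Vec::new()),
            probed: false,
        }
    }

    /// Reads up to three bytes from the start of the stream and discards
    /// them only if they are exactly the UTF-8 BOM.
    /// An I/O error during this probe is treated as fatal for the stream.
    fn probe(&mut self) -> io::Result<()> {
        let mut head = Vec::with_capacity(UTF8_BOM.len());
        // `take` + `read_to_end` copes with short reads and with streams
        // that are shorter than three bytes.
        (&mut self.inner).take(3).read_to_end(&mut head)?;
        if head != UTF8_BOM {
            self.pending = Cursor::new(head);
        }
        self.probed = true;
        Ok(())
    }
}

impl<R: Read> Read for BomStripper<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if !self.probed {
            self.probe()?;
        }
        // Replay any probed non-BOM bytes first, then defer to the source.
        let n = self.pending.read(buf)?;
        if n > 0 || buf.is_empty() {
            return Ok(n);
        }
        self.inner.read(buf)
    }
}
```

Because the wrapper itself implements `Read`, something like `BomStripper::new(file)` could presumably be handed to the parser in place of the raw source, so the parser never sees the BOM.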

lovasoa · 2020-05-19T16:52:10Z

I am ready to write the Read wrapper. Would you merge such a pull request?

lovasoa · 2021-02-17T14:28:36Z

Why did you close that?

bcorrigan · 2021-02-17T15:04:42Z

Because a) the owner of the repo made it clear that this is a non-preferred approach, and b) it had conflicts which I am unwilling to spend any time on now, after 3 years, since we have a workaround: we inspect XML files for various BOM markers and remove them prior to using this library.
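A workaround along those lines could be sketched as follows. This is not the code referred to in the comment above; `Bom` and `strip_bom` are hypothetical names, and the set of marks checked (UTF-8, UTF-16, UTF-32) is an assumption about what "various BOM markers" covers.

```rust
/// Byte order marks recognised by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Bom {
    Utf8,
    Utf16Be,
    Utf16Le,
    Utf32Be,
    Utf32Le,
}

/// Returns the detected BOM, if any, and the input with that BOM removed.
///
/// Note: stripping a UTF-16 or UTF-32 mark does not make the remaining
/// bytes UTF-8; the caller would still have to transcode them.
pub fn strip_bom(bytes: &[u8]) -> (Option<Bom>, &[u8]) {
    // Order matters: the UTF-32 LE mark starts with the UTF-16 LE mark,
    // so the longer marks are checked before the shorter ones.
    const MARKS: [(Bom, &[u8]); 5] = [
        (Bom::Utf8, &[0xEF, 0xBB, 0xBF]),
        (Bom::Utf32Be, &[0x00, 0x00, 0xFE, 0xFF]),
        (Bom::Utf32Le, &[0xFF, 0xFE, 0x00, 0x00]),
        (Bom::Utf16Be, &[0xFE, 0xFF]),
        (Bom::Utf16Le, &[0xFF, 0xFE]),
    ];
    for &(bom, mark) in MARKS.iter() {
        if let Some(rest) = bytes.strip_prefix(mark) {
            return (Some(bom), rest);
        }
    }
    (None, bytes)
}
```

For a UTF-8 file, the second element of the returned tuple is the slice that would then be passed on to the XML parser.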

bcorrigan added 2 commits December 11, 2017 17:25

Test documents need to have unix line endings on windows

84a58d5

If the XML has a BOM (byte order mark) at the start, just ignore it instead of throwing an error.

2160713

bcorrigan mentioned this pull request Dec 11, 2017

bug: epubs with a Byte Order Mark (BOM) are not parsed danigm/epub-rs#4

Closed

kvark mentioned this pull request May 8, 2020

Update khronos APIs to the latest, except for webgl brendanzab/gl-rs#518

Merged

bcorrigan closed this Feb 17, 2021
