Is this crate aiming to parse HTML? #238

Closed
untitaker opened this issue Oct 24, 2020 · 11 comments · Fixed by #239

Comments

@untitaker
Contributor

I've successfully used quick-xml to parse HTML, but I just noticed that quick-xml does not unescape things like &nbsp;. So I am not entirely sure whether quick-xml is generally supposed to handle HTML, or only partially, such that we can build our own parsers/unescapers on top.

@untitaker
Contributor Author

@tafia can you provide some guidance here? I am willing to write the patches. Also just found #230

@tafia
Owner

tafia commented Oct 28, 2020

You can parse a lot of HTML (by disabling end tag checks and using the HTML attributes), but I can't guarantee you can parse all of it.
For the unescaping issue, this is more of a bug (I didn't implement all possible unescape characters). I'd be happy to merge new ones when necessary.
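
To illustrate what unescaping additional named entities involves, here is a minimal standalone sketch. It does not use quick-xml's API: the names named_entity and unescape_html and the small entity table are assumptions made for illustration, and a full HTML unescaper would also need numeric references and the complete named entity list.

```rust
use std::borrow::Cow;

/// Hypothetical lookup table covering only a handful of HTML named entities.
fn named_entity(name: &str) -> Option<char> {
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        "nbsp" => Some('\u{00A0}'),
        "copy" => Some('\u{00A9}'),
        _ => None,
    }
}

/// Replaces `&name;` references found in `input`.
/// Unknown or malformed references are copied through unchanged.
fn unescape_html(input: &str) -> Cow<'_, str> {
    // Fast path: nothing to unescape, so no allocation is needed.
    if !input.contains('&') {
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        // Look for the terminating `;` and try to resolve the name before it.
        let resolved = after
            .find(';')
            .and_then(|semi| named_entity(&after[..semi]).map(|c| (c, semi)));
        match resolved {
            Some((c, semi)) => {
                out.push(c);
                rest = &after[semi + 1..];
            }
            None => {
                // Not a known entity: keep the `&` literally and continue after it.
                out.push('&');
                rest = after;
            }
        }
    }
    out.push_str(rest);
    Cow::Owned(out)
}
```

Under these assumptions, unescape_html("a&nbsp;b") yields the string with a U+00A0 no-break space between a and b, while an unknown reference such as &unknown; is left as-is.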

@untitaker
Contributor Author

I have created a PR at #239

@blankname

blankname commented Oct 28, 2020

You may run into problems with raw text elements, which may contain '<' characters in their content.

You can work around that by deconstructing the quick_xml::Reader using into_underlying_reader when you encounter one of those elements and then manually reading until the end tag for the element before recreating the quick_xml::Reader using from_reader.

An opt-in option to handle these cases could probably be added to quick-xml, but I don't know if that would be desired (or if there are other cases that would cause problems).

Edit:
See @tafia's response below. If you don't need the content of the raw text elements, there's no need to deconstruct the reader and read to the end tag yourself; you can just use the read_to_end method on quick_xml::Reader.

@untitaker
Contributor Author

I think there could be a case made for skipping over subtrees even in XML, purely on performance grounds.

@tafia
Owner

tafia commented Oct 31, 2020

You have the read_to_end method, but I am not sure how it is linked to @blankname's point.

@blankname

Thanks for mentioning that method, I had missed it.

My point regarding the "opt-in" was that something like:
pub fn handle_html_raw_text_elements(&mut self, val: bool) -> &mut Reader<B>
could be added to quick_xml::Reader to enable treating everything between the start and end tags of raw text elements (like script and style) as text, even if it looks like a nested element.

So here:

<script>
  let j = 2;
  for (let i = 0; i<j; i++) {}
</script>

the entire content of the script tag would be treated as a Text event instead of treating <j as the start of a nested element.

I'm not sure if that would be something worth adding.

It would be nice to have a version of read_to_end that would return everything read before the specified end tag was found, maybe read_to_end_as_text.

So that using reader.read_to_end_as_text(b"script") after encountering the <script> tag in the example above would return an Ok result containing:

  let j = 2;
  for (let i = 0; i<j; i++) {}
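
The proposed behaviour can be sketched outside of quick-xml as a function over a byte slice. This is only an illustration of the idea: read_to_end_as_text here is a hypothetical free function, not an existing quick_xml::Reader method, and it assumes an exact, case-sensitive match of the end tag name, which is simpler than what the HTML specification requires.

```rust
/// Hypothetical helper, not part of quick-xml. `input` is assumed to start
/// right after a raw text start tag such as `<script>`. Returns the raw text
/// before the matching end tag and the remainder after it, or `None` if no
/// end tag is found.
fn read_to_end_as_text<'a>(input: &'a [u8], name: &[u8]) -> Option<(&'a [u8], &'a [u8])> {
    let mut pos = 0;
    // Only positions with room for `</` plus the name can start an end tag.
    while pos + 2 + name.len() <= input.len() {
        let tail = &input[pos..];
        if tail.starts_with(b"</") && tail[2..].starts_with(name) {
            let after_name = &tail[2 + name.len()..];
            // Allow optional whitespace between the name and the closing `>`.
            let ws = after_name
                .iter()
                .take_while(|b| b.is_ascii_whitespace())
                .count();
            if after_name.get(ws) == Some(&b'>') {
                return Some((&input[..pos], &after_name[ws + 1..]));
            }
        }
        // Anything else, including `<j`, is treated as plain text.
        pos += 1;
    }
    None
}
```

Called with the content following the <script> tag above and b"script" as the name, this sketch returns the two lines of JavaScript as the first slice, with i<j left untouched.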

@tafia
Owner

tafia commented Nov 3, 2020

This is definitely something worth adding, at least behind a feature flag

@untitaker
Contributor Author

untitaker commented Nov 3, 2020

> It would be nice to have a version of read_to_end that would return everything read before the specified end tag was found

I think with #208 it could be a single method that returns the slice, as constructing and discarding a slice is not too bad. I wonder if read_to_end/read_to_end_as_text could be a single method even with buffering, perhaps a generic function that takes a "sink" type as a type parameter. I understand that it's hard to add this in a backwards-compatible manner, but I am concerned about API surface explosion.
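
A rough sketch of the "sink" idea, written against std::io rather than quick-xml: one generic function serves both the skipping and the collecting use case, depending on the sink passed in. The Sink trait, the Discard type, and read_until_tag are all hypothetical names, and the byte-at-a-time loop favours clarity over speed.

```rust
use std::io::{self, BufRead, Read};

/// Hypothetical trait: receives the bytes that precede the end tag.
trait Sink {
    fn accept(&mut self, bytes: &[u8]);
}

/// Discards everything, which mirrors a plain skip.
struct Discard;

impl Sink for Discard {
    fn accept(&mut self, _bytes: &[u8]) {}
}

/// Collects everything, which mirrors a text-returning variant.
impl Sink for Vec<u8> {
    fn accept(&mut self, bytes: &[u8]) {
        self.extend_from_slice(bytes);
    }
}

/// Reads from `reader` until the literal `end_tag` (for example `</script>`)
/// has been consumed, forwarding all preceding bytes to `sink`.
/// Returns `Ok(true)` if the tag was found and `Ok(false)` at end of input.
fn read_until_tag<R: BufRead, S: Sink>(
    reader: R,
    end_tag: &[u8],
    sink: &mut S,
) -> io::Result<bool> {
    if end_tag.is_empty() {
        return Ok(true);
    }
    // Sliding window over the most recent `end_tag.len()` bytes.
    let mut window: Vec<u8> = Vec::with_capacity(end_tag.len() + 1);
    for byte in reader.bytes() {
        window.push(byte?);
        if window.len() > end_tag.len() {
            // The oldest byte can no longer be part of a match.
            sink.accept(&window[..1]);
            window.remove(0);
        }
        if window.as_slice() == end_tag {
            return Ok(true);
        }
    }
    // End of input without a match: everything read was plain content.
    sink.accept(&window);
    Ok(false)
}
```

Passing a Vec<u8> as the sink collects the skipped content, while passing Discard behaves like a skip that allocates nothing for the content. Whether such a design would fit quick-xml's buffering model is left open here.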

@tafia
Owner

tafia commented Dec 18, 2020

Just for context, I don't like breaking changes, but I am definitely not against them as long as there is a good reason. Cargo makes handling breaking changes quite easy, and this part is not a central part of the lib.

@untitaker
Contributor Author

fwiw quick-xml didn't work out particularly well in the end so I created my own html parser: https://github.com/untitaker/html5gum


3 participants