Is this crate aiming to parse HTML? #238

untitaker · 2020-10-24T20:16:27Z

I've successfully used quick-xml to parse HTML, however I just noticed that quick-xml does not unescape things like &nbsp;. So I am not entirely sure whether quick-xml is meant to handle HTML in general, or only partially, such that we can build our own parsers/unescapers on top.


untitaker · 2020-10-26T19:07:24Z

@tafia can you provide some guidance here? I am willing to write the patches. Also just found #230

tafia · 2020-10-28T00:53:46Z

You can parse a lot of HTML (by disabling end tag checks and using the HTML attributes), but I can't guarantee you can parse all of it.
As for the unescaping issue, this is more of a bug (I didn't implement all possible escape sequences). I'd be happy to merge new ones when necessary.
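To make the unescaping gap concrete, here is a minimal std-only sketch of an entity unescaper. It is not quick-xml's implementation: the function names `unescape_html` and `resolve_entity` are hypothetical, and the entity table is deliberately tiny (the full HTML list has over two thousand names). Unknown or malformed references are passed through unchanged.

```rust
/// Replace a small, illustrative set of named HTML entities, plus numeric
/// character references, with the characters they stand for.
/// Hypothetical helper for illustration; not part of quick-xml.
fn unescape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let tail = &rest[amp..];
        // A candidate entity runs from '&' up to the next ';'.
        let resolved = tail
            .find(';')
            .and_then(|semi| resolve_entity(&tail[1..semi]).map(|ch| (ch, semi + 1)));
        match resolved {
            Some((ch, consumed)) => {
                out.push(ch);
                rest = &tail[consumed..];
            }
            None => {
                // Not a recognised entity: keep the '&' literally and move on.
                out.push('&');
                rest = &tail[1..];
            }
        }
    }
    out.push_str(rest);
    out
}

/// Map an entity name (the text between '&' and ';') to a character.
fn resolve_entity(name: &str) -> Option<char> {
    match name {
        "amp" => Some('&'),
        "lt" => Some('<'),
        "gt" => Some('>'),
        "quot" => Some('"'),
        "apos" => Some('\''),
        "nbsp" => Some('\u{00A0}'),
        _ => {
            // Numeric references: "#65" (decimal) or "#x41" (hexadecimal).
            let digits = name.strip_prefix('#')?;
            let code = match digits.strip_prefix('x').or_else(|| digits.strip_prefix('X')) {
                Some(hex) => u32::from_str_radix(hex, 16).ok()?,
                None => digits.parse::<u32>().ok()?,
            };
            char::from_u32(code)
        }
    }
}
```

A real implementation would need the complete entity table from the HTML specification; the structure above only shows where such a table would plug in.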

untitaker · 2020-10-28T13:31:53Z

I have created a PR at #239

blankname · 2020-10-28T18:39:03Z

You may run into problems with raw text elements, which may contain '<' characters in their content.

You can work around that by deconstructing the quick_xml::Reader with into_underlying_reader when you encounter one of those elements, manually reading until the element's end tag, and then recreating the quick_xml::Reader with from_reader.

An opt-in option to handle these cases could probably be added to quick-xml, but I don't know if that would be desired (or if there are other cases that would cause problems).

Edit:
See @tafia's response below: if you don't need the content of the raw text elements, there's no need to deconstruct the reader and read to the end tag yourself. You can just use the read_to_end method on quick_xml::Reader.

untitaker · 2020-10-28T18:49:43Z

I think there could be a case made for skipping over subtrees even in XML, purely on performance grounds.

tafia · 2020-10-31T10:18:22Z

You have the read_to_end method, but I am not sure how it is linked to @blankname's point.

blankname · 2020-10-31T16:28:54Z

Thanks for mentioning that method, I had missed it.

My point regarding the "opt-in" was that something like:
pub fn handle_html_raw_text_elements(&mut self, val: bool) -> &mut Reader<B>
could be added to quick_xml::Reader to enable treating everything between the start and end tags of raw text elements (like script and style) as text, even if it looks like a nested element.

So here:

<script>
  let j = 2;
  for (let i = 0; i<j; i++) {}
</script>

the entire content of the script tag would be treated as a Text event instead of treating <j as the start of a nested element.

I'm not sure if that would be something worth adding.

It would be nice to have a version of read_to_end that would return everything read before the specified end tag was found, maybe read_to_end_as_text.

So using reader.read_to_end_as_text(b"script") after encountering the <script> tag in the example above would return an Ok result containing:

  let j = 2;
  for (let i = 0; i<j; i++) {}
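For illustration, the behaviour described above can be sketched over a plain std::io::BufRead, which is roughly what you would have in hand after deconstructing the reader. This is only a sketch under assumptions: the free function `read_raw_text_until_end_tag` is hypothetical (it is not a quick-xml method), and it matches the exact bytes `</name>` case-sensitively, without the whitespace tolerance a real HTML tokenizer needs.

```rust
use std::io::{self, BufRead};

/// Read from `reader` until the closing tag `</name>` and return everything
/// that came before it. The closing tag itself is consumed.
/// Hypothetical helper for illustration; not part of quick-xml.
fn read_raw_text_until_end_tag<R: BufRead>(reader: &mut R, name: &[u8]) -> io::Result<Vec<u8>> {
    let mut end_tag = Vec::with_capacity(name.len() + 3);
    end_tag.extend_from_slice(b"</");
    end_tag.extend_from_slice(name);
    end_tag.push(b'>');

    let mut text = Vec::new();
    loop {
        // Every end tag finishes with '>', so it is enough to read up to the
        // next '>' and check whether the buffer now ends with the end tag.
        let read = reader.read_until(b'>', &mut text)?;
        if read == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "end tag not found",
            ));
        }
        if text.ends_with(&end_tag) {
            text.truncate(text.len() - end_tag.len());
            return Ok(text);
        }
    }
}
```

Because '<' inside the content is never interpreted, the `i<j` in the script above is returned as ordinary text, and the reader is left positioned just after `</script>`.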

tafia · 2020-11-03T03:50:53Z

This is definitely something worth adding, at least behind a feature flag

untitaker · 2020-11-03T09:46:11Z

> It would be nice to have a version of read_to_end that would return everything read before the specified end tag was found

I think with #208 it could be a single method that returns the slice, as constructing and discarding a slice is not too bad. I wonder if read_to_end/read_to_end_as_text could be a single method even with buffering, perhaps a generic function that takes a "sink" type as a type parameter. I understand that it's hard to add this in a backwards-compatible manner, but I am concerned about API surface explosion.
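A rough std-only sketch of what such a sink-parameterised function could look like. Every name here (`Sink`, `Discard`, `write_bytes`, `read_to_end_into`) is hypothetical and chosen for illustration; none of it is quick-xml API. The idea is that one function covers both the skipping case and the collecting case, depending on the sink passed in.

```rust
use std::io::{self, BufRead};

/// Hypothetical sink abstraction: receives the bytes that precede the end tag.
trait Sink {
    fn write_bytes(&mut self, bytes: &[u8]);
}

/// Sink that throws everything away, which makes the call a plain skip.
struct Discard;

impl Sink for Discard {
    fn write_bytes(&mut self, _bytes: &[u8]) {}
}

/// Sink that keeps everything, which makes the call return the raw text.
impl Sink for Vec<u8> {
    fn write_bytes(&mut self, bytes: &[u8]) {
        self.extend_from_slice(bytes);
    }
}

/// Read until `</name>`, forwarding everything before it to `sink`.
/// Only one '>'-terminated chunk is buffered at a time, so the discarding
/// case does not accumulate the skipped content.
fn read_to_end_into<R: BufRead, S: Sink>(
    reader: &mut R,
    name: &[u8],
    sink: &mut S,
) -> io::Result<()> {
    let mut end_tag = Vec::with_capacity(name.len() + 3);
    end_tag.extend_from_slice(b"</");
    end_tag.extend_from_slice(name);
    end_tag.push(b'>');

    let mut chunk = Vec::new();
    loop {
        chunk.clear();
        if reader.read_until(b'>', &mut chunk)? == 0 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "end tag not found",
            ));
        }
        // The end tag contains a single '>' at its end, so it can never be
        // split across two chunks.
        if chunk.ends_with(&end_tag) {
            sink.write_bytes(&chunk[..chunk.len() - end_tag.len()]);
            return Ok(());
        }
        sink.write_bytes(&chunk);
    }
}
```

Passing `&mut Discard` gives skip semantics, while passing a `&mut Vec<u8>` collects the text, so the API surface stays at one function plus a small trait.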

tafia · 2020-12-18T09:52:16Z

Just for context, I don't like breaking changes, but I am definitely not against them as long as there is a good reason. Cargo makes handling breaking changes quite easy, and this part is not a central part of the lib.

untitaker · 2021-11-25T13:18:04Z

fwiw quick-xml didn't work out particularly well in the end so I created my own html parser: https://github.com/untitaker/html5gum

untitaker mentioned this issue Oct 28, 2020

fix: Unescape all existing HTML entities #239

Merged

tafia closed this as completed in #239 Oct 31, 2020

r10s mentioned this issue Jan 2, 2021

forbid active content when displaying html-messages deltachat/deltachat-core-rust#2127

Closed
