Removing empty nodes early breaks medium.com images #337

Kdecherf · 2023-08-12T22:31:24Z

Considering the following snippet from a medium.com article:

<figure class="pd pe pf pg ph md mo mp paragraph-image">
   <div role="button" tabindex="0" class="pi pj ff pk bg pl">
      <div class="mo mp pc">
         <picture>
            <source srcset="[…]" type="image/webp"></source>
            <source data-testid="og" srcset="[…]"></source>
            <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
         </picture>
   </div>
</div>
<figcaption class="ml mm mn mo mp mq mr be b bf z dt">Dow</figcaption></figure>

Currently graby removes these source tags because of this routine in Graby.php:

        // Remove empty lines to avoid runaway evaluation of following regex on badly coded websites
        $re = '/^[ \t]*[\r\n]+/m';
        $htmlCleaned = preg_replace($re, '', $html);

However, not keeping these tags actually breaks images because img does not define any src path (thanks medium).

As this routine is run before ContentExtractor, we can't use find/replace in site-config to prevent that.

Thus, I see two ways to deal with it, whether:

exclude more nodes (in addition to iframe, td and th)
or remove this routine and stop removing empty nodes at this point

The former feels like a infinite pain as we may see other exceptions over time, I would go for the latter imo.

Any thoughts @j0k3r @jtojnar?

On a side note, no, medium.com is not compliant with HTML specification as the source tag is a "void element", see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source#try_it

The text was updated successfully, but these errors were encountered:

jtojnar · 2023-08-12T23:43:54Z

I think the tidy step is somewhat valuable (maybe we could even extend it to remove img without src) so I would add source and other void elements to the list. It is not like new HTML elements appear that often.

~~Perhaps the medium page is XHTML5 page?~~ [Edit: Nevermind, they do not close the img.] For XML parser, self-termination should be equivalent to no content followed by a closing tag. I verified it with validator on the following document:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test void element with closing tag</title>
</head>
<body>
 <picture>
  <source srcset="[…]" type="image/webp"></source>
  <source data-testid="og" srcset="[…]"></source>
  <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
 </picture>
</body>
</html>

Needs to be uploaded as xhtml file, direct input does not detect XML correctly.

Kdecherf · 2023-08-13T17:46:21Z

maybe we could even extend it to remove img without src

Removing img without src attribute would break picture tags containing it, even if source tags are present, see wallabag/wallabag#6414 (comment)

j0k3r · 2023-08-24T12:52:18Z

I think the fastest is to add source to the list.
About adding more elements to the list, is that one enough? https://developer.mozilla.org/en-US/docs/Glossary/Void_element

Funny I didn't see the iframe element in the list..

jtojnar · 2023-09-04T06:05:53Z

Oh, I see what Kevin means now. The filter cleans up all elements without content, even when they are meaningful. This includes void elements but is not exclusive to them. For example, empty tds are needed for empty table cells because removing them would jumble the table. And while we could enumerate void elements quite easily, the non-void ones are much harder since there is probably no official list of elements that are meaningful without content.

Kevin mentioned continuing to extend the blacklist, and removing the filter completely as the options. Alternately, we could also switch to a whitelist of elements to remove when empty (e.g. p|span|font). That would give us at least some tidiness. We could also log all other empty elements not in the whitelist to allow us to grow coverage over time.

But the choice of action depends on the goals of Graby – do we want to preserve content even when it might be a mess, or do we want a clean content model at the cost of it being potentially incomplete?

j0k3r · 2023-10-06T08:24:59Z

I'm 👍 for the whitelist

Kdecherf · 2023-10-06T09:01:41Z

Noted @j0k3r, I'll take care of that

Kdecherf changed the title ~~Removing empty nodes early break medium.com images~~ Removing empty nodes early breaks medium.com images Aug 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removing empty nodes early breaks medium.com images #337

Removing empty nodes early breaks medium.com images #337

Kdecherf commented Aug 12, 2023

jtojnar commented Aug 12, 2023 •

edited

Loading

Kdecherf commented Aug 13, 2023

j0k3r commented Aug 24, 2023

jtojnar commented Sep 4, 2023

j0k3r commented Oct 6, 2023

Kdecherf commented Oct 6, 2023

Removing empty nodes early breaks medium.com images #337

Removing empty nodes early breaks medium.com images #337

Comments

Kdecherf commented Aug 12, 2023

jtojnar commented Aug 12, 2023 • edited Loading

Kdecherf commented Aug 13, 2023

j0k3r commented Aug 24, 2023

jtojnar commented Sep 4, 2023

j0k3r commented Oct 6, 2023

Kdecherf commented Oct 6, 2023

jtojnar commented Aug 12, 2023 •

edited

Loading