Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing empty nodes early breaks medium.com images #337

Open
Kdecherf opened this issue Aug 12, 2023 · 6 comments
Open

Removing empty nodes early breaks medium.com images #337

Kdecherf opened this issue Aug 12, 2023 · 6 comments

Comments

@Kdecherf
Copy link
Collaborator

Considering the following snippet from a medium.com article:

<figure class="pd pe pf pg ph md mo mp paragraph-image">
   <div role="button" tabindex="0" class="pi pj ff pk bg pl">
      <div class="mo mp pc">
         <picture>
            <source srcset="[…]" type="image/webp"></source>
            <source data-testid="og" srcset="[…]"></source>
            <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
         </picture>
   </div>
</div>
<figcaption class="ml mm mn mo mp mq mr be b bf z dt">Dow</figcaption></figure>

Currently graby removes these source tags because of this routine in Graby.php:

        // Remove empty lines to avoid runaway evaluation of following regex on badly coded websites
        $re = '/^[ \t]*[\r\n]+/m';
        $htmlCleaned = preg_replace($re, '', $html);

However, not keeping these tags actually breaks images because img does not define any src path (thanks medium).

As this routine is run before ContentExtractor, we can't use find/replace in site-config to prevent that.

Thus, I see two ways to deal with it, whether:

  • exclude more nodes (in addition to iframe, td and th)
  • or remove this routine and stop removing empty nodes at this point

The former feels like a infinite pain as we may see other exceptions over time, I would go for the latter imo.

Any thoughts @j0k3r @jtojnar?

On a side note, no, medium.com is not compliant with HTML specification as the source tag is a "void element", see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source#try_it

@Kdecherf Kdecherf changed the title Removing empty nodes early break medium.com images Removing empty nodes early breaks medium.com images Aug 12, 2023
@jtojnar
Copy link
Collaborator

jtojnar commented Aug 12, 2023

I think the tidy step is somewhat valuable (maybe we could even extend it to remove img without src) so I would add source and other void elements to the list. It is not like new HTML elements appear that often.


Perhaps the medium page is XHTML5 page? [Edit: Nevermind, they do not close the img.] For XML parser, self-termination should be equivalent to no content followed by a closing tag. I verified it with validator on the following document:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test void element with closing tag</title>
</head>
<body>
 <picture>
  <source srcset="[…]" type="image/webp"></source>
  <source data-testid="og" srcset="[…]"></source>
  <img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
 </picture>
</body>
</html>

Needs to be uploaded as xhtml file, direct input does not detect XML correctly.

@Kdecherf
Copy link
Collaborator Author

maybe we could even extend it to remove img without src

Removing img without src attribute would break picture tags containing it, even if source tags are present, see wallabag/wallabag#6414 (comment)

@j0k3r
Copy link
Owner

j0k3r commented Aug 24, 2023

I think the fastest is to add source to the list.
About adding more elements to the list, is that one enough? https://developer.mozilla.org/en-US/docs/Glossary/Void_element

Funny I didn't see the iframe element in the list..

@jtojnar
Copy link
Collaborator

jtojnar commented Sep 4, 2023

Oh, I see what Kevin means now. The filter cleans up all elements without content, even when they are meaningful. This includes void elements but is not exclusive to them. For example, empty tds are needed for empty table cells because removing them would jumble the table. And while we could enumerate void elements quite easily, the non-void ones are much harder since there is probably no official list of elements that are meaningful without content.

Kevin mentioned continuing to extend the blacklist, and removing the filter completely as the options. Alternately, we could also switch to a whitelist of elements to remove when empty (e.g. p|span|font). That would give us at least some tidiness. We could also log all other empty elements not in the whitelist to allow us to grow coverage over time.

But the choice of action depends on the goals of Graby – do we want to preserve content even when it might be a mess, or do we want a clean content model at the cost of it being potentially incomplete?

@j0k3r
Copy link
Owner

j0k3r commented Oct 6, 2023

I'm 👍 for the whitelist

@Kdecherf
Copy link
Collaborator Author

Kdecherf commented Oct 6, 2023

Noted @j0k3r, I'll take care of that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants