Removing empty nodes early breaks medium.com images #337
Comments
I think the tidy step is somewhat valuable (maybe we could even extend it to remove
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test void element with closing tag</title>
</head>
<body>
<picture>
<source srcset="[…]" type="image/webp"></source>
<source data-testid="og" srcset="[…]"></source>
<img alt="" class="bg mj mk c" width="651" height="478" loading="lazy" role="presentation">
</picture>
</body>
</html>
Needs to be uploaded as an XHTML file; direct input does not detect XML correctly. |
Removing |
I think the fastest is to add Funny I didn't see the |
Oh, I see what Kevin means now. The filter cleans up all elements without content, even when they are meaningful. This includes void elements but is not exclusive to them. For example, empty
Kevin mentioned two options: continuing to extend the blacklist, or removing the filter completely. Alternatively, we could switch to a whitelist of elements to remove when empty (e.g.
But the choice of action depends on the goals of Graby: do we want to preserve content even when it might be a mess, or do we want a clean content model at the cost of it being potentially incomplete? |
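To make the whitelist idea concrete, here is a minimal sketch in Python (not Graby's PHP, and not its actual routine). The tag set `REMOVE_WHEN_EMPTY`, the helper names, and the sample markup are all hypothetical, chosen only for illustration. Empty elements are removed only when their tag is on the list, so an empty source inside picture survives.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: tags assumed to carry no meaning when empty.
REMOVE_WHEN_EMPTY = {"p", "div", "span", "a", "strong", "em"}


def is_empty(el):
    """True if the element has no children and no non-whitespace text."""
    return len(el) == 0 and not (el.text or "").strip()


def remove_keep_tail(parent, child):
    """Remove child but keep the text that followed it in the parent."""
    tail = child.tail or ""
    idx = list(parent).index(child)
    if idx == 0:
        parent.text = (parent.text or "") + tail
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    parent.remove(child)


def prune_empty(parent, removable=REMOVE_WHEN_EMPTY):
    """Drop empty whitelisted elements, children first."""
    for child in list(parent):
        # Recurse first so a wrapper emptied by pruning is removed as well.
        prune_empty(child, removable)
        if child.tag in removable and is_empty(child):
            remove_keep_tail(parent, child)


# Illustrative, well-formed sample; the srcset value is a placeholder.
SAMPLE = (
    "<body>"
    "<p></p>"
    "<div><span> </span></div>"
    "<picture>"
    '<source srcset="a.webp" type="image/webp"></source>'
    '<img alt="" width="651" height="478"/>'
    "</picture>"
    "</body>"
)

root = ET.fromstring(SAMPLE)
prune_empty(root)
print(ET.tostring(root, encoding="unicode"))
```

In this sketch the empty p and the div wrapping an empty span are dropped, while source and img are kept because they are not on the list. The trade-off is the one raised above: anything missing from the whitelist stays in the output, even when it is junk.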
I'm 👍 for the whitelist |
Noted @j0k3r, I'll take care of that |
Considering the following snippet from a medium.com article:
Currently graby removes these source tags because of this routine in Graby.php:
However, not keeping these tags actually breaks images because img does not define any src path (thanks medium).
As this routine is run before ContentExtractor, we can't use find/replace in site-config to prevent that.
Thus, I see two ways to deal with it:
The former feels like an infinite pain, as we may see other exceptions over time; I would go for the latter imo.
Any thoughts @j0k3r @jtojnar?
On a side note, no, medium.com is not compliant with the HTML specification, as the source tag is a "void element"; see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/source#try_it