-
Notifications
You must be signed in to change notification settings - Fork 628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve custom element tags #951
Comments
@ras0q can you provide some context as to what the usecase for this is? How are you using readability, and with what kind of content where these custom tags are important? I'm not aware of any part of readability that sort of checks nodenames against a list right now and throws them out if they do not match, so I'm having some difficulty understanding the reasoning for wanting this. Your PR (thank you!) also unfortunately (I think) does not quite do what it claims to do - but I don't know if it still works for the usecase you have (and the description of what you want does not quite match the usecase) or if it doesn't in some cases that you have maybe not run into yet (ie the description is correct but the patch behaviour is not quite always). I will comment in the PR as well to explain a bit more. For the ticket, I would really appreciate some background to help make a decision. :-) |
@gijsk Thank you for the reply! However, it seems my test case was lacking, and the following test passes without the it("should keep `mjx-container` tags", function () {
// https://mathjax.github.io/MathJax-demos-web/tex-chtml.html
var dom = new JSDOM(
`<p>
When <mjx-container class="MathJax CtxtMenu_Attached_0" jax="CHTML" tabindex="0" ctxtmenu_counter="0" style="font-size: 123.3%; position: relative;"><mjx-math class="MJX-TEX" aria-hidden="true"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2260"></mjx-c></mjx-mo><mjx-mn class="mjx-n" space="4"><mjx-c class="mjx-c30"></mjx-c></mjx-mn></mjx-math><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>≠</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container>, there are two solutions to <mjx-container class="MathJax CtxtMenu_Attached_0" jax="CHTML" tabindex="0" ctxtmenu_counter="1" style="font-size: 123.3%; position: relative;"><mjx-math class="MJX-TEX" aria-hidden="true"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-msup><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-script style="vertical-align: 0.363em;"><mjx-mn class="mjx-n" size="s"><mjx-c class="mjx-c32"></mjx-c></mjx-mn></mjx-script></mjx-msup><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2B"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D44F TEX-I"></mjx-c></mjx-mi><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2B"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D450 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-mn class="mjx-n" space="4"><mjx-c class="mjx-c30"></mjx-c></mjx-mn></mjx-math><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mi>b</mi><mi>x</mi><mo>+</mo><mi>c</mi><mo>=</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container> and they are
<mjx-container class="MathJax CtxtMenu_Attached_0" jax="CHTML" display="true" tabindex="0" ctxtmenu_counter="2" style="font-size: 123.3%; position: relative;"><mjx-math display="true" class="MJX-TEX" aria-hidden="true" style="margin-left: 0px; margin-right: 0px;"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-texatom space="4" texclass="ORD"><mjx-mfrac><mjx-frac type="d"><mjx-num><mjx-nstrut type="d"></mjx-nstrut><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2212"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44F TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-cB1"></mjx-c></mjx-mo><mjx-msqrt space="3"><mjx-sqrt><mjx-surd><mjx-mo class="mjx-n"><mjx-c class="mjx-c221A"></mjx-c></mjx-mo></mjx-surd><mjx-box style="padding-top: 0.087em;"><mjx-msup><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44F TEX-I"></mjx-c></mjx-mi><mjx-script style="vertical-align: 0.289em;"><mjx-mn class="mjx-n" size="s"><mjx-c class="mjx-c32"></mjx-c></mjx-mn></mjx-script></mjx-msup><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2212"></mjx-c></mjx-mo><mjx-mn class="mjx-n" space="3"><mjx-c class="mjx-c34"></mjx-c></mjx-mn><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D450 TEX-I"></mjx-c></mjx-mi></mjx-box></mjx-sqrt></mjx-msqrt></mjx-mrow></mjx-num><mjx-dbox><mjx-dtable><mjx-line type="d"></mjx-line><mjx-row><mjx-den><mjx-dstrut type="d"></mjx-dstrut><mjx-mrow><mjx-mn class="mjx-n"><mjx-c class="mjx-c32"></mjx-c></mjx-mn><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-den></mjx-row></mjx-dtable></mjx-dbox></mjx-frac></mjx-mfrac></mjx-texatom><mjx-mo class="mjx-n"><mjx-c class="mjx-c2E"></mjx-c></mjx-mo></mjx-math><mjx-assistive-mml unselectable="on" display="block"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mi>x</mi><mo>=</mo><mrow data-mjx-texclass="ORD"><mfrac><mrow><mo>−</mo><mi>b</mi><mo>±</mo><msqrt><msup><mi>b</mi><mn>2</mn></msup><mo>−</mo><mn>4</mn><mi>a</mi><mi>c</mi></msqrt></mrow><mrow><mn>2</mn><mi>a</mi></mrow></mfrac></mrow><mo>.</mo></math></mjx-assistive-mml></mjx-container>
</p>`
);
var expected_xhtml = `<div id="readability-page-1" class="page"><p>
When <mjx-container jax="CHTML" tabindex="0" ctxtmenu_counter="0"><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>≠</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container>, there are two solutions to <mjx-container jax="CHTML" tabindex="0" ctxtmenu_counter="1"><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mi>b</mi><mi>x</mi><mo>+</mo><mi>c</mi><mo>=</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container> and they are
<mjx-container jax="CHTML" display="true" tabindex="0" ctxtmenu_counter="2"><mjx-assistive-mml unselectable="on" display="block"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mi>x</mi><mo>=</mo><mrow data-mjx-texclass="ORD"><mfrac><mrow><mo>−</mo><mi>b</mi><mo>±</mo><msqrt><msup><mi>b</mi><mn>2</mn></msup><mo>−</mo><mn>4</mn><mi>a</mi><mi>c</mi></msqrt></mrow><mrow><mn>2</mn><mi>a</mi></mrow></mfrac></mrow><mo>.</mo></math></mjx-assistive-mml></mjx-container>
</p></div>`;
var content = new Readability(dom.window.document, {
// Why the test passes without this option?
// tagsToPreserve: ["mjx-container"],
}).parse().content;
expect(content).eql(expected_xhtml);
}); |
It's because the If you know that all your content is like this you could pre-process to remove these (arguably wrong) attributes, and that might fix it? If all mathjax content is produced like this then I'd argue that ought to change, really, but perhaps I'm missing context about why that is done to mathjax content. I guess in theory we could add logic to exempt specifically mathjax stuff from the aria-hidden logic, if there is some good reason why that is being done in the first place... but it feels kinda yucky. |
@gijsk Would it be permissible to just append If the goal here is to retain equations rendered by MathJax (and I can't think why anyone wouldn't want that), then that might be one way to do it. So far it seems to be playing nicely with the demos over at MathJax-demos-web. |
I'm surprised that that would work - "candidate" nodes are normally meant to be containers that contain all the relevant content, or paragraphs / significant portions of relevant content (as readability will ultimately pick the "best" one of those to return as the "readable" content of its input). My impression / recollection was that the mathjax bits get removed by the "remove hidden nodes" logic, and so if we wanted to have a carve-out for mathjax then it might be better to put it in Lines 2699 to 2704 in 118f015
|
@gijsk I think you're right. So the basic structure of a typical MathJax element is: <mjx-container class="MathJax CtxtMenu_Attached_0" ...
<mjx-math aria-hidden="true" ...
<mjx-assistive-mml ... The mjx-math node is removed by Lines 1117 to 1128 in 118f015
This is where it gets weird. Somehow when Readability removes the mjx-math nodes it doesn't seem to affect the rendered equations and I'm not sure why. As long as Readability retains mjx-container > mjx-assistive-mml, the rendered equation remains. But if you visit the input URL and remove mjx-math manually, the equation disappears. At the end of the day this isn't really relevant because it's easy enough to add an exception for the tag in |
The Readability parser currently does not preserve custom element tags such as
<my-custom-tag>
during the parsing process. These custom tags are stripped out, which can result in the loss of important semantic information and custom functionDesired Feature
Add an option to specify tag names which should be preserved during the parsing.
The text was updated successfully, but these errors were encountered: