Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve custom element tags #951

Open
ras0q opened this issue Jan 23, 2025 · 6 comments · May be fixed by #952
Open

Preserve custom element tags #951

ras0q opened this issue Jan 23, 2025 · 6 comments · May be fixed by #952

Comments

@ras0q
Copy link

ras0q commented Jan 23, 2025

The Readability parser currently does not preserve custom element tags such as <my-custom-tag> during the parsing process. These custom tags are stripped out, which can result in the loss of important semantic information and custom function

Desired Feature

Add an option to specify tag names which should be preserved during the parsing.

var article = new Readability(document, {
  tagsToPreserve: ["my-custom-tag", "another-tag"],
}).parse();
@ras0q ras0q changed the title Preserve custom elements Preserve custom element tags Jan 23, 2025
ras0q added a commit to ras0q/readability that referenced this issue Jan 23, 2025
@ras0q ras0q linked a pull request Jan 23, 2025 that will close this issue
ras0q added a commit to ras0q/readability that referenced this issue Jan 23, 2025
@gijsk
Copy link
Contributor

gijsk commented Jan 27, 2025

@ras0q can you provide some context as to what the usecase for this is? How are you using readability, and with what kind of content where these custom tags are important?

I'm not aware of any part of readability that sort of checks nodenames against a list right now and throws them out if they do not match, so I'm having some difficulty understanding the reasoning for wanting this.

Your PR (thank you!) also unfortunately (I think) does not quite do what it claims to do - but I don't know if it still works for the usecase you have (and the description of what you want does not quite match the usecase) or if it doesn't in some cases that you have maybe not run into yet (ie the description is correct but the patch behaviour is not quite always). I will comment in the PR as well to explain a bit more.

For the ticket, I would really appreciate some background to help make a decision. :-)

@ras0q
Copy link
Author

ras0q commented Jan 31, 2025

@gijsk Thank you for the reply!
I was assuming a use case for MathJax v3, which includes a <math> tag in its own <mjx-container> tag to represent formulas in HTML.
I didn't think it made sense to write the MathJax support directly into mozilla/readability, so I decided to implement an option for a generic tag that holds.

However, it seems my test case was lacking, and the following test passes without the tagsToPreserve option
I don't understand why the <mjx-math> tags disappear and the <mjx-container> tags remain!
Could you please explain this principle to me?

    it("should keep `mjx-container` tags", function () {
      // https://mathjax.github.io/MathJax-demos-web/tex-chtml.html
      var dom = new JSDOM(
`<p>
  When <mjx-container class="MathJax CtxtMenu_Attached_0" jax="CHTML" tabindex="0" ctxtmenu_counter="0" style="font-size: 123.3%; position: relative;"><mjx-math class="MJX-TEX" aria-hidden="true"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c2260"></mjx-c></mjx-mo><mjx-mn class="mjx-n" space="4"><mjx-c class="mjx-c30"></mjx-c></mjx-mn></mjx-math><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>≠</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container>, there are two solutions to <mjx-container class="MathJax CtxtMenu_Attached_0" jax="CHTML" tabindex="0" ctxtmenu_counter="1" style="font-size: 123.3%; position: relative;"><mjx-math class="MJX-TEX" aria-hidden="true"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-msup><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-script style="vertical-align: 0.363em;"><mjx-mn class="mjx-n" size="s"><mjx-c class="mjx-c32"></mjx-c></mjx-mn></mjx-script></mjx-msup><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2B"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D44F TEX-I"></mjx-c></mjx-mi><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2B"></mjx-c></mjx-mo><mjx-mi class="mjx-i" space="3"><mjx-c class="mjx-c1D450 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-mn class="mjx-n" space="4"><mjx-c class="mjx-c30"></mjx-c></mjx-mn></mjx-math><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mi>b</mi><mi>x</mi><mo>+</mo><mi>c</mi><mo>=</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container> and they are
  <mjx-container class="MathJax CtxtMenu_Attached_0" jax="CHTML" display="true" tabindex="0" ctxtmenu_counter="2" style="font-size: 123.3%; position: relative;"><mjx-math display="true" class="MJX-TEX" aria-hidden="true" style="margin-left: 0px; margin-right: 0px;"><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D465 TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="4"><mjx-c class="mjx-c3D"></mjx-c></mjx-mo><mjx-texatom space="4" texclass="ORD"><mjx-mfrac><mjx-frac type="d"><mjx-num><mjx-nstrut type="d"></mjx-nstrut><mjx-mrow><mjx-mo class="mjx-n"><mjx-c class="mjx-c2212"></mjx-c></mjx-mo><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44F TEX-I"></mjx-c></mjx-mi><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-cB1"></mjx-c></mjx-mo><mjx-msqrt space="3"><mjx-sqrt><mjx-surd><mjx-mo class="mjx-n"><mjx-c class="mjx-c221A"></mjx-c></mjx-mo></mjx-surd><mjx-box style="padding-top: 0.087em;"><mjx-msup><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44F TEX-I"></mjx-c></mjx-mi><mjx-script style="vertical-align: 0.289em;"><mjx-mn class="mjx-n" size="s"><mjx-c class="mjx-c32"></mjx-c></mjx-mn></mjx-script></mjx-msup><mjx-mo class="mjx-n" space="3"><mjx-c class="mjx-c2212"></mjx-c></mjx-mo><mjx-mn class="mjx-n" space="3"><mjx-c class="mjx-c34"></mjx-c></mjx-mn><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D450 TEX-I"></mjx-c></mjx-mi></mjx-box></mjx-sqrt></mjx-msqrt></mjx-mrow></mjx-num><mjx-dbox><mjx-dtable><mjx-line type="d"></mjx-line><mjx-row><mjx-den><mjx-dstrut type="d"></mjx-dstrut><mjx-mrow><mjx-mn class="mjx-n"><mjx-c class="mjx-c32"></mjx-c></mjx-mn><mjx-mi class="mjx-i"><mjx-c class="mjx-c1D44E TEX-I"></mjx-c></mjx-mi></mjx-mrow></mjx-den></mjx-row></mjx-dtable></mjx-dbox></mjx-frac></mjx-mfrac></mjx-texatom><mjx-mo class="mjx-n"><mjx-c class="mjx-c2E"></mjx-c></mjx-mo></mjx-math><mjx-assistive-mml unselectable="on" display="block"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mi>x</mi><mo>=</mo><mrow data-mjx-texclass="ORD"><mfrac><mrow><mo>−</mo><mi>b</mi><mo>±</mo><msqrt><msup><mi>b</mi><mn>2</mn></msup><mo>−</mo><mn>4</mn><mi>a</mi><mi>c</mi></msqrt></mrow><mrow><mn>2</mn><mi>a</mi></mrow></mfrac></mrow><mo>.</mo></math></mjx-assistive-mml></mjx-container>
</p>`
      );
      var expected_xhtml = `<div id="readability-page-1" class="page"><p>
  When <mjx-container jax="CHTML" tabindex="0" ctxtmenu_counter="0"><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><mo>≠</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container>, there are two solutions to <mjx-container jax="CHTML" tabindex="0" ctxtmenu_counter="1"><mjx-assistive-mml unselectable="on" display="inline"><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>a</mi><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mi>b</mi><mi>x</mi><mo>+</mo><mi>c</mi><mo>=</mo><mn>0</mn></math></mjx-assistive-mml></mjx-container> and they are
  <mjx-container jax="CHTML" display="true" tabindex="0" ctxtmenu_counter="2"><mjx-assistive-mml unselectable="on" display="block"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mi>x</mi><mo>=</mo><mrow data-mjx-texclass="ORD"><mfrac><mrow><mo>−</mo><mi>b</mi><mo>±</mo><msqrt><msup><mi>b</mi><mn>2</mn></msup><mo>−</mo><mn>4</mn><mi>a</mi><mi>c</mi></msqrt></mrow><mrow><mn>2</mn><mi>a</mi></mrow></mfrac></mrow><mo>.</mo></math></mjx-assistive-mml></mjx-container>
</p></div>`;

      var content = new Readability(dom.window.document, {
        // Why the test passes without this option?
        // tagsToPreserve: ["mjx-container"],
      }).parse().content;
      expect(content).eql(expected_xhtml);
    });

@gijsk
Copy link
Contributor

gijsk commented Feb 4, 2025

It's because the mjx-math element has aria-hidden="true". I don't know why it does that - why is the content being hidden from assistive technology? Readability assumes that if that is the case, it's probably not semantically meaningful (as users of AT should have access to all the "important" content in the doc, ie the bits that are semantically meaningful).

If you know that all your content is like this you could pre-process to remove these (arguably wrong) attributes, and that might fix it?

If all mathjax content is produced like this then I'd argue that ought to change, really, but perhaps I'm missing context about why that is done to mathjax content. I guess in theory we could add logic to exempt specifically mathjax stuff from the aria-hidden logic, if there is some good reason why that is being done in the first place... but it feels kinda yucky.

@AlexTate
Copy link

@gijsk Would it be permissible to just append |MathJax to REGEXPS.okMaybeItsACandidate?

If the goal here is to retain equations rendered by MathJax (and I can't think why anyone wouldn't want that), then that might be one way to do it. So far it seems to be playing nicely with the demos over at MathJax-demos-web.

@gijsk
Copy link
Contributor

gijsk commented Feb 18, 2025

I'm surprised that that would work - "candidate" nodes are normally meant to be containers that contain all the relevant content, or paragraphs / significant portions of relevant content (as readability will ultimately pick the "best" one of those to return as the "readable" content of its input). My impression / recollection was that the mathjax bits get removed by the "remove hidden nodes" logic, and so if we wanted to have a carve-out for mathjax then it might be better to put it in

readability/Readability.js

Lines 2699 to 2704 in 118f015

//check for "fallback-image" so that wikimedia math images are displayed
(!node.hasAttribute("aria-hidden") ||
node.getAttribute("aria-hidden") != "true" ||
(node.className &&
node.className.includes &&
node.className.includes("fallback-image")))
, along with the one that's already there for wikipedia? :-)

@AlexTate
Copy link

@gijsk I think you're right.

So the basic structure of a typical MathJax element is:

<mjx-container class="MathJax CtxtMenu_Attached_0" ...
    <mjx-math aria-hidden="true" ...
    <mjx-assistive-mml ...

The mjx-math node is removed by _isProbablyVisible() as you note but the sibling mjx-assistive-mml and parent mjx-container nodes aren't. It seems the parent mjx-container is removed later in the _grabArticle() loop because it is recognized as an "unlikely candidate." That's because its class name matches REGEXPS.unlikelyCandidates on the substring menu.

readability/Readability.js

Lines 1117 to 1128 in 118f015

if (
this.REGEXPS.unlikelyCandidates.test(matchString) &&
!this.REGEXPS.okMaybeItsACandidate.test(matchString) &&
!this._hasAncestorTag(node, "table") &&
!this._hasAncestorTag(node, "code") &&
node.tagName !== "BODY" &&
node.tagName !== "A"
) {
this.log("Removing unlikely candidate - " + matchString);
node = this._removeAndGetNext(node);
continue;
}

This is where it gets weird. Somehow when Readability removes the mjx-math nodes it doesn't seem to affect the rendered equations and I'm not sure why. As long as Readability retains mjx-container > mjx-assistive-mml, the rendered equation remains. But if you visit the input URL and remove mjx-math manually, the equation disappears. At the end of the day this isn't really relevant because it's easy enough to add an exception for the tag in _isProbablyVisible().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants