Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Preserve custom element tags #952

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ The `options` object accepts a number of properties, all optional:
* `serializer` (function, default `el => el.innerHTML`) controls how the `content` property returned by the `parse()` method is produced from the root DOM element. It may be useful to specify the `serializer` as the identity function (`el => el`) to obtain a DOM element instead of a string for `content` if you plan to process it further.
* `allowedVideoRegex` (RegExp, default `undefined` ): a regular expression that matches video URLs that should be allowed to be included in the article content. If `undefined`, the [default regex](https://github.com/mozilla/readability/blob/8e8ec27cd2013940bc6f3cc609de10e35a1d9d86/Readability.js#L133) is applied.
* `linkDensityModifier` (number, default `0`): a number that is added to the base link density threshold during the shadiness checks. This can be used to penalize nodes with a high link density or vice versa.
* `tagsToPreserve` (array, default `[]`): a set of tags to preserve on HTML elements additionally to the default set.

### `parse()`

Expand Down
7 changes: 6 additions & 1 deletion Readability.js
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ function Readability(doc, options) {
this._disableJSONLD = !!options.disableJSONLD;
this._allowedVideoRegex = options.allowedVideoRegex || this.REGEXPS.videos;
this._linkDensityModifier = options.linkDensityModifier || 0;
this._tagsToPreserve = options.tagsToPreserve || [];

// Start with all flags set
this._flags =
Expand Down Expand Up @@ -1156,7 +1157,11 @@ Readability.prototype = {
continue;
}

if (this.DEFAULT_TAGS_TO_SCORE.includes(node.tagName)) {
const tagsToScore = [
...this.DEFAULT_TAGS_TO_SCORE,
...this._tagsToPreserve,
];
if (tagsToScore.includes(node.tagName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of this is modeled on classesToPreserve, which functions as an allowlist for classes that are then not removed from the output (otherwise, all class information is removed).

From that, and from the documentation, and from the ticket you filed ("which can result in the loss of important semantic information and custom function") I would expect this new property to ensure that the content in these custom tags is always preserved. But the patch does not do that, it only scores those elements and they may yet get rejected if their score is too low.

So I think either the patch behaviour needs to be changed (so the tags always stay part of the output, which I suspect is non-trivial to organize), or the documentation/explanation/labeling of the property needs to be changed so that it matches the behaviour. Maybe calling it additionalTagsToScore would be better?

Finally, if you keep the current implementation (but perhaps rename the property), if we need to create a list of scorable tag names let's not do it on every iteration inside this loop, as that is expensive and this loop is hot (performance-sensitive) code.

elementsToScore.push(node);
}

Expand Down
1 change: 1 addition & 0 deletions index.d.ts
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ export class Readability<T = string> {
serializer?: (node: Node) => T;
disableJSONLD?: boolean;
allowedVideoRegex?: RegExp;
tagsToPreserve?: string[];
}
);

Expand Down
24 changes: 24 additions & 0 deletions test/test-readability.js
Original file line number Diff line number Diff line change
Expand Up @@ -282,6 +282,14 @@ describe("Readability API", function () {
new Readability(doc, { allowedVideoRegex })._allowedVideoRegex
).eql(allowedVideoRegex);
});

it("should accept a tagsToPreserve option", function () {
expect(new Readability(doc)._tagsToPreserve).eql([]);
expect(
new Readability(doc, { tagsToPreserve: ["my-custom-tag"] })
._tagsToPreserve
).eql(["my-custom-tag"]);
});
});

describe("#parse", function () {
Expand Down Expand Up @@ -356,6 +364,22 @@ describe("Readability API", function () {
}).parse().content;
expect(content).eql(expected_xhtml);
});

it("should use custom tags to preserve sent as option", function () {
var dom = new JSDOM(
"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc mollis leo lacus, vitae semper nisl ullamcorper ut.</p>" +
"<my-custom-tag><p>My Custom Tag</p></my-custom-tag>"
);
var expected_xhtml =
'<div id="readability-page-1" class="page">' +
"<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc mollis leo lacus, vitae semper nisl ullamcorper ut.</p>" +
"<my-custom-tag><p>My Custom Tag</p></my-custom-tag>" +
"</div>";
var content = new Readability(dom.window.document, {
tagsToPreserve: ["my-custom-tag"],
}).parse().content;
expect(content).eql(expected_xhtml);
});
});
});

Expand Down