To determine an article's title [1], the first thing Readability does is attempt to split the document's <title> by a separator and keep the first part (readability/Readability.js, lines 595 to 598 at commit 118f015).
This is a good heuristic and seems to extract the right title for a variety of documents. However, the separators are limited to those matched by the regex / [\|\-\\\/>»] /, that is, one of the characters |, -, \, /, > or » with a space on each side. This means that the following separator patterns are covered:
Title | Site name
Title - Site name
Title \ Site name
Title / Site name
Title > Site name
Title » Site name
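The split described above can be sketched roughly as follows. This is a simplified illustration, not the actual Readability source: the helper name `splitTitle` is hypothetical, and the real code applies further checks that are left out here.

```javascript
// Simplified sketch of the separator split; NOT the actual Readability source.
// The character class is the one quoted above: | - \ / > »
const SEPARATOR = / [\|\-\\\/>»] /;

// Hypothetical helper: returns the text before the last spaced separator,
// or the (trimmed) original title when no separator is found.
function splitTitle(docTitle) {
  if (!SEPARATOR.test(docTitle)) {
    return docTitle.trim();
  }
  // The greedy (.*) keeps everything up to the last separator; for a title
  // with a single separator this is simply the first part.
  return docTitle.replace(/(.*) [\|\-\\\/>»] .*/, "$1").trim();
}

console.log(splitTitle("Title | Site name")); // "Title"
console.log(splitTitle("Title » Site name")); // "Title"
// An en dash (U+2013) is not in the character class, so nothing is split:
console.log(splitTitle("Title \u2013 Site name") === "Title \u2013 Site name"); // true
```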
Goal
The heuristic described above leaves out some other relatively common separators. The ones I'm seeing most often are en and em dashes:
Title – Site name
Title — Site name
I'd like to add support for these two separators. This would be a two-line change and wouldn't change any of the existing logic.
Here are example titles that this would fix: [2]
Your weekly Web Directions reading – Web Directions (article)
Partisan Politics and the Road to Plutocracy – Economics from the Top Down (article)
Pluralistic: Premature Internet Activists (13 Feb 2025) – Pluralistic: Daily links from Cory Doctorow (article) [3]
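As a rough sketch of the proposed change, the en dash (U+2013) and em dash (U+2014) can be added to the character class. The function name `splitTitleExtended` is hypothetical and the surrounding logic is simplified; only the two extra characters reflect the proposal.

```javascript
// Sketch of the proposed separator set: the existing characters plus
// en dash (\u2013) and em dash (\u2014). Simplified; not the Readability source.
const EXTENDED_SEPARATOR = / [\|\-\\\/>»\u2013\u2014] /;

// Hypothetical helper mirroring the existing split with the wider class.
function splitTitleExtended(docTitle) {
  if (!EXTENDED_SEPARATOR.test(docTitle)) {
    return docTitle.trim();
  }
  return docTitle.replace(/(.*) [\|\-\\\/>»\u2013\u2014] .*/, "$1").trim();
}

const exampleTitles = [
  "Your weekly Web Directions reading \u2013 Web Directions",
  "Partisan Politics and the Road to Plutocracy \u2013 Economics from the Top Down",
  "Pluralistic: Premature Internet Activists (13 Feb 2025) \u2013 Pluralistic: Daily links from Cory Doctorow",
];

for (const title of exampleTitles) {
  console.log(splitTitleExtended(title));
}
// Your weekly Web Directions reading
// Partisan Politics and the Road to Plutocracy
// Pluralistic: Premature Internet Activists (13 Feb 2025)
```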
Can I open a PR?
Non-goals (unless you want them to be)
Handling cases where em dashes are used as separators without spaces: Title—Site name. This is relatively common (example 1, example 2), but it would require treating an em dash differently from all other separators: a hyphen without spaces around it cannot be treated as a separator, because it would split hyphenated words. If you don't mind the extra complexity, I'm happy to cover this as well.
Handling less common separators. For example, I've seen various types of interpuncts being used (·, •, ∙, ◦, etc.), but I don't expect Readability to cover all of these since the list could become endless. It could consider covering the most common ones, perhaps · and •; I leave this up to you.
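To give an idea of the extra complexity the first non-goal would involve, here is one possible sketch in which an em dash is accepted with or without surrounding spaces while every other separator still requires them. The name `splitTitleWithBareEmDash` is hypothetical and this is only one way the special case could be written.

```javascript
// Sketch only: spaced separators (existing set plus en/em dash), OR an
// em dash (\u2014) with no spaces around it. A bare hyphen is deliberately
// not accepted, since that would split hyphenated words.
const SEPARATOR_OR_BARE_EM_DASH =
  /(.*)(?: [\|\-\\\/>»\u2013\u2014] |\u2014).*/;

// Hypothetical helper: returns the title unchanged when nothing matches.
function splitTitleWithBareEmDash(docTitle) {
  return docTitle.replace(SEPARATOR_OR_BARE_EM_DASH, "$1").trim();
}

console.log(splitTitleWithBareEmDash("Title\u2014Site name"));            // "Title"
console.log(splitTitleWithBareEmDash("Title \u2014 Site name"));          // "Title"
console.log(splitTitleWithBareEmDash("Well-known title\u2014Site name")); // "Well-known title"
console.log(splitTitleWithBareEmDash("Well-known title"));                // "Well-known title"
```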
Footnotes
[1] Unless there is explicit title metadata in the document: https://github.com/mozilla/readability/blob/118f01538e167218bd86ffd493bd3466aec4870a/Readability.js#L1813-L1815
[2] These are all using en dashes, but I think including em dashes makes sense as well. There may not be many English-language documents using em dashes with spaces around them, but that is common in French, for example.
[3] Readability currently thinks that the title here is Daily links from Cory Doctorow, which is the publication name. This happens because of the colons: if Readability doesn't find a separator but finds a colon instead, it takes whatever is after the last colon.
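The fallback mentioned in footnote 3 can be illustrated with a small sketch. It assumes the fallback is triggered by a colon followed by a space and keeps the text after the last colon; the helper name `colonFallback` is hypothetical, and the real code may apply additional checks before settling on this result.

```javascript
// Sketch of the colon fallback, used when no spaced separator is found.
// Assumption: the fallback triggers on ": " and keeps the text after the
// last colon. Simplified; not the Readability source.
function colonFallback(docTitle) {
  if (!docTitle.includes(": ")) {
    return docTitle.trim();
  }
  return docTitle.substring(docTitle.lastIndexOf(":") + 1).trim();
}

// The en dash (\u2013) is not a recognised separator today, so this title
// falls through to the colon logic and loses the article name:
const pluralistic =
  "Pluralistic: Premature Internet Activists (13 Feb 2025) \u2013 Pluralistic: Daily links from Cory Doctorow";
console.log(colonFallback(pluralistic)); // "Daily links from Cory Doctorow"
```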