Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle en dashes (–) and em dashes (—) in titles #954

Open
robinmetral opened this issue Feb 15, 2025 · 0 comments
Open

Handle en dashes (–) and em dashes (—) in titles #954

robinmetral opened this issue Feb 15, 2025 · 0 comments

Comments

@robinmetral
Copy link

To determine an article's title1, the first thing that Readability does is to attempt to split the document's <title> by a separator and keep the first part:

readability/Readability.js

Lines 595 to 598 in 118f015

// If there's a separator in the title, first remove the final part
if (/ [\|\-\\\/>»] /.test(curTitle)) {
titleHadHierarchicalSeparators = / [\\\/>»] /.test(curTitle);
curTitle = origTitle.replace(/(.*)[\|\-\\\/>»] .*/gi, "$1");

This is a good heuristic and seems to extract the right title for a variety of documents. However the separators only cover the following regex: / [\|\-\\\/>»] /. This means that the following separator patterns are covered:

  • Title | Site name
  • Title - Site name
  • Title \ Site name
  • Title > Site name
  • Title » Site name

Goal

The heuristic described above leaves out some other relatively common separators. The ones I'm seeing most often are en and em dashes:

  • Title – Site name
  • Title — Site name

I'd like to add support for these two separators. This would be a two-line change and wouldn't change any of the existing logic.

Here are example titles that this would fix:2

  • Your weekly Web Directions reading – Web Directions (article)
  • Partisan Politics and the Road to Plutocracy – Economics from the Top Down (article)
  • Pluralistic: Premature Internet Activists (13 Feb 2025) – Pluralistic: Daily links from Cory Doctorow (article)3

Can I open a PR?

Non-goals (unless you want them to be)

  • Handling cases where em dashes are used as separators without spaces: Title—Site name. This is relatively common (example 1, example 2), but it would require treating an em dash differently from all other separators (a hyphen without spaces around cannot be treated as a separator, it would split hyphenated words). If you don't mind the extra complexity, I'm happy to cover this as well.
  • Handling less common separators. For example, I've seen various types of interpuncts being used (·, , , , etc.), but I don't expect Readability to cover all of these since the list could become endless (though it could consider covering the most common, perhaps &middot; and &bullet;—I leave this up to you).

Footnotes

  1. Unless there is explicit title metadata in the document: https://github.com/mozilla/readability/blob/118f01538e167218bd86ffd493bd3466aec4870a/Readability.js#L1813-L1815

  2. These are all using en dashes but I think including em dashes makes sense as well. There may not be many English-language documents using em dashes with spaces around them, but that is a common thing in French for example.

  3. Readability currently thinks that the title here is Daily links from Cory Doctorow, which is the publication name. This happens because of the semicolons: if Readability doesn't find a separator but finds a semicolon instead, it takes whatever is after the last semicolon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant