TeX input "splitting up" \mathXX{foo} #2595

Closed
pkra opened this issue Dec 30, 2020 · 12 comments
Labels
Accepted Issue has been reproduced by MathJax team Code Example Contains an illustrative code example, solution, or work-around Feature Request Fixed Test Needed v3 v3.1

@pkra
Contributor

pkra commented Dec 30, 2020

As per discussion with @zorkow, filing this here.

@davidmjones recently pointed out to me that (from a TeX perspective), things like \mathrm{foo} and \mathbf{foo} and \mathsf{foo}, etc., shouldn't be split up into individual characters.

Avoiding this would be a nice improvement for accessibility (and situations with missing glyphs and their shaping maybe as well).

@dpvc
Member

dpvc commented Dec 30, 2020

I'm not sure I understand. In LaTeX, \mathrm{x+y} is displayed with the proper spacing around the +, for example, and that can't be done unless the characters are interpreted as math. My understanding is that \mathbf changes some of the math fonts, but does not otherwise alter the way the math is processed. Of course \textrm{x+y} is treated as a single text item, and the plus will not have extra space around it. Perhaps that is what you want?

I also don't understand the comment about missing glyphs and their shaping. Sorry!

@davidmjones

In my experience, your example is unrepresentative. (In my opinion, it's also perverse, but my argument doesn't depend on that.) In a typical use of \mathrm, \mathit, \mathbf, \mathsf, etc., it's clear that the argument to the macro is meant to be treated as a single token. If I write

\mathrm{area} = \mathrm{length} \times \mathrm{width}

I expect it to be vocalized as "area equals length times width", not as "ay ar ee ay equals...". If I wanted that second reading, I would write

\mathrm{a}\mathrm{r}\mathrm{e}\mathrm{a} = ....

I can provide more examples from real documents if that would be helpful.

@dpvc
Member

dpvc commented Jan 2, 2021

OK, @davidmjones, thanks for the additional information. I understand your viewpoint, and I believe you that many people use the macros in this way. I did some Googling and while the vast majority of uses are for a single letter where it doesn't matter, I do see this usage in the wild. But I also do see other usage, such as \mathbf {^{208}Pb}, \mathbf {Au+Au}, and \mathbf{v\times u} along with more reasonable things like \mathbf{\Gamma} and \mathbf{\hat x}. These all require processing the contents as math (as the name \mathbf indicates it will do). And while you may consider most of these to be "perverse", I can't think of another reasonable way to obtain \mathbf{\hat x} so that the hat is the proper bold one.

My suggestion, then, would be to make these macros check the contents, and if it is a string of letters (where what counts as a letter is configurable), then enclose it in an <mi>, otherwise process it as math as is currently done. That should mean that existing math will not break, but your use would produce the desired output.

In the meantime, you could use

\newcommand{\mathMi}[1]{\mmlToken{mi}[mathvariant=#1]}
\renewcommand{\mathrm}{\mathMi{normal}}
\renewcommand{\mathbf}{\mathMi{bold}}
\renewcommand{\mathsf}{\mathMi{sans-serif}}
...

or the equivalent in the macros list of the tex block in your MathJax configuration (or 'Macros' and 'TeX' in v2).
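As a rough sketch of that configuration route, the same definitions might look like this in a v3 `tex.macros` block. The name `mathJaxConfig` is only illustrative (in a page the object would be assigned to `window.MathJax` before MathJax loads), and only the three fonts from the `\renewcommand` lines above are listed.

```javascript
// Sketch of a MathJax v3 configuration mirroring the \newcommand lines.
// `mathJaxConfig` is an illustrative name; in a real page, assign this
// object to window.MathJax before loading MathJax.
const mathJaxConfig = {
  tex: {
    macros: {
      // \mathMi{variant} expands to \mmlToken{mi}[mathvariant=variant],
      // which then takes the following braced group as the token's text.
      mathMi: ['\\mmlToken{mi}[mathvariant=#1]', 1],
      mathrm: '\\mathMi{normal}',
      mathbf: '\\mathMi{bold}',
      mathsf: '\\mathMi{sans-serif}'
    }
  }
};
```

With these definitions the whole argument becomes the text of one `<mi>`, so this workaround only suits arguments that are plain identifiers.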

@davidmjones

To my dismay, it's extremely hard to distill the point I want to make into a simple statement without getting bogged down in endless exceptions and special cases. So, in case it's not obvious already, let me state up front that I know there is no perfect solution when you try to assign semantics based on visual markup.

Nevertheless, I think there is a clear pattern of using these macros to encode identifiers based on natural language, and I think it's important to try to preserve those semantics where possible for the benefit of the text-to-speech engine. I also think there are some relatively simple heuristics that would capture most of those patterns without breaking anything.

These all require processing the contents as math (as the name \mathbf indicates it will do).

Right. I never meant to imply otherwise. In TeX, these macros only affect letters and digits, so those are the only characters that require special handling. [Technically, it applies to any character whose math class is 7, which also includes Greek symbols like \alpha, \Gamma, etc., but I don't think those should be treated as letters.]

BTW, that's why something like \mathrm{x + y} strikes me as perverse: It has no effect on the +. An author might write \mathbf{x + y} expecting the + to become bold, but they would be disappointed.

My suggestion, then, would be to make these macros check the contents, and if it is a string of letters (where what counts as a letter is configurable), then enclose it in an <mi>, otherwise process it as math as is currently done.

That would be a huge improvement, but my suggestion would be to combine into a single <mi> element any sequence of math Ord atoms with the following properties:

  1. They are contained in the argument of a single \mathrm, \mathit, \mathsf, \mathtt, or \mathbf.
  2. The nucleus of each Ord atom consists of a Unicode letter.
  3. The subscript and superscript fields are empty.
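
To make the three conditions concrete, here is a small standalone sketch in plain JavaScript (not MathJax code). The atom shape `{nucleus, sub, sup}` and the name `mergeLetterAtoms` are invented for illustration, and the input is assumed to be the list of Ord atoms from the argument of a single font macro, which covers condition 1.

```javascript
// Hypothetical atom representation: {nucleus: 'a', sub: null, sup: null}.
// Merges runs of consecutive atoms whose nucleus is a single Unicode letter
// and whose subscript and superscript fields are empty.
function mergeLetterAtoms(atoms) {
  const isPlainLetter = (a) =>
    /^\p{L}$/u.test(a.nucleus) && a.sub == null && a.sup == null;
  const result = [];
  let run = '';
  // Emit the letters collected so far as one merged atom.
  const flush = () => {
    if (run) result.push({nucleus: run, sub: null, sup: null});
    run = '';
  };
  for (const atom of atoms) {
    if (isPlainLetter(atom)) {
      run += atom.nucleus;
    } else {
      flush();
      result.push(atom);
    }
  }
  flush();
  return result;
}

// \mathrm{inch^3}: the superscript is attached to the final "h" atom.
const inch3 = [
  {nucleus: 'i', sub: null, sup: null},
  {nucleus: 'n', sub: null, sup: null},
  {nucleus: 'c', sub: null, sup: null},
  {nucleus: 'h', sub: null, sup: '3'}
];
console.log(mergeLetterAtoms(inch3).map((a) => a.nucleus)); // [ 'inc', 'h' ]
```

Read literally, condition 3 keeps an atom that carries a script out of the merged run, which is why the "h" stays separate here.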

This doesn't apply to things like \mathcal, \mathfrak or \mathbb since those are explicitly used to access distinct Mathematical Alphanumeric Symbols (to use the Unicode terminology), not a specific text font face. I left out \mathbfit because in the sample I looked at it's too rare to draw any conclusions about.

@dpvc
Member

dpvc commented Jan 19, 2021

Here is a configuration that implements the proposal I made above:

MathJax = {
  tex: {packages: {'[+]': ['math-fonts']}},
  startup: {
    ready() {
      //
      //  These would be replaced by import commands if you wanted to make
      //  a proper extension.
      //
      const {Configuration} = MathJax._.input.tex.Configuration;
      const {CommandMap} = MathJax._.input.tex.SymbolMap;
      const BaseMethods = MathJax._.input.tex.base.BaseMethods.default;
      const TexParser = MathJax._.input.tex.TexParser.default;

      //
      //  Remap \mathrm, etc. to be able to create single <mi> elements
      //
      new CommandMap('math-fonts', {
        mathrm: ['MathFont', 'normal'],
        mathbf: ['MathFont', 'bold'],
        mathit: ['MathFont', '-tex-mathit'],  // internal variant for text italic font
        mathsf: ['MathFont', 'sans-serif'],
        mathtt: ['MathFont', 'monospace']
      }, {
        MathFont(parser, name, variant) {
          const text = parser.GetArgument(name);
          //
          //  Check if the argument is a string of letters only
          //     Make a single <mi> of them if so, otherwise
          //     Parse the argument as normal.
          //
          if (text.match(/^[a-z]+$/i)) {
            parser.Push(parser.create('token', 'mi', {mathvariant: variant}, text));
          } else {
            let mml = new TexParser(text, {...parser.stack.env, font: variant}, parser.configuration).mml();
            if (mml.isKind('inferredMrow')) {
              mml = parser.create('node', 'mrow', mml.childNodes);
            }
            parser.Push(mml);
          }
        }
      });
      Configuration.create('math-fonts', {
        handler: {macro: ['math-fonts']}
      });

      MathJax.startup.defaultReady();
    }
  }
}

in case you want to try that out.

Technically, it applies to any character whose math class is 7, which also includes Greek symbols like \alpha, \Gamma, etc.,

[Actually, I don't think it applies to the lower-case Greek letters, only the upper-case ones. Because I know that this is how \mathbf works, and I know that + is not class 7, it makes sense to me to use \mathbf{x + y} and not expect the + to be in bold. But I understand that not everyone knows those details.]

my suggestion would be to combine into a single <mi> element any sequence of math Ord atoms with the following properties:...

MathJax's parsing does not produce TeX math lists, and so this characterization is not natural within MathJax. It would require significant changes to the parser to be able to accomplish it, and while one might suggest trying to combine nodes returned in the mml variable above, that would be a rather fragile approach, as there is no indication of where the nodes came from, or if there was any spacing, etc. So \mathbf{x y} would produce <mi>xy</mi> which seems inappropriate.

Can you give an example of where your algorithm would be needed (in place of the one I give above)?

This doesn't apply to things like \mathcal, \mathfrak or \mathbb ...

I'm a bit concerned about the inconsistency of having some of these macros combine characters into one <mi> and others not. I'm not sure I buy the argument about the Math Alphanumerics block, because MathML doesn't have separate text and math fonts. That is, <mi mathvariant="bold">A</mi> is supposed to be treated identically to <mi>&#x1D400;</mi>, and so when \mathbf{ABC} produces <mi mathvariant="bold">ABC</mi>, it is also producing values in the Math Alphanumeric block. Why should that be different for any other characters in that block? Why shouldn't \mathbb{ABC} produce <mi mathvariant="double-struck">ABC</mi> which is equivalent to <mi>&#x1D538;&#x1D539;&#x2102;</mi> (since double-struck C is in the Letterlike Symbols block, not the Math Alphanumerics)?

TeX doesn't have a separate "math bold" and "text bold" font (they are labeled "text fonts" in Appendix F of the TeXbook, so I guess Knuth considered them text fonts); the only distinction is between math italics (cmmi) and text italics (cmti). While Unicode does have a distinction (the text font being in the usual ASCII range and the math font in the Math Alphanumerics block), MathML doesn't give a natural means of accessing the text versions (as I describe above). So while TeX thinks of bold as a text font, MathML thinks of it as a math font.

Similarly, MathJax doesn't have separate text and math fonts, except for italics, where it uses a special internal math variant to handle the text italics. So when you use \mathbf{ABC} you will be getting the Math Alphanumeric versions.
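
The equivalence between mathvariant attributes and Math Alphanumeric characters described above can be checked with a short standalone sketch. The helper names `toMathBold` and `toDoubleStruck` are invented for the example and are not part of MathJax; the code points are those of the Unicode Mathematical Alphanumeric Symbols block, with the double-struck capitals that live in Letterlike Symbols handled as exceptions. Only ASCII letters are covered, and only capitals in the double-struck case.

```javascript
// Map ASCII letters to Mathematical Bold (U+1D400 for A, U+1D41A for a).
function toMathBold(text) {
  return Array.from(text, (ch) => {
    const c = ch.codePointAt(0);
    if (c >= 0x41 && c <= 0x5A) return String.fromCodePoint(0x1D400 + c - 0x41);
    if (c >= 0x61 && c <= 0x7A) return String.fromCodePoint(0x1D41A + c - 0x61);
    return ch;  // anything else is left unchanged
  }).join('');
}

// Double-struck capitals start at U+1D538, but C, H, N, P, Q, R and Z
// are encoded in the Letterlike Symbols block instead.
const DOUBLE_STRUCK_EXCEPTIONS = {
  C: '\u2102', H: '\u210D', N: '\u2115', P: '\u2119',
  Q: '\u211A', R: '\u211D', Z: '\u2124'
};

function toDoubleStruck(text) {
  return Array.from(text, (ch) => {
    if (DOUBLE_STRUCK_EXCEPTIONS[ch]) return DOUBLE_STRUCK_EXCEPTIONS[ch];
    const c = ch.codePointAt(0);
    if (c >= 0x41 && c <= 0x5A) return String.fromCodePoint(0x1D538 + c - 0x41);
    return ch;  // lower-case letters and other characters are left unchanged
  }).join('');
}

console.log(toMathBold('ABC') === '\u{1D400}\u{1D401}\u{1D402}');      // true
console.log(toDoubleStruck('ABC') === '\u{1D538}\u{1D539}\u2102');     // true
```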

@davidmjones

Coincidentally, I've been rereading the unicode-math documentation, which I had forgotten spends quite a bit of space discussing exactly these issues. See especially sections 3.1 and 4.4, but also parts of section 5 in http://mirrors.ctan.org/macros/unicodetex/latex/unicode-math/unicode-math.pdf.

Technically, it applies to any character whose math class is 7, which also includes Greek symbols like \alpha, \Gamma, etc.,

[Actually, I don't think it applies to the lower-case Greek letters, only the upper-case ones.

Yes, listing \alpha was a mistake.

Can you give an example of where your algorithm would be needed (in place of the one I give above)?

Needed? No, as long as the user knows what they are doing and uses the commands carefully, I think your solution is probably sufficient.

This doesn't apply to things like \mathcal, \mathfrak or \mathbb ...

I'm a bit concerned about the inconsistency of having some of these macros combine characters into one <mi> and others not.

Yup. It's a mess. Damn Knuth for not anticipating Unicode when he designed TeX and the Computer Modern fonts in the mid 70s. :)

Similarly, MathJax doesn't have separate text and math fonts, except for italics, where it uses a special internal math variant to handle the text italics. So when you use \mathbf{ABC} you will be getting the Math Alphanumeric versions.

Interesting. I don't think I knew that. FWIW, here's what various alphabets give you by default:

\documentclass{article}

\usepackage{unicode-math}
\setmainfont{STIX Two Text}
\setmathfont{STIX Two Math}

\loggingoutput

\begin{document}

\textbf{a}          % STIXTwoText U+0061

$a$                 % STIXTwoMath U+1D44E

$\mathrm{a}$        % STIXTwoMath U+0061

$\mathbf{a}$        % cmbx10 U+0061 (surely a bug)

$\symbf{a}$         % STIXTwoMath U+1D41A

\end{document}

Like I said, it's a mess.

@davidmjones

Can you give an example of where your algorithm would be needed (in place of the one I give above)?

Needed? No, as long as the user knows what they are doing and uses the commands carefully, I think your solution is probably sufficient.

I shouldn't have folded so quickly. Originally I had a couple of cases in mind. First, something like

\mathrm{area = length \times width}

That's easily taken care of by recoding it the way I originally coded it above.

Since I'm monolingual and the AMS publishes almost exclusively in English, I can't come up with any examples of the other use case, but I can imagine someone wanting to use an identifier with letters outside of the ASCII range. I think that's what @pkra had in mind when he mentioned missing glyphs and shaping at the top.

@dpvc
Member

dpvc commented Jan 21, 2021

$\mathbf{a}$ % cmbx10 U+0061 (surely a bug)

Of course, that was the one I really wanted to see. :-)

I've been rereading the unicode-math documentation

That link was very useful, thank you. I'm thinking about how best to incorporate that information into MathJax.

I can imagine someone wanting to use an identifier with letters outside of the ASCII range

Absolutely. I hard-coded the pattern in the example above, but it would be a configurable value if it is to be included in MathJax itself, so those using other languages could include the characters they need.
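
As a rough illustration of what a configurable pattern could look like, here is a standalone sketch. The helper name `isSingleIdentifier` and the two pattern constants are made up for this example and are not MathJax options; the first pattern is the hard-coded ASCII one from the configuration above, and the second uses a Unicode property escape to accept any letter.

```javascript
// Hypothetical helper: decide whether a font macro's argument should become
// a single <mi>. The pattern is a parameter so it can be widened beyond ASCII.
function isSingleIdentifier(text, letters = /^[a-z]+$/i) {
  return letters.test(text);
}

const asciiOnly = /^[a-z]+$/i;   // the pattern hard-coded in the example config
const anyLetter = /^\p{L}+$/u;   // any run of Unicode letters

console.log(isSingleIdentifier('area', asciiOnly));         // true
console.log(isSingleIdentifier('Fl\u00E4che', asciiOnly));  // false
console.log(isSingleIdentifier('Fl\u00E4che', anyLetter));  // true
console.log(isSingleIdentifier('x+y', anyLetter));          // false
```

Note that `\p{L}` matches precomposed letters only; input that uses combining marks would also need `\p{M}` in the pattern.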

@davidmjones

$\mathbf{a}$ % cmbx10 U+0061 (surely a bug)

Of course, that was the one I really wanted to see. :-)

To be clear, the bug is that it was using cmbx10, not that it was mapping the character to U+0061; that's the expected behaviour from the documentation.

It did inspire me to make a more thorough catalog of the math alphabets supported by the unicode-math package, though. Here's the result: mathalpha.pdf. It makes for interesting if somewhat maddening reading.

I've been rereading the unicode-math documentation

That link was very useful, thank you. I'm thinking about how best to incorporate that information into MathJax.

@pkra and I have been working on a MathJax extension to support the unicode-math package. It's in a private repo at the moment, but we hope to make a beta version public soon. Maybe that would be a good place to experiment with the math alphabet support?

@dpvc
Member

dpvc commented Jan 28, 2021

Volker pointed out to me a suggestion for how to do something more like what you have suggested in terms of grouping multiple letters together. Here is an implementation for that:

MathJax = {
  tex: {packages: {'[+]': ['math-fonts']}},
  startup: {
    ready() {
      //
      //  These would be replaced by import commands if you wanted to make
      //  a proper extension.
      //
      const {Configuration} = MathJax._.input.tex.Configuration;
      const {CommandMap, RegExpMap} = MathJax._.input.tex.SymbolMap;
      const TexParser = MathJax._.input.tex.TexParser.default;
      const ParseMethods = MathJax._.input.tex.ParseMethods.default;

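      //
      //  Character handler for letters:  inside \mathrm and friends (when
      //  multiLetterIdentifiers is set), grab the whole run of letters
      //  starting at the current one; otherwise handle the single letter.
      //
      new RegExpMap('multi-letter', function (parser, c) {
        if (parser.stack.env.multiLetterIdentifiers) {
          c = parser.string.substr(parser.i-1).match(/^[a-z]+/i)[0];
        }
        ParseMethods.variable(parser, c);
        parser.i += c.length - 1;  // skip the extra letters that were consumed
      }, /[a-z]/i);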

      new CommandMap('math-fonts', {
        mathrm: ['MathFont', 'normal'],
        mathbf: ['MathFont', 'bold'],
        mathit: ['MathFont', '-tex-mathit'],  // internal variant for text italic font
        mathsf: ['MathFont', 'sans-serif'],
        mathtt: ['MathFont', 'monospace']
      }, {
        MathFont(parser, name, variant) {
          const text = parser.GetArgument(name);
          //
          //  Turn on multi-letter identifiers while the argument is parsed
          //  with the requested font, then restore the previous setting.
          //
          const old = parser.stack.env.multiLetterIdentifiers;
          parser.stack.env.multiLetterIdentifiers = true;
          let mml = new TexParser(text, {...parser.stack.env, font: variant}, parser.configuration).mml();
          if (!old) {
            delete parser.stack.env.multiLetterIdentifiers;
          }
          if (mml.isKind('inferredMrow')) {
            mml = parser.create('node', 'mrow', mml.childNodes);
          }
          parser.Push(mml);
        }
      });
      const mathFonts = Configuration.create('math-fonts', {
        handler: {
          character: ['multi-letter'],
          macro: ['math-fonts'],
        }
      });

      MathJax.startup.defaultReady();
    }
  }
}

This adds a character map that (conditionally) turns on multi-character identifiers, so that within \mathbf{} and the others, multiple letters will be combined into a single identifier, while still processing everything else as normal. So \mathrm{area = length \times width} would produce

<mi mathvariant="normal">area</mi>
<mo mathvariant="normal">=</mo>
<mi mathvariant="normal">length</mi>
<mo mathvariant="normal">&#xD7;</mo>
<mi mathvariant="normal">width</mi>

This is slightly different from what you suggest, in that \mathrm{inch^3} will produce

<msup>
  <mi mathvariant="normal">inch</mi>
  <mn>3</mn>
</msup>

rather than the

<mi mathvariant="normal">inc</mi>
<msup>
  <mi mathvariant="normal">h</mi>
  <mn>3</mn>
</msup>

that your algorithm would produce, and something like

\newcommand{\a}{a}
\mathbf{a\a}

will produce

<mi mathvariant="bold">a</mi>
<mi mathvariant="bold">a</mi>

rather than

<mi mathvariant="bold">aa</mi>

that your approach would produce.

Anyway, it turns out that your area example can be handled reasonably.
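
For readers who want to see the letter-grouping step in isolation, here is a simplified stand-in for the character handler, written as plain JavaScript outside MathJax. The function name `groupLetters` and its `multiLetter` flag are invented for the sketch; the flag plays the role of `multiLetterIdentifiers` in the configuration above.

```javascript
// Simplified stand-in for the character handler: split a string into
// tokens, grouping runs of ASCII letters only when `multiLetter` is true.
// Non-letters are emitted one character at a time and whitespace is
// skipped; control sequences such as \times are not handled here.
function groupLetters(input, multiLetter) {
  const tokens = [];
  let i = 0;
  while (i < input.length) {
    let c = input[i];
    if (/[a-z]/i.test(c) && multiLetter) {
      // Take the whole run of letters starting at the current position.
      c = input.slice(i).match(/^[a-z]+/i)[0];
    }
    if (!/\s/.test(c)) tokens.push(c);
    i += c.length;
  }
  return tokens;
}

console.log(groupLetters('inch^3', true));   // [ 'inch', '^', '3' ]
console.log(groupLetters('inch^3', false));  // [ 'i', 'n', 'c', 'h', '^', '3' ]
```

Because the letters are grouped before the superscript is seen, the `^3` attaches to the whole identifier, which matches the `<msup>` output shown above for \mathrm{inch^3}.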

@davidmjones

Thank you for this. I haven't had a chance to take a close look at it or try it out yet, but I wanted to comment on this part:

This is slightly different from what you suggest, in that \mathrm{inch^3} will produce

<msup>
  <mi mathvariant="normal">inch</mi>
  <mn>3</mn>
</msup>

rather than the

<mi mathvariant="normal">inc</mi>
<msup>
  <mi mathvariant="normal">h</mi>
  <mn>3</mn>
</msup>

that your algorithm would produce,

If that's what my algorithm would produce, my algorithm was clearly wrong.

and something like

\newcommand{\a}{a}
\mathbf{a\a}

will produce

<mi mathvariant="bold">a</mi>
<mi mathvariant="bold">a</mi>

rather than

<mi mathvariant="bold">aa</mi>

that your approach would produce.

Fair enough. That's weird enough that I'm not too worried about how it comes out.

dpvc added a commit to mathjax/MathJax-src that referenced this issue Mar 31, 2021
…e multi-letter <mi> elements that are not auto-converted to OP elements. (mathjax/MathJax#2595)
@dpvc dpvc added Accepted Issue has been reproduced by MathJax team Ready for Review Test Needed v3 labels Mar 31, 2021
@dpvc dpvc added this to the 3.1.3 milestone Mar 31, 2021
@dpvc
Member

dpvc commented Mar 31, 2021

I've made a PR to implement the solution above, and added the remaining \math* and the \sym* macros. This allows easy access from TeX to all the MathML variants, which was not the case before.

@dpvc dpvc added the Code Example Contains an illustrative code example, solution, or work-around label Apr 1, 2021
dpvc added a commit to mathjax/MathJax-src that referenced this issue Apr 20, 2021
Add support for all \mathXYZ and \symXYZ macros using multi-letter <mi>.  (mathjax/MathJax#2595)
@dpvc dpvc added Merged Merged into develop branch and removed Ready for Review labels Apr 20, 2021

3 participants