Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Can newlines in multi-line literals be normalized? #835

Closed
hukkin opened this issue Jul 10, 2021 · 18 comments · Fixed by #842
Closed

Question: Can newlines in multi-line literals be normalized? #835

hukkin opened this issue Jul 10, 2021 · 18 comments · Fixed by #842

Comments

@hukkin
Copy link
Contributor

hukkin commented Jul 10, 2021

The spec only speaks of newline normalization in the context of multi-line basic strings and is in fact quite literal about the fact that "content between the [multi-line literal] delimiters is interpreted as-is without modification".

I'm not an expert on reading ABNF, but the ABNF spec says

mll-content = mll-char / newline

that is, it does not include newline in mll-char (multi-line literal char), making me think perhaps the normalization is, nonetheless, allowed in multi-line literals as well.

I'd much appreciate clarification/someone else's interpretation.

@arp242
Copy link
Contributor

arp242 commented Jul 11, 2021

What exactly do you mean with "normalized"?

ABNF only describes the syntax (that it, what characters are allowed), and doesn't really describe the semantics (i.e. how it's interpreted or processed).

@hukkin
Copy link
Contributor Author

hukkin commented Jul 11, 2021

Normalize as in convert CRLF to LF or vice versa.

@arp242
Copy link
Contributor

arp242 commented Jul 11, 2021

Oh right. That's basically just up to the implementation; from https://toml.io/en/v1.0.0 :

TOML parsers should feel free to normalize newline to whatever makes sense for their platform.

@eksortso
Copy link
Contributor

eksortso commented Jul 11, 2021

@hukkin Good catch. Definitely a subtle point.

My personal take is that what matters most about newlines in MLL strings isn't how they appear in the document but how the resulting string represents them, for two reasons.

Primarily, aside from newline characters, the only control character permitted in an MLL string is the tab. So fine control over the special characters \n and \r ought to be better suited to basic string use, where they can be expressed clearly.

Second, as noted, the precedent for newline normalization is set in the Multiline Basic string definition, which should be repeated in the MLL string definition:

TOML parsers should feel free to normalize newline to whatever makes sense for their platform.

I'm willing to draft a PR to set this straight.

@hukkin
Copy link
Contributor Author

hukkin commented Jul 11, 2021

I'm willing to draft a PR to set this straight.

Would be great, thanks! Should we make the sentence I quoted in the first post (that says "as-is without modification" etc.) mention this exception?

Now that I have your attention 😄 I've got another question related to newline normalization: is a parser allowed to normalize more than once? I'll try to explain what this means with an example where I use \r and \n to denote literal CR and LF characters:

str = '''Here's a newline\r\r\n'''

A parser is allowed to convert the \r\n to \n. We can express the normalized string with the following TOML:

str = '''Here's a newline\r\n'''

And on second parse we're allowed to parse

str = '''Here's a newline\n'''

So yeah the question is, can a parser add/remove an arbitrary amount of CR characters to prefix an LF character. Or is it only allowed to add/remove one? Could or should the spec take a stance in this? Maybe it would be easiest to redefine newline using the regex \r*\n or alternatively allow newline be one of \r (not followed by line feed), \n or \r\n (this is how e.g. CommonMark defines line ending)?

@hukkin
Copy link
Contributor Author

hukkin commented Jul 11, 2021

Oh also, if we allow newline normalization in multi-line literals, I think we may as well generalize it so that newline normalization is allowed globally in the entire TOML document?

@eksortso
Copy link
Contributor

I'm willing to draft a PR to set this straight.

Would be great, thanks! Should we make the sentence I quoted in the first post (that says "as-is without modification" etc.) mention this exception?

I'll need to think about that.

Now that I have your attention 😄 I've got another question related to newline normalization: is a parser allowed to normalize more than once? I'll try to explain what this means with an example where I use \r and \n to denote literal CR and LF characters:

str = '''Here's a newline\r\r\n'''

Gotta stop you there. This example is invalid. The last three characters inside the delimiters are two literal CRs followed by a literal LF... but that first CR is not permitted. It's not part of a CRLF, it isn't a newline on its own, and aside from tabs, no other control code is allowed.

[...]

So yeah the question is, can a parser add/remove an arbitrary amount of CR characters to prefix an LF character. Or is it only allowed to add/remove one? Could or should the spec take a stance in this? Maybe it would be easiest to redefine newline using the regex \r*\n?

Every CR in a literal string must be followed by a LF. So normalization would only happen once at most.

@eksortso
Copy link
Contributor

Oh also, if we allow newline normalization in multi-line literals, I think we may as well generalize it so that newline normalization is allowed globally in the entire TOML document?

That doesn't make sense. The only places within a TOML document where a physical newline is converted into a value containing a newline are multiline strings. Nowhere else would a parser be called upon to normalize newlines.

Arrays may have newlines between values, but those newlines aren't part of the containing array when decoded.

@hukkin
Copy link
Contributor Author

hukkin commented Jul 11, 2021

Gotta stop you there. This example is invalid.

Oh of course, thanks! But what about the same example but with a multi-line basic string? CR is allowed there if I'm not wrong. I'd interpret it so that e.g. in """newline:\r\r\n""" the first \r is not part of a newline, so can't be removed/normalized. But that feels wrong: one would be forced to replace newlines with escaped newlines when roundtripping, or else roundtrip parsing can one by one eat all the \r characters on following passes and change semantic of the string. EDIT: .abnf prohibits CRs in MLBs so this is not a concern

That doesn't make sense. The only places within a TOML document where a physical newline is converted into a value containing a newline are multiline strings. Nowhere else would a parser be called upon to normalize newlines.

Yeah perhaps you're right. The sense to me would be

  1. simpler spec: Only need to mention normalization once (as opposed to once for MLL and once for MLB).
  2. simpler implementation: Makes it obvious that an implementation shortcut where one does a global replace of \r\n with \n before parsing is allowed. No need to worry about CRLF after that. This is what pytoml does and what I've been doing in Tomli (although realized it's probably not allowed to do so in TOML v1.0.0 due to the fact that MLLs are interpreted "as is", without newline normalization).

@hukkin
Copy link
Contributor Author

hukkin commented Jul 30, 2021

I'd much appreciate a maintainer's take on this issue (maybe @pradyunsg ?). Currently my parser does newline normalization in multi-line literals which strictly speaking is a spec violation in TOML v1.0.0. But if the spec is about to change in this regard, then I'd prefer to keep the bug and wait until the spec is in line with it.

@pradyunsg
Copy link
Member

pradyunsg commented Aug 11, 2021

Sorry, I've not had time to look at this properly before today. I've tried to figure out how OP I reading the specification, to determine that newline normalization is not permitted, at least twice before and failed to understand how they're parsing the prose in the specification.

I'm not an expert on reading ABNF, but the ABNF spec says

An ABNF grammar describes what inputs to a parser are acceptable (ignoring semantics). What those inputs translate into as values is not something that it covers -- that is described in the prose.


To answer the question in OP, yes.

I don't think we should make newlines (which are invisible characters) significant somehow. If some parsers want to not normalize them, they're free to. Same is true for the other way round though.

What a TOML parser does with newlines in a multi-line string is their prerogative. I think that the language in the specification says that parsers are free to do whatever they want w.r.t. newline normalization.

Given a process of unnormalised-text -> unnormalised-parsed -> normalised-text -> normalised-parsed; if both the parsed values look the same, there's absolutely no harm in normalising the content. Whether a encoder-decoder pair care about the potential for mismatch between unnormalised-text and normalised-text is a decision for the parser author to make. I don't see any reason what we'd want to disallow or require either approach. I expect most parser authors will take the approach that's easier in their language (eg: using Python's built in normalisation).

Note that this is about CRLF in the document, NOT \r\n within the content, which MUST be preserved.

I'm happy to accept a PR clarifying this and amending the prose to make this clearer. I personally don't know how to do that, because I don't understand OP's source of confusion.

@pradyunsg
Copy link
Member

pradyunsg commented Aug 11, 2021

Given:

str = '''Here's a newline{CR}{CR}{LF}'''

This is currently invalid, as @eksortso noted already.

Assuming #837 happens and you're normalising newlines, you'd parse this as the JSON:

{"str": "Here's a newline\n\n"}

It would be invalid to write this as:

str = '''Here's a newline{CR}{LF}'''

That's because you've changed the value. Reparsing this (again, with normalisation) would yield:

{"str": "Here's a newline\n"}

Notice how they're different? There's a missing newline, and you've changed the semantic meaning of the file.


Valid ways to write the parsed value would be:

str = '''Here's a newline{CR}{CR}{LF}'''
str = '''Here's a newline{CR}{LF}{CR}{LF}'''
str = '''Here's a newline{LF}{LF}'''
str = '''Here's a newline{CR}{CR}'''

I don't care which form the parsers choose. That's a flexibility the specification already provides the parsers. That said, two and three are probably what most parsers will spew out and what text editors (when configured to do so) will normalise newlines to. All of them parse the same when renormalised.

@hukkin
Copy link
Contributor Author

hukkin commented Aug 11, 2021

I don't understand OP's source of confusion.

The source of confusion is that newline normalisation is only mentioned in the context of multi-line basic strings.

When it comes to multi-line literal strings, no newline normalisation is mentioned in the spec. And further to this, the spec says "content between the [multi-line literal] delimiters is interpreted as-is without modification" implying that newline normalisation or any other modifications are prohibited.

So even if you say that newline normalisation is allowed in multi-line literals (and I agree it should be), I'd argue that current TOML spec (v1.0.0) does not agree with us.

@pradyunsg
Copy link
Member

Oh! I think I see it now! You're reading that to include {CR}/{LF} characters within "content" and I'm not.

Let's tweak the wording, to avoid the possibility of reading it that way.

@pradyunsg
Copy link
Member

pradyunsg commented Aug 11, 2021

BTW @hukkin, I believe we're on at least one common Slack (EBP) and you have a mention there. :)

@hukkin
Copy link
Contributor Author

hukkin commented Aug 15, 2021

Went ahead and took a stab at this #842 😸

@eksortso
Copy link
Contributor

Sorry for vanishing lately. Life's been in the way. @hukkin I'll check your PR out tomorrow. Give me 24 hours.

@eksortso
Copy link
Contributor

I left some notes on the PR. My preference is for @pradyunsg's language though. Newline normalization ought to be consistent between MLB strings and MLL strings, and it's made clear and definitive in MLL strings how newlines are treated differently from the rest of the content of the string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants