Question: Can newlines in multi-line literals be normalized? #835

hukkin · 2021-07-10T20:07:20Z

The spec only speaks of newline normalization in the context of multi-line basic strings and is in fact quite literal about the fact that "content between the [multi-line literal] delimiters is interpreted as-is without modification".

I'm not an expert on reading ABNF, but the ABNF spec says

mll-content = mll-char / newline

that is, it does not include newline in mll-char (multi-line literal char), making me think perhaps the normalization is, nonetheless, allowed in multi-line literals as well.

I'd much appreciate clarification/someone else's interpretation.

The text was updated successfully, but these errors were encountered:

arp242 · 2021-07-11T00:27:44Z

What exactly do you mean with "normalized"?

ABNF only describes the syntax (that it, what characters are allowed), and doesn't really describe the semantics (i.e. how it's interpreted or processed).

hukkin · 2021-07-11T00:30:26Z

Normalize as in convert CRLF to LF or vice versa.

arp242 · 2021-07-11T00:40:18Z

Oh right. That's basically just up to the implementation; from https://toml.io/en/v1.0.0 :

TOML parsers should feel free to normalize newline to whatever makes sense for their platform.

eksortso · 2021-07-11T00:47:35Z

@hukkin Good catch. Definitely a subtle point.

My personal take is that what matters most about newlines in MLL strings isn't how they appear in the document but how the resulting string represents them, for two reasons.

Primarily, aside from newline characters, the only control character permitted in an MLL string is the tab. So fine control over the special characters \n and \r ought to be better suited to basic string use, where they can be expressed clearly.

Second, as noted, the precedent for newline normalization is set in the Multiline Basic string definition, which should be repeated in the MLL string definition:

TOML parsers should feel free to normalize newline to whatever makes sense for their platform.

I'm willing to draft a PR to set this straight.

hukkin · 2021-07-11T01:10:36Z

I'm willing to draft a PR to set this straight.

Would be great, thanks! Should we make the sentence I quoted in the first post (that says "as-is without modification" etc.) mention this exception?

Now that I have your attention 😄 I've got another question related to newline normalization: is a parser allowed to normalize more than once? I'll try to explain what this means with an example where I use \r and \n to denote literal CR and LF characters:

str = '''Here's a newline\r\r\n'''

A parser is allowed to convert the \r\n to \n. We can express the normalized string with the following TOML:

str = '''Here's a newline\r\n'''

And on second parse we're allowed to parse

str = '''Here's a newline\n'''

So yeah the question is, can a parser add/remove an arbitrary amount of CR characters to prefix an LF character. Or is it only allowed to add/remove one? Could or should the spec take a stance in this? Maybe it would be easiest to redefine newline using the regex \r*\n or alternatively allow newline be one of \r (not followed by line feed), \n or \r\n (this is how e.g. CommonMark defines line ending)?

hukkin · 2021-07-11T01:27:24Z

Oh also, if we allow newline normalization in multi-line literals, I think we may as well generalize it so that newline normalization is allowed globally in the entire TOML document?

eksortso · 2021-07-11T01:35:49Z

I'm willing to draft a PR to set this straight.

Would be great, thanks! Should we make the sentence I quoted in the first post (that says "as-is without modification" etc.) mention this exception?

I'll need to think about that.

Now that I have your attention 😄 I've got another question related to newline normalization: is a parser allowed to normalize more than once? I'll try to explain what this means with an example where I use \r and \n to denote literal CR and LF characters:
str = '''Here's a newline\r\r\n'''

Gotta stop you there. This example is invalid. The last three characters inside the delimiters are two literal CRs followed by a literal LF... but that first CR is not permitted. It's not part of a CRLF, it isn't a newline on its own, and aside from tabs, no other control code is allowed.

[...]

So yeah the question is, can a parser add/remove an arbitrary amount of CR characters to prefix an LF character. Or is it only allowed to add/remove one? Could or should the spec take a stance in this? Maybe it would be easiest to redefine newline using the regex \r*\n?

Every CR in a literal string must be followed by a LF. So normalization would only happen once at most.

eksortso · 2021-07-11T04:41:57Z

Oh also, if we allow newline normalization in multi-line literals, I think we may as well generalize it so that newline normalization is allowed globally in the entire TOML document?

That doesn't make sense. The only places within a TOML document where a physical newline is converted into a value containing a newline are multiline strings. Nowhere else would a parser be called upon to normalize newlines.

Arrays may have newlines between values, but those newlines aren't part of the containing array when decoded.

hukkin · 2021-07-11T07:37:47Z

Gotta stop you there. This example is invalid.

Oh of course, thanks! But what about the same example but with a multi-line basic string? CR is allowed there if I'm not wrong. I'd interpret it so that e.g. in """newline:\r\r\n""" the first \r is not part of a newline, so can't be removed/normalized. But that feels wrong: one would be forced to replace newlines with escaped newlines when roundtripping, or else roundtrip parsing can one by one eat all the \r characters on following passes and change semantic of the string. EDIT: .abnf prohibits CRs in MLBs so this is not a concern

That doesn't make sense. The only places within a TOML document where a physical newline is converted into a value containing a newline are multiline strings. Nowhere else would a parser be called upon to normalize newlines.

Yeah perhaps you're right. The sense to me would be

simpler spec: Only need to mention normalization once (as opposed to once for MLL and once for MLB).
simpler implementation: Makes it obvious that an implementation shortcut where one does a global replace of \r\n with \n before parsing is allowed. No need to worry about CRLF after that. This is what pytoml does and what I've been doing in Tomli (although realized it's probably not allowed to do so in TOML v1.0.0 due to the fact that MLLs are interpreted "as is", without newline normalization).

hukkin · 2021-07-30T07:46:11Z

I'd much appreciate a maintainer's take on this issue (maybe @pradyunsg ?). Currently my parser does newline normalization in multi-line literals which strictly speaking is a spec violation in TOML v1.0.0. But if the spec is about to change in this regard, then I'd prefer to keep the bug and wait until the spec is in line with it.

pradyunsg · 2021-08-11T18:23:54Z

Sorry, I've not had time to look at this properly before today. I've tried to figure out how OP I reading the specification, to determine that newline normalization is not permitted, at least twice before and failed to understand how they're parsing the prose in the specification.

I'm not an expert on reading ABNF, but the ABNF spec says

An ABNF grammar describes what inputs to a parser are acceptable (ignoring semantics). What those inputs translate into as values is not something that it covers -- that is described in the prose.

To answer the question in OP, yes.

I don't think we should make newlines (which are invisible characters) significant somehow. If some parsers want to not normalize them, they're free to. Same is true for the other way round though.

What a TOML parser does with newlines in a multi-line string is their prerogative. I think that the language in the specification says that parsers are free to do whatever they want w.r.t. newline normalization.

Given a process of unnormalised-text -> unnormalised-parsed -> normalised-text -> normalised-parsed; if both the parsed values look the same, there's absolutely no harm in normalising the content. Whether a encoder-decoder pair care about the potential for mismatch between unnormalised-text and normalised-text is a decision for the parser author to make. I don't see any reason what we'd want to disallow or require either approach. I expect most parser authors will take the approach that's easier in their language (eg: using Python's built in normalisation).

Note that this is about CRLF in the document, NOT \r\n within the content, which MUST be preserved.

I'm happy to accept a PR clarifying this and amending the prose to make this clearer. I personally don't know how to do that, because I don't understand OP's source of confusion.

pradyunsg · 2021-08-11T18:33:54Z

Given:

str = '''Here's a newline{CR}{CR}{LF}'''

This is currently invalid, as @eksortso noted already.

Assuming #837 happens and you're normalising newlines, you'd parse this as the JSON:

{"str": "Here's a newline\n\n"}

It would be invalid to write this as:

str = '''Here's a newline{CR}{LF}'''

That's because you've changed the value. Reparsing this (again, with normalisation) would yield:

{"str": "Here's a newline\n"}

Notice how they're different? There's a missing newline, and you've changed the semantic meaning of the file.

Valid ways to write the parsed value would be:

str = '''Here's a newline{CR}{CR}{LF}'''

str = '''Here's a newline{CR}{LF}{CR}{LF}'''

str = '''Here's a newline{LF}{LF}'''

str = '''Here's a newline{CR}{CR}'''

I don't care which form the parsers choose. That's a flexibility the specification already provides the parsers. That said, two and three are probably what most parsers will spew out and what text editors (when configured to do so) will normalise newlines to. All of them parse the same when renormalised.

hukkin · 2021-08-11T18:40:14Z

I don't understand OP's source of confusion.

The source of confusion is that newline normalisation is only mentioned in the context of multi-line basic strings.

When it comes to multi-line literal strings, no newline normalisation is mentioned in the spec. And further to this, the spec says "content between the [multi-line literal] delimiters is interpreted as-is without modification" implying that newline normalisation or any other modifications are prohibited.

So even if you say that newline normalisation is allowed in multi-line literals (and I agree it should be), I'd argue that current TOML spec (v1.0.0) does not agree with us.

pradyunsg · 2021-08-11T18:46:06Z

Oh! I think I see it now! You're reading that to include {CR}/{LF} characters within "content" and I'm not.

Let's tweak the wording, to avoid the possibility of reading it that way.

pradyunsg · 2021-08-11T18:52:32Z

BTW @hukkin, I believe we're on at least one common Slack (EBP) and you have a mention there. :)

hukkin · 2021-08-15T09:13:19Z

Went ahead and took a stab at this #842 😸

eksortso · 2021-08-16T04:00:56Z

Sorry for vanishing lately. Life's been in the way. @hukkin I'll check your PR out tomorrow. Give me 24 hours.

eksortso · 2021-08-17T03:05:46Z

I left some notes on the PR. My preference is for @pradyunsg's language though. Newline normalization ought to be consistent between MLB strings and MLL strings, and it's made clear and definitive in MLL strings how newlines are treated differently from the rest of the content of the string.

pradyunsg added the clarification label Jul 11, 2021

srinivasreddy mentioned this issue Jul 14, 2021

Newline in keys should be escaped python-poetry/tomlkit#136

Closed

hukkin mentioned this issue Jul 22, 2021

Proposal: Allow carriage return as valid newline sequence #837

Closed

hukkin mentioned this issue Aug 15, 2021

Allow newline normalization in multi-line literal strings #842

Merged

pradyunsg closed this as completed in #842 Oct 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: Can newlines in multi-line literals be normalized? #835

Question: Can newlines in multi-line literals be normalized? #835

hukkin commented Jul 10, 2021

arp242 commented Jul 11, 2021

hukkin commented Jul 11, 2021

arp242 commented Jul 11, 2021

eksortso commented Jul 11, 2021 •

edited

Loading

hukkin commented Jul 11, 2021 •

edited

Loading

hukkin commented Jul 11, 2021

eksortso commented Jul 11, 2021

eksortso commented Jul 11, 2021

hukkin commented Jul 11, 2021 •

edited

Loading

hukkin commented Jul 30, 2021

pradyunsg commented Aug 11, 2021 •

edited

Loading

pradyunsg commented Aug 11, 2021 •

edited

Loading

hukkin commented Aug 11, 2021 •

edited

Loading

pradyunsg commented Aug 11, 2021

pradyunsg commented Aug 11, 2021 •

edited

Loading

hukkin commented Aug 15, 2021

eksortso commented Aug 16, 2021

eksortso commented Aug 17, 2021

Question: Can newlines in multi-line literals be normalized? #835

Question: Can newlines in multi-line literals be normalized? #835

Comments

hukkin commented Jul 10, 2021

arp242 commented Jul 11, 2021

hukkin commented Jul 11, 2021

arp242 commented Jul 11, 2021

eksortso commented Jul 11, 2021 • edited Loading

hukkin commented Jul 11, 2021 • edited Loading

hukkin commented Jul 11, 2021

eksortso commented Jul 11, 2021

eksortso commented Jul 11, 2021

hukkin commented Jul 11, 2021 • edited Loading

hukkin commented Jul 30, 2021

pradyunsg commented Aug 11, 2021 • edited Loading

pradyunsg commented Aug 11, 2021 • edited Loading

hukkin commented Aug 11, 2021 • edited Loading

pradyunsg commented Aug 11, 2021

pradyunsg commented Aug 11, 2021 • edited Loading

hukkin commented Aug 15, 2021

eksortso commented Aug 16, 2021

eksortso commented Aug 17, 2021

eksortso commented Jul 11, 2021 •

edited

Loading

hukkin commented Jul 11, 2021 •

edited

Loading

hukkin commented Jul 11, 2021 •

edited

Loading

pradyunsg commented Aug 11, 2021 •

edited

Loading

pradyunsg commented Aug 11, 2021 •

edited

Loading

hukkin commented Aug 11, 2021 •

edited

Loading

pradyunsg commented Aug 11, 2021 •

edited

Loading