-
Notifications
You must be signed in to change notification settings - Fork 865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question: Can newlines in multi-line literals be normalized? #835
Comments
What exactly do you mean with "normalized"? ABNF only describes the syntax (that it, what characters are allowed), and doesn't really describe the semantics (i.e. how it's interpreted or processed). |
Normalize as in convert CRLF to LF or vice versa. |
Oh right. That's basically just up to the implementation; from https://toml.io/en/v1.0.0 :
|
@hukkin Good catch. Definitely a subtle point. My personal take is that what matters most about newlines in MLL strings isn't how they appear in the document but how the resulting string represents them, for two reasons. Primarily, aside from newline characters, the only control character permitted in an MLL string is the tab. So fine control over the special characters Second, as noted, the precedent for newline normalization is set in the Multiline Basic string definition, which should be repeated in the MLL string definition:
I'm willing to draft a PR to set this straight. |
Would be great, thanks! Should we make the sentence I quoted in the first post (that says "as-is without modification" etc.) mention this exception? Now that I have your attention 😄 I've got another question related to newline normalization: is a parser allowed to normalize more than once? I'll try to explain what this means with an example where I use
A parser is allowed to convert the
And on second parse we're allowed to parse
So yeah the question is, can a parser add/remove an arbitrary amount of CR characters to prefix an LF character. Or is it only allowed to add/remove one? Could or should the spec take a stance in this? Maybe it would be easiest to redefine newline using the regex |
Oh also, if we allow newline normalization in multi-line literals, I think we may as well generalize it so that newline normalization is allowed globally in the entire TOML document? |
I'll need to think about that.
Gotta stop you there. This example is invalid. The last three characters inside the delimiters are two literal CRs followed by a literal LF... but that first CR is not permitted. It's not part of a CRLF, it isn't a newline on its own, and aside from tabs, no other control code is allowed. [...]
Every CR in a literal string must be followed by a LF. So normalization would only happen once at most. |
That doesn't make sense. The only places within a TOML document where a physical newline is converted into a value containing a newline are multiline strings. Nowhere else would a parser be called upon to normalize newlines. Arrays may have newlines between values, but those newlines aren't part of the containing array when decoded. |
Oh of course, thanks!
Yeah perhaps you're right. The sense to me would be
|
I'd much appreciate a maintainer's take on this issue (maybe @pradyunsg ?). Currently my parser does newline normalization in multi-line literals which strictly speaking is a spec violation in TOML v1.0.0. But if the spec is about to change in this regard, then I'd prefer to keep the bug and wait until the spec is in line with it. |
Sorry, I've not had time to look at this properly before today. I've tried to figure out how OP I reading the specification, to determine that newline normalization is not permitted, at least twice before and failed to understand how they're parsing the prose in the specification.
An ABNF grammar describes what inputs to a parser are acceptable (ignoring semantics). What those inputs translate into as values is not something that it covers -- that is described in the prose. To answer the question in OP, yes. I don't think we should make newlines (which are invisible characters) significant somehow. If some parsers want to not normalize them, they're free to. Same is true for the other way round though. What a TOML parser does with newlines in a multi-line string is their prerogative. I think that the language in the specification says that parsers are free to do whatever they want w.r.t. newline normalization. Given a process of unnormalised-text -> unnormalised-parsed -> normalised-text -> normalised-parsed; if both the parsed values look the same, there's absolutely no harm in normalising the content. Whether a encoder-decoder pair care about the potential for mismatch between unnormalised-text and normalised-text is a decision for the parser author to make. I don't see any reason what we'd want to disallow or require either approach. I expect most parser authors will take the approach that's easier in their language (eg: using Python's built in normalisation). Note that this is about CRLF in the document, NOT \r\n within the content, which MUST be preserved. I'm happy to accept a PR clarifying this and amending the prose to make this clearer. I personally don't know how to do that, because I don't understand OP's source of confusion. |
Given: str = '''Here's a newline{CR}{CR}{LF}''' This is currently invalid, as @eksortso noted already. Assuming #837 happens and you're normalising newlines, you'd parse this as the JSON: {"str": "Here's a newline\n\n"} It would be invalid to write this as: str = '''Here's a newline{CR}{LF}''' That's because you've changed the value. Reparsing this (again, with normalisation) would yield: {"str": "Here's a newline\n"} Notice how they're different? There's a missing newline, and you've changed the semantic meaning of the file. Valid ways to write the parsed value would be: str = '''Here's a newline{CR}{CR}{LF}''' str = '''Here's a newline{CR}{LF}{CR}{LF}''' str = '''Here's a newline{LF}{LF}''' str = '''Here's a newline{CR}{CR}''' I don't care which form the parsers choose. That's a flexibility the specification already provides the parsers. That said, two and three are probably what most parsers will spew out and what text editors (when configured to do so) will normalise newlines to. All of them parse the same when renormalised. |
The source of confusion is that newline normalisation is only mentioned in the context of multi-line basic strings. When it comes to multi-line literal strings, no newline normalisation is mentioned in the spec. And further to this, the spec says "content between the [multi-line literal] delimiters is interpreted as-is without modification" implying that newline normalisation or any other modifications are prohibited. So even if you say that newline normalisation is allowed in multi-line literals (and I agree it should be), I'd argue that current TOML spec (v1.0.0) does not agree with us. |
Oh! I think I see it now! You're reading that to include {CR}/{LF} characters within "content" and I'm not. Let's tweak the wording, to avoid the possibility of reading it that way. |
BTW @hukkin, I believe we're on at least one common Slack (EBP) and you have a mention there. :) |
Went ahead and took a stab at this #842 😸 |
Sorry for vanishing lately. Life's been in the way. @hukkin I'll check your PR out tomorrow. Give me 24 hours. |
I left some notes on the PR. My preference is for @pradyunsg's language though. Newline normalization ought to be consistent between MLB strings and MLL strings, and it's made clear and definitive in MLL strings how newlines are treated differently from the rest of the content of the string. |
The spec only speaks of newline normalization in the context of multi-line basic strings and is in fact quite literal about the fact that "content between the [multi-line literal] delimiters is interpreted as-is without modification".
I'm not an expert on reading ABNF, but the ABNF spec says
that is, it does not include
newline
inmll-char
(multi-line literal char), making me think perhaps the normalization is, nonetheless, allowed in multi-line literals as well.I'd much appreciate clarification/someone else's interpretation.
The text was updated successfully, but these errors were encountered: