-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enh(fsharp) Global overhaul #3348
Conversation
* Fix character literals not being parsed * Add missing keywords * Fix parsing of generic type symbols * Fix parsing of ML comments without space within * Improve attributes parsing * Add parsing of computation expressions * Add parsing of preprocessor and FSI directives * Add parsing for many operators * Add parsing of binary number literals * Add unit tests
src/languages/fsharp.js
Outdated
contains: ["self"] | ||
}), | ||
{ | ||
className: 'class', | ||
// type definitions: | ||
scope: 'title.class', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title.class
is for names of classes such as WordParser
, etc... (and class
alone is deprecated) you may want a multi-matcher here... what is the syntax you're trying to cover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this because I noticed class
was deprecated. I thought title.class
was the replacement.
The syntax I'm trying to cover is any type definition (the part before the first =
or (
):
type A = int * int // type abbreviation
type B = { FirstName: string; LastName: string } // record
type C = Circle of int | Rectangle of int * int // discriminated union
type D = Day | Month | Year // another discriminated union
type E<'a> = Choice1 of 'a | Choice2 of 'a * 'a // a generic type definition
type MyClass(initX: int) = // class
let x = initX
member _.Method() = printf "x=%i" x
It's currently parsed as follows:
I'm not sure I understand the meaning behind all the different standard scopes. What should I do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The syntax I'm trying to cover is any type definition (the part before the first = or ():
We do not care/desire this type of "semantic" flagging (that what we deprecated when we deprecated class
). IE we don't care about "class definitions". We only care about "class name/title" (which may or may not be in a definition). So above there should be NO nesting.
type
is a keywordA
is atitle.class
type A = int * int
(overall is nothing)
And also: * Add detection for many types, literals, and builtins * Add incomplete support for interpolated strings
I think I addressed most (all?) of your remarks. Sorry about the big diff. Most notable changes:
Since I wasn't sure whether you squash your PR or not, I added another commit with a more or less proper message. Thanks for your feedback, and tell me if I have missed something! |
src/languages/fsharp.js
Outdated
/\b[_a-z]\w*/, | ||
/(?=\s*\{)/ | ||
], | ||
beginScope: { 1: 'keyword' } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Traditionally we'd just use regex.concat
or inline it (as it was before)... I'll give it some thought but not sure we want to start using multi-matchers just in this way (just to split out look-aheads).
src/languages/fsharp.js
Outdated
/(?<!<)<<(?!<)/, // << (but not < or <<<) | ||
/(?<!>)>>(?!>)/, // >> (but not > or >>>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't yet support negative lookbehind because not all browsers support it well yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, honestly I don't know what to do about that. What do you suggest?
- Match the variants I didn't want to match and accept having no real logic to what is highlighted or not? (
>>
is used to combine functions,>>>
is a binary right shift) - Do some magic I don't know about to replace the negative look behind?
- Do some hack by matching space before/after?
- ...?
(Do you know how it behaves on browsers that don't support it? Can't we keep it if it fails more or less gracefully)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Match the variants I didn't want to match and accept having no real logic to what is highlighted or not? (>> is used to combine functions, >>> is a binary right shift)
Why do we need look-behind just match >>>
, then >>
?... matching a thing doesn't mean you have to highlight it... often you will purposely match something to "consume" it just so that a later rule won't match and highlight.
If you don't want to highlight >>>
just match it without a scope. Though why are we highlighting some operators and not others?
Do some magic I don't know about to replace the negative look behind?
Nested rules, multi-matchers... there are hacks, but nothing that works exactly the same as a real look-behind.
Do you know how it behaves on browsers that don't support it? Can't we keep it if it fails more or less gracefully)
Different behavior on supported browsers it not at all desirable, so until it's universal we simply can't use it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't want to highlight >>> just match it without a scope.
Ah that makes sense! I guess it didn't occur to me it was an option :)
Though why are we highlighting some operators and not others?
Mostly because I am afraid to highlight things in places where they should not be highlighted.
F# allows to define new operators using sequences of certain characters, and I didn't want to risk only partially highlighting a custom operator for example. So I thought sticking with standard non-arithmetic operators would be a good compromise for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, now that I'm looking closer not highlighting the basic math operators seems very strange to me... couldn't we just detect "operator like expressions" with a regex and just highlight ALL potential custom defined operators? I'm not sure that would have been my first choice, but I really dislike not highlighting the basic operators.
Would doing this conflict with other language features we're not already parsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would doing this conflict with other language features we're not already parsing?
That's part of what worries me... For example I think I removed explicit parsing of the generic type definition brackets to simplify things. They would need to be re-added, along with some more matching for this pattern when it's not part of a type definition (e.g. let f (x: inref<SomeStruct>) = x.SomeField
)
I fear there are a lot of patterns like that I don't necessarily have in mind...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially wanted to highlight only the most idiomatic operators (maybe |>
, <|
, >>
, <<
), but it seemed too arbitrary, that's why I settled on the set I chose...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are seeing why it's not a trivial issue. Many languages overload <>
just to name one (as you're seeing here with generics).
For example I think I removed explicit parsing of the generic type definition brackets to simplify things. They would need to be re-added
Generally the trend is to "parse" as little as possible and track as little context as possible to get the job done. Our engine allows us to do some CRAZY things (act much more like a real parser) - but that doesn't mean they are good ideas. Often the complexity and pain of future maintenance makes it ill-advised. If you think you have an easy solution I'm happy to look at it but if it looks too complex/messy it may get vetoed.
I fear there are a lot of patterns like that I don't necessarily have in mind...
This is the worry though... this is why JS doesn't have operator support yet, the <>
problem is a hard one... plus in JSX we have random bits of HTML also...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... now that you mention it, <>
is the boolean not operator in F# ^^'
It's a bit sad, but maybe not matching any operator is safer then...
It's really frustrating, especially after I thought I managed to do what I wanted with look-behinds...
src/languages/fsharp.js
Outdated
/<-/, | ||
/\.\./, | ||
/::/, | ||
/(?<!\|)\|(?!\|)/ // | (but not || or |||) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't yet support negative lookbehind because not all browsers support it well yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty sharp. I didn't focus on the relevance stuff as I'll take a final pass at that myself and likely change a few things. Traditionally we don't give operators relevance at all because it would be too hard to balance all the grammars. And there are other unspoken rules I don't expect you to be aware of. :)
No worries, language "reboots" are always a lot of effort...
Almost always squash (myself), but during dev I prefer to see the individual commits to make reviewing easier for long-running RPs.
That is appreciated! |
A one-liner saying the grammar has been rewritten and greatly improved sounds about right. |
src/languages/fsharp.js
Outdated
begin: /^\s*\[<(?=[<\w])/, | ||
excludeBegin: true, | ||
end: />\]/, | ||
excludeEnd: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also replace this excludeEnd
with a look-ahead in the end
regex, what do you think?
Should I also replace the excludeBegin
with a begin
regex in a look behind like so (with a look ahead for the expected content): /(?<=^\s*\[<)(?=\s*\w)/
?
(and I just noticed the current regex has an extra <
in the content look ahead that I inadvertently added when copy pasting from the csharp version...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and resolved these [lookahead vs excludeEnd] in my commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I saw, but what about the excludeBegin
? Should we also strive to replace them with look behinds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also strive to replace them with look behinds?
No, because we can't use look-behinds at all yet - see prior discussion. When we support look-behind that's definitely something we'd have a look at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't use look-behinds at all yet
Ah I did not understand that, I thought it was only negative look behind we couldn't use, my bad.
Why does the doc seems to indicate look-behinds in begin
are supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key is in the wording:
look-behind matching (when JS supports it) for begin (#2135)
The key being "when JS [browsers] supports it". Nothing in our engine prevents them from working for begin - if the JS engine includes support. (as you've seen when using them initially) If you wanted to fork the library or use a 3rd party grammar that uses them - go for it - you just lose support for Mac OS Safari. At least one 3rd party library does feature detection - resulting in inconsistent behavior across browsers - but thats' not a tradeoff we're interested in for Core.
However, the engine itself would break if look-behind was used in end
because of the way our engine is currently designed - the end matchers do not always have access to preceding content, so there is nothing to "look behind at".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'm still a bit confused by the subtlety, but ok ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I feel like some operators are an important way to distinguish F# from other languages (for example the If you don't want to add relevance to any operator, my split between "relevant" and "irrelevant" operators should be removed, as it seems rather fragile and might cause some problems if we are not careful (because of overlapping matches, e.g. between
Honestly, I didn't expect I would spend so much time on this, but now I'm in too deep to go back :) |
"__LINE__", | ||
"__SOURCE_DIRECTORY__", | ||
"__SOURCE_FILE__" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider variable.language
or variable.constant
? These look like contstants to me, not literals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I didn't notice these scopes!
I'm not sure I understand all the nuances between keyword
, built_in
, literal
, variable.language
, and variable.constant
... (the doc is sometimes confusing, e.g. for built_in
it mentions it could be used for constants?)
Which one should be used for user-defined constants, and which one for language level constants/macro?
As I understand it, __LINE__
and co are some kind of magical constants whose values are set by the compiler. Does that fit with variable.constant
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand all the nuances between keyword, built_in, literal, variable.language, and variable.constant...
Me either, LOL... it's a moving target sometimes (with the new ones). The more nuanced ones are newer scopes and we're trying to use them more. In the past the older scopes did more "work"...
With regard to current understanding of the older scopes:
- keyword - key language reserved words (if, else, break, continue, etc)
- built_in (built in library functions or variables) - though now this can intermingle with
variable.language
as well astitle.function
and can sometimes be nebulous - literals - the name is literally the value,
true
,false
,null
Which one should be used for user-defined constants
variable.constant
which one for language level constants/macro?
I think variable.constant
is correct... variable.language
[by contrast] often refers to things such as this
, self
, etc... variables that are magical to the language in question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like F# allows you to define your own self identifier so sadly that's not something we can highlight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks for the explanations!
Looks like F# allows you to define your own self identifier so sadly that's not something we can highlight.
Are you refering to this
or self
? If so, I think it's more that F# does not define such a keyword, and allows the developer to use whatever variable name they want to refer to the current object: https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/classes#self-identifiers
EDIT: hum, actually that's almost exactly what you said!
src/languages/fsharp.js
Outdated
"failwithf" | ||
]; | ||
|
||
// (* potentially multi-line ML style comment *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "multi-line ML" is that not redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ML as in "Metal Language", as in "ML family language". It's unfortunate it can also be interpreted as multi-line.
(these languages all have in common (* a specific syntax for comments *)
).
Would you prefer I replaced the acronym with the full name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer I replaced the acronym with the full name?
Yep!
Except it's not exclusive. The pipe operator The problem is we have almost 200 first-party languages and unless FULL operator support is added to all of them simultaneously (which is a HUGE undertaking, and quite possibly also impossible for some) then auto-detect is going to be just entirely broken if we start giving a select few relevance points. For example allowing F# to score relevance would lead to many false positives of Elixir code (which doesn't yet include operator relevance)... etc, etc. So the first task would be: Add operator highlighting to ALL grammars. After that was done one could circle back and look at the pros and cons of using it for relevance. There are still other issues... even if you think you have a cool exclusive operator I've seen the weirdest ASCII art in comments that can also be problematic, such as a
Yes I've seen such things. :-) This may be slightly improved since now we have |
About operator relevance: ok I see, thanks for the explanation, it was probably a wrong assumption/rookie mistake on my part :) |
I added several commits with fixes.
still need to address the negative look-behinds for operators, but it's proving trickier than just matching some without assigning them a scope... |
Let me know if I can help, operators are devilishly hard in some languages because we are only a pattern matcher and can't understand the context of operators used as non-operators, though perhaps you're dealing with something else entirely. |
Well, I only just begun matching {
// consume some operators without assigning them a scope:
match: regex.either(
// we only want to match << and >> for now, so ignore these:
/</,
/<<</,
/>/,
/>>>/
),
relevance: 0
},
{
// only non arithmetic operators that we can confidently match:
scope: 'operator',
match: regex.either(
/\|{1,3}>/, // |> ||> |||>
/<\|{1,3}/, // <| <|| <|||
/<</,
/>>/,
/:>/,
/:\?>?/, // :?> :?
/->/,
/<-/,
/\.\./,
/::/,
/(?<!\|)\|(?!\|)/ // | (but not || or |||)
),
relevance: 0
}, |
Order matters, take:
This will never match
Yes, because the |
Does F# have a default formatter or well enforced code/lint standards? Sometimes this can help as I don't mind taking that into account if it's common enough as to be a non-brainer. |
I think fantomas is becoming the de-facto code formatter (It's even mentioned in the official F# code formatting guidelines) But I'm not sure everyone agrees on the details, and the guidelines mention some formatting that is accepted but that tools will override anyway... (I'm not sure what comment you were replying to and what you have in mind?) |
Sometimes if it helps we can "ignore" poorly formatted code and often that makes our job easier since then there are often guaranteed space boundaries around some things that make detection slightly simple. Until the code golf people show up. :-) |
Ah yes, I don't feel comfortable relying on this but I see the appeal :) |
It can sometimes work well when it's scoped to certain parts of the grammar... "Oh your operators aren't always highlighted, then perhaps write more idiomatic code with nice whitespace, etc..."... and if it means the difference between not highlighting operators at all for anyone or highlighting them for 95% of people who write idiomatic code, then I'd err on the latter. |
It's too complicated and error prone, let's revisit this later...
src/languages/fsharp.js
Outdated
{ | ||
// only non arithmetic operators that we can confidently match: | ||
scope: 'operator', | ||
match: regex.either( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to update test suite also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm, not sure how merging main
back in broke this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, only sync ran...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just need to make the tests green again.
Since F# allows most of its grammar to be present inside string placeholders, I moved several modes to constants and used them both in the main mode and in the SUBST mode (used for string placeholders). Matching interpolated strings was previously very buggy, because INTERPOLATED_TRIPLE_QUOTED_STRING.begin was matching on $ instead of \$ -_-"
Okay, hopefully this was the last big change. I was ready to also remove the (buggy) support for interpolated strings ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost. :)
@mlaily Thanks for all the hard work! |
And thanks for the long review and for being patient with me! |
Fixes #3347 as well as a lot of other things F# related.
Changes
Here is the high level list of things that changed (Also listed in the commit description):
Checklist
CHANGES.md
I have two questions left:
relevance
black magic", but I'm not sure where to start)