Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh(fsharp) Global overhaul #3348

Merged
merged 9 commits into from
Oct 7, 2021
Merged

enh(fsharp) Global overhaul #3348

merged 9 commits into from
Oct 7, 2021

Conversation

mlaily
Copy link
Contributor

@mlaily mlaily commented Oct 3, 2021

Fixes #3347 as well as a lot of other things F# related.

Changes

Here is the high level list of things that changed (Also listed in the commit description):

  • Fix character literals not being parsed
  • Add missing keywords
  • Fix parsing of generic type symbols
  • Fix parsing of ML comments without space within
  • Improve attributes parsing
  • Add parsing of computation expressions
  • Add parsing of preprocessor and FSI directives
  • Add parsing for many operators
  • Add parsing of binary number literals
  • Add unit tests

Checklist

  • Added markup tests
  • Updated the changelog at CHANGES.md

I have two questions left:

  • What should I put in the changelog?
  • It seems I have broken some language auto-detection tests. What's the best way to fix it? (I suspect it's part of the "relevance black magic", but I'm not sure where to start)

* Fix character literals not being parsed
* Add missing keywords
* Fix parsing of generic type symbols
* Fix parsing of ML comments without space within
* Improve attributes parsing
* Add parsing of computation expressions
* Add parsing of preprocessor and FSI directives
* Add parsing for many operators
* Add parsing of binary number literals
* Add unit tests
contains: ["self"]
}),
{
className: 'class',
// type definitions:
scope: 'title.class',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

title.class is for names of classes such as WordParser, etc... (and class alone is deprecated) you may want a multi-matcher here... what is the syntax you're trying to cover?

Copy link
Contributor Author

@mlaily mlaily Oct 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this because I noticed class was deprecated. I thought title.class was the replacement.

The syntax I'm trying to cover is any type definition (the part before the first = or ():

type A = int * int // type abbreviation
type B = { FirstName: string; LastName: string } // record
type C = Circle of int | Rectangle of int * int // discriminated union
type D = Day | Month | Year // another discriminated union
type E<'a> = Choice1 of 'a | Choice2 of 'a * 'a // a generic type definition

type MyClass(initX: int) = // class
   let x = initX
   member _.Method() = printf "x=%i" x

It's currently parsed as follows:
image

I'm not sure I understand the meaning behind all the different standard scopes. What should I do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The syntax I'm trying to cover is any type definition (the part before the first = or ():

We do not care/desire this type of "semantic" flagging (that what we deprecated when we deprecated class). IE we don't care about "class definitions". We only care about "class name/title" (which may or may not be in a definition). So above there should be NO nesting.

  • type is a keyword
  • A is a title.class
  • type A = int * int (overall is nothing)

And also:
* Add detection for many types, literals, and builtins
* Add incomplete support for interpolated strings
@mlaily
Copy link
Contributor Author

mlaily commented Oct 4, 2021

I think I addressed most (all?) of your remarks.

Sorry about the big diff.

Most notable changes:

  • I replaced my string regex with different modes (I used csharp as a starting point)
  • In the process, I added some support for interpolated strings ($"...{1+1}..." // prints ...2...).
    • Some edge cases do not parse correctly, but it's better than before so I left it.
    • test/markup/fsharp/strings.txt contains some breaking examples.
  • I split all the big remaining regexes
  • I used multi-matching where it was possible and beneficial
  • I added detection for many builtin types and other literals (initially to help with auto-detection, but it is also nicer to look at)
  • I managed to fix auto-detection by fiddling with relevance a bit (notably by splitting operators detection and assign a zero relevance to some)

Since I wasn't sure whether you squash your PR or not, I added another commit with a more or less proper message.

Thanks for your feedback, and tell me if I have missed something!

/\b[_a-z]\w*/,
/(?=\s*\{)/
],
beginScope: { 1: 'keyword' }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traditionally we'd just use regex.concat or inline it (as it was before)... I'll give it some thought but not sure we want to start using multi-matchers just in this way (just to split out look-aheads).

Comment on lines 421 to 422
/(?<!<)<<(?!<)/, // << (but not < or <<<)
/(?<!>)>>(?!>)/, // >> (but not > or >>>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't yet support negative lookbehind because not all browsers support it well yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, honestly I don't know what to do about that. What do you suggest?

  • Match the variants I didn't want to match and accept having no real logic to what is highlighted or not? (>> is used to combine functions, >>> is a binary right shift)
  • Do some magic I don't know about to replace the negative look behind?
  • Do some hack by matching space before/after?
  • ...?

(Do you know how it behaves on browsers that don't support it? Can't we keep it if it fails more or less gracefully)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Match the variants I didn't want to match and accept having no real logic to what is highlighted or not? (>> is used to combine functions, >>> is a binary right shift)

Why do we need look-behind just match >>>, then >>?... matching a thing doesn't mean you have to highlight it... often you will purposely match something to "consume" it just so that a later rule won't match and highlight.

If you don't want to highlight >>> just match it without a scope. Though why are we highlighting some operators and not others?

Do some magic I don't know about to replace the negative look behind?

Nested rules, multi-matchers... there are hacks, but nothing that works exactly the same as a real look-behind.

Do you know how it behaves on browsers that don't support it? Can't we keep it if it fails more or less gracefully)

Different behavior on supported browsers it not at all desirable, so until it's universal we simply can't use it at all.

Copy link
Contributor Author

@mlaily mlaily Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't want to highlight >>> just match it without a scope.

Ah that makes sense! I guess it didn't occur to me it was an option :)

Though why are we highlighting some operators and not others?

Mostly because I am afraid to highlight things in places where they should not be highlighted.
F# allows to define new operators using sequences of certain characters, and I didn't want to risk only partially highlighting a custom operator for example. So I thought sticking with standard non-arithmetic operators would be a good compromise for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, now that I'm looking closer not highlighting the basic math operators seems very strange to me... couldn't we just detect "operator like expressions" with a regex and just highlight ALL potential custom defined operators? I'm not sure that would have been my first choice, but I really dislike not highlighting the basic operators.

Would doing this conflict with other language features we're not already parsing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would doing this conflict with other language features we're not already parsing?

That's part of what worries me... For example I think I removed explicit parsing of the generic type definition brackets to simplify things. They would need to be re-added, along with some more matching for this pattern when it's not part of a type definition (e.g. let f (x: inref<SomeStruct>) = x.SomeField)

I fear there are a lot of patterns like that I don't necessarily have in mind...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially wanted to highlight only the most idiomatic operators (maybe |>, <|, >>, <<), but it seemed too arbitrary, that's why I settled on the set I chose...

Copy link
Member

@joshgoebel joshgoebel Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are seeing why it's not a trivial issue. Many languages overload <> just to name one (as you're seeing here with generics).

For example I think I removed explicit parsing of the generic type definition brackets to simplify things. They would need to be re-added

Generally the trend is to "parse" as little as possible and track as little context as possible to get the job done. Our engine allows us to do some CRAZY things (act much more like a real parser) - but that doesn't mean they are good ideas. Often the complexity and pain of future maintenance makes it ill-advised. If you think you have an easy solution I'm happy to look at it but if it looks too complex/messy it may get vetoed.

I fear there are a lot of patterns like that I don't necessarily have in mind...

This is the worry though... this is why JS doesn't have operator support yet, the <> problem is a hard one... plus in JSX we have random bits of HTML also...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... now that you mention it, <> is the boolean not operator in F# ^^'

It's a bit sad, but maybe not matching any operator is safer then...

It's really frustrating, especially after I thought I managed to do what I wanted with look-behinds...

/<-/,
/\.\./,
/::/,
/(?<!\|)\|(?!\|)/ // | (but not || or |||)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't yet support negative lookbehind because not all browsers support it well yet.

Copy link
Member

@joshgoebel joshgoebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty sharp. I didn't focus on the relevance stuff as I'll take a final pass at that myself and likely change a few things. Traditionally we don't give operators relevance at all because it would be too hard to balance all the grammars. And there are other unspoken rules I don't expect you to be aware of. :)

@joshgoebel
Copy link
Member

Sorry about the big diff.

No worries, language "reboots" are always a lot of effort...

Since I wasn't sure whether you squash your PR or not,

Almost always squash (myself), but during dev I prefer to see the individual commits to make reviewing easier for long-running RPs.

I added another commit with a more or less proper message.

That is appreciated!

@joshgoebel
Copy link
Member

What should I put in the changelog?

A one-liner saying the grammar has been rewritten and greatly improved sounds about right.

begin: /^\s*\[<(?=[<\w])/,
excludeBegin: true,
end: />\]/,
excludeEnd: true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also replace this excludeEnd with a look-ahead in the end regex, what do you think?

Should I also replace the excludeBegin with a begin regex in a look behind like so (with a look ahead for the expected content): /(?<=^\s*\[<)(?=\s*\w)/?

(and I just noticed the current regex has an extra < in the content look ahead that I inadvertently added when copy pasting from the csharp version...)

Copy link
Member

@joshgoebel joshgoebel Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and resolved these [lookahead vs excludeEnd] in my commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I saw, but what about the excludeBegin? Should we also strive to replace them with look behinds?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also strive to replace them with look behinds?

No, because we can't use look-behinds at all yet - see prior discussion. When we support look-behind that's definitely something we'd have a look at.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't use look-behinds at all yet

Ah I did not understand that, I thought it was only negative look behind we couldn't use, my bad.

Why does the doc seems to indicate look-behinds in begin are supported?

Copy link
Member

@joshgoebel joshgoebel Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key is in the wording:

look-behind matching (when JS supports it) for begin (#2135)

The key being "when JS [browsers] supports it". Nothing in our engine prevents them from working for begin - if the JS engine includes support. (as you've seen when using them initially) If you wanted to fork the library or use a 3rd party grammar that uses them - go for it - you just lose support for Mac OS Safari. At least one 3rd party library does feature detection - resulting in inconsistent behavior across browsers - but thats' not a tradeoff we're interested in for Core.

However, the engine itself would break if look-behind was used in end because of the way our engine is currently designed - the end matchers do not always have access to preceding content, so there is nothing to "look behind at".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm still a bit confused by the subtlety, but ok ^^

Copy link
Member

@joshgoebel joshgoebel Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mlaily
Copy link
Contributor Author

mlaily commented Oct 4, 2021

This is looking pretty sharp. I didn't focus on the relevance stuff as I'll take a final pass at that myself and likely change a few things. Traditionally we don't give operators relevance at all because it would be too hard to balance all the grammars. And there are other unspoken rules I don't expect you to be aware of. :)

Thanks. I feel like some operators are an important way to distinguish F# from other languages (for example the |> operator used everywhere), but I'll leave it to you.

If you don't want to add relevance to any operator, my split between "relevant" and "irrelevant" operators should be removed, as it seems rather fragile and might cause some problems if we are not careful (because of overlapping matches, e.g. between | and |>)

No worries, language "reboots" are always a lot of effort...

Honestly, I didn't expect I would spend so much time on this, but now I'm in too deep to go back :)

Comment on lines +109 to +111
"__LINE__",
"__SOURCE_DIRECTORY__",
"__SOURCE_FILE__"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider variable.language or variable.constant? These look like contstants to me, not literals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I didn't notice these scopes!

I'm not sure I understand all the nuances between keyword, built_in, literal, variable.language, and variable.constant... (the doc is sometimes confusing, e.g. for built_in it mentions it could be used for constants?)

Which one should be used for user-defined constants, and which one for language level constants/macro?
As I understand it, __LINE__ and co are some kind of magical constants whose values are set by the compiler. Does that fit with variable.constant?

Copy link
Member

@joshgoebel joshgoebel Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand all the nuances between keyword, built_in, literal, variable.language, and variable.constant...

Me either, LOL... it's a moving target sometimes (with the new ones). The more nuanced ones are newer scopes and we're trying to use them more. In the past the older scopes did more "work"...

With regard to current understanding of the older scopes:

  • keyword - key language reserved words (if, else, break, continue, etc)
  • built_in (built in library functions or variables) - though now this can intermingle with variable.language as well as title.function and can sometimes be nebulous
  • literals - the name is literally the value, true, false, null

Which one should be used for user-defined constants

variable.constant

which one for language level constants/macro?

I think variable.constant is correct... variable.language [by contrast] often refers to things such as this, self, etc... variables that are magical to the language in question.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like F# allows you to define your own self identifier so sadly that's not something we can highlight.

Copy link
Contributor Author

@mlaily mlaily Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks for the explanations!

Looks like F# allows you to define your own self identifier so sadly that's not something we can highlight.

Are you refering to this or self? If so, I think it's more that F# does not define such a keyword, and allows the developer to use whatever variable name they want to refer to the current object: https://docs.microsoft.com/en-us/dotnet/fsharp/language-reference/classes#self-identifiers

EDIT: hum, actually that's almost exactly what you said!

"failwithf"
];

// (* potentially multi-line ML style comment *)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "multi-line ML" is that not redundant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ML as in "Metal Language", as in "ML family language". It's unfortunate it can also be interpreted as multi-line.

(these languages all have in common (* a specific syntax for comments *)).

Would you prefer I replaced the acronym with the full name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer I replaced the acronym with the full name?

Yep!

@joshgoebel
Copy link
Member

Thanks. I feel like some operators are an important way to distinguish F# from other languages (for example the |> operator used everywhere),

Except it's not exclusive. The pipe operator |> (as I know it) is used everywhere in Elixir and there are proposals to add it to Wren also - and that's just two other languages I know of off the top of my head.

The problem is we have almost 200 first-party languages and unless FULL operator support is added to all of them simultaneously (which is a HUGE undertaking, and quite possibly also impossible for some) then auto-detect is going to be just entirely broken if we start giving a select few relevance points.

For example allowing F# to score relevance would lead to many false positives of Elixir code (which doesn't yet include operator relevance)... etc, etc.

So the first task would be: Add operator highlighting to ALL grammars.

After that was done one could circle back and look at the pros and cons of using it for relevance.


There are still other issues... even if you think you have a cool exclusive operator I've seen the weirdest ASCII art in comments that can also be problematic, such as a *= operator in a hash comment language for example and then you have C code like:

/* =*=*=*=*=*=*=*=*=*= main code =*=*=*=*=*=*=*=*=*= */

Yes I've seen such things. :-) This may be slightly improved since now we have MAX_KEYWORD_HITS to protect us against keyword stuffing, but until someone shows me otherwise I'm going to be conservative on operator relevance for now.

@mlaily
Copy link
Contributor Author

mlaily commented Oct 5, 2021

About operator relevance: ok I see, thanks for the explanation, it was probably a wrong assumption/rookie mistake on my part :)

@mlaily
Copy link
Contributor Author

mlaily commented Oct 5, 2021

I added several commits with fixes.

I replaced the excludeBegin with a look-behind in a separate commit, feel free to discard it if you prefer excludeBegin.
EDIT: I removed the look behind commit.

still need to address the negative look-behinds for operators, but it's proving trickier than just matching some without assigning them a scope...

@joshgoebel
Copy link
Member

but it's proving trickier than just matching some without assigning them a scope...

Let me know if I can help, operators are devilishly hard in some languages because we are only a pattern matcher and can't understand the context of operators used as non-operators, though perhaps you're dealing with something else entirely.

@mlaily
Copy link
Contributor Author

mlaily commented Oct 5, 2021

but it's proving trickier than just matching some without assigning them a scope...

Let me know if I can help, operators are devilishly hard in some languages because we are only a pattern matcher and can't understand the context of operators used as non-operators, though perhaps you're dealing with something else entirely.

Well, I only just begun matching <, >, <<<, and >>> without a scope so that I can then match << and >> (the only ones I want), and it's breaking all the constructs containing a <, like <- or <|. What I don't understand is that -> and |> are still matched. It does not make sense to me!

{
  // consume some operators without assigning them a scope:
  match: regex.either(
    // we only want to match << and >> for now, so ignore these:
    /</,
    /<<</,
    />/,
    />>>/
  ),
  relevance: 0
},
{
// only non arithmetic operators that we can confidently match:
scope: 'operator',
match: regex.either(
  /\|{1,3}>/, // |> ||> |||>
  /<\|{1,3}/, // <| <|| <|||
  /<</,
  />>/,
  /:>/,
  /:\?>?/, // :?> :?
  /->/,
  /<-/,
  /\.\./,
  /::/,
  /(?<!\|)\|(?!\|)/ // | (but not || or |||)
),
relevance: 0
},

image

@joshgoebel
Copy link
Member

joshgoebel commented Oct 5, 2021

Order matters, take:

    /</,
    /<<</,

This will never match <<< because the < will gobble up all the < one by one... The engine just matches rules sequentially and moves on. <<< should come first.

and it's breaking all the constructs containing a <, like <- or <|

Yes, because the < has been gobbled up by your "consumer". You'd need to use a negative look-ahead on your "consumer" to only match < when it wasn't followed by -, |, etc...

@joshgoebel
Copy link
Member

Does F# have a default formatter or well enforced code/lint standards? Sometimes this can help as I don't mind taking that into account if it's common enough as to be a non-brainer.

@mlaily
Copy link
Contributor Author

mlaily commented Oct 5, 2021

Does F# have a default formatter or well enforced code/lint standards? Sometimes this can help as I don't mind taking that into account if it's common enough as to be a non-brainer.

I think fantomas is becoming the de-facto code formatter (It's even mentioned in the official F# code formatting guidelines)

But I'm not sure everyone agrees on the details, and the guidelines mention some formatting that is accepted but that tools will override anyway...

(I'm not sure what comment you were replying to and what you have in mind?)

@joshgoebel
Copy link
Member

(I'm not sure what comment you were replying to and what you have in mind?)

Sometimes if it helps we can "ignore" poorly formatted code and often that makes our job easier since then there are often guaranteed space boundaries around some things that make detection slightly simple. Until the code golf people show up. :-)

@mlaily
Copy link
Contributor Author

mlaily commented Oct 5, 2021

Sometimes if it helps we can "ignore" poorly formatted code and often that makes our job easier since then there are often guaranteed space boundaries around some things that make detection slightly simple. Until the code golf people show up. :-)

Ah yes, I don't feel comfortable relying on this but I see the appeal :)

@joshgoebel
Copy link
Member

joshgoebel commented Oct 6, 2021

Ah yes, I don't feel comfortable relying on this but I see the appeal :)

It can sometimes work well when it's scoped to certain parts of the grammar... "Oh your operators aren't always highlighted, then perhaps write more idiomatic code with nice whitespace, etc..."... and if it means the difference between not highlighting operators at all for anyone or highlighting them for 95% of people who write idiomatic code, then I'd err on the latter.

mlaily and others added 2 commits October 6, 2021 22:31
It's too complicated and error prone, let's revisit this later...
{
// only non arithmetic operators that we can confidently match:
scope: 'operator',
match: regex.either(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update test suite also.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm, not sure how merging main back in broke this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, only sync ran...

Copy link
Member

@joshgoebel joshgoebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to make the tests green again.

Since F# allows most of its grammar to be present
inside string placeholders, I moved several modes
to constants and used them both in the main mode
and in the SUBST mode (used for string placeholders).

Matching interpolated strings was previously very buggy,
because INTERPOLATED_TRIPLE_QUOTED_STRING.begin
was matching on $ instead of \$    -_-"
@mlaily
Copy link
Contributor Author

mlaily commented Oct 6, 2021

Okay, hopefully this was the last big change.

I was ready to also remove the (buggy) support for interpolated strings ($"{1+1}") when I noticed I had a non-escaped $.
Following this discovery, I managed to fix this part of the grammar completely! (at least I didn't find a way to break it, and I tried pretty hard).

Copy link
Member

@joshgoebel joshgoebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost. :)

@joshgoebel
Copy link
Member

@mlaily Thanks for all the hard work!

@joshgoebel joshgoebel merged commit 026a71f into highlightjs:main Oct 7, 2021
@mlaily
Copy link
Contributor Author

mlaily commented Oct 7, 2021

@mlaily Thanks for all the hard work!

And thanks for the long review and for being patient with me!

@mlaily mlaily mentioned this pull request Oct 11, 2021
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(F#,fsharp) Wrong tokenization
2 participants