Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic Tokens - Reduce set of non-standard token types #13066

Closed
frankplow opened this issue Aug 19, 2022 · 29 comments · Fixed by #14777
Closed

Semantic Tokens - Reduce set of non-standard token types #13066

frankplow opened this issue Aug 19, 2022 · 29 comments · Fixed by #14777
Assignees
Labels
A-highlighting (semantic) token highlighting A-lsp LSP conformance issues and missing features C-enhancement Category: enhancement

Comments

@frankplow
Copy link

frankplow commented Aug 19, 2022

At present, a built-in type is conveyed to the client using a token with type builtinType. This is not in the standard set of semantic token types and therefore requires configuration in the client. The same semantic idea can be conveyed using a token with type type and the defaultLibrary modifier, both of which are part of the standard set of token types and modifiers.

@frankplow frankplow changed the title Semantic Tokens - builtInType should be composed of type and defaultLibrary rather than being its own type Semantic Tokens - builtinType should be composed of type and defaultLibrary rather than being its own type Aug 19, 2022
@frankplow
Copy link
Author

There are actually a number of custom types which I think could be replaced with standard token types without losing meaning. I think this is a holdover from the deprecated semantic highlighting LSP extension, which had no set of standard types and did not support modifiers. To preserve meaning when replacing types with standard types, non-standard modifiers could be used as in this case the client can configure sensible fallbacks for the highlighting. Distributions of rust-analyzer, such as the VSCode plugin, generally provide configurations for these custom types, hence this is not a massive issue, however maintenance of these distributions could be simplified and creation of new distributions for different clients made easier by slimming down the set of custom token types.

Here are a few suggestions (type + modifiers):

  • attribute -> decorator; these may not have quite identical meanings, but given that Rust only has attributes this should serve to provide some sensible default highlighting for attributes.
  • builtinAttribute -> decorator + defaultLibrary
  • derive -> decorator + derive
  • deriveHelper -> decorator + derive
  • constParameter -> parameter + readonly
  • trait -> interface; not too sure about this one
  • typeAlias -> type
  • arithmetic -> operator + arithmetic
  • bitwise -> operator + bitwise
  • comparison -> operator + comparison
  • logical -> operator + logical
  • attributeBracket -> punctuation + attributeBracket
  • angle -> punctuation + angleBracket
  • brace -> punctuation + curlyBracket
  • parenthesis -> punctuation + parenthesis
  • colon -> punctuation + colon
  • comma -> punctuation + comma
  • dot -> punctuation + period
  • semi -> punctuation + semicolon
  • macroBang -> punctuation + exclamationMark; or made part of its corresponding macro, I don't think many want to highlight this differently to the preceding word.
  • selfKeyword -> parameter + self
  • selfTypeKeyword -> type + self

I am happy to implement this, but think it is worth getting some consensus on which changes should be made first. If you have any comments on the suggestions made above or your own suggestions, please add them. The full set of current token types can be found here.

@frankplow frankplow changed the title Semantic Tokens - builtinType should be composed of type and defaultLibrary rather than being its own type Semantic Tokens - Reduce set of non-standard token types Aug 19, 2022
@flodiebold
Copy link
Member

Built-in types are not the same thing as standard library types, which is what defaultLibrary indicates.

@frankplow
Copy link
Author

A non-standard builtin modifier could be introduced then?

  • builtinType -> Type + builtin
  • builtinAttribute -> decorator + builtin

@Veykril
Copy link
Member

Veykril commented Aug 19, 2022

I fail to see why adding a ton of modifiers would be a better solution than what we currently have. That is, why exactly are non-standard modifiers over non-standard types better in your eyes?

On another note, yes we should be using the standard types where it makes sense, we already use interface for traits as that is basically the equivalent in rust. decorator didn't exist when we introduced the attribute type, so that one I think certainly makes sense to use instead now.

@Veykril Veykril added the A-highlighting (semantic) token highlighting label Aug 19, 2022
@frankplow
Copy link
Author

frankplow commented Aug 19, 2022

@Veykril

Why exactly are non-standard modifiers over non-standard types better in your eyes?

Most clients implement fallbacks from unrecognised modifers to the un-modified type. For example, type + builtin would fall back to type, which is what I suspect most users would desire anyway, whereas builtinType has nothing to fall back to and would not be semantically highlighted.

@Veykril
Copy link
Member

Veykril commented Aug 19, 2022

Hmm I see, the thing is we are limited in how many token modifiers we can have, which is currently 32 due to the bitset requirement. It also feels like the wrong usage of token modifiers to me and instead signals a lack of the language server protocol in that we really need to be able to specify a hierarchy of token types which I'd much rather see. VSCode itself already supports this, but the LSP unfortunately doesn't.

@Veykril Veykril added the A-lsp LSP conformance issues and missing features label Aug 19, 2022
@frankplow
Copy link
Author

A point to be made is that the feature is intended to convey highlighting information, rather than a full semantic description of the document. The server therefore has to make some decisions as to what constitutes sensible highlighting. Personally I can't see many use cases in which a user wants to highlight, for example, a pair of parentheses differently to a pair of square brackets. Maybe the approach to take is an overall slimming down of the information sent to the client?

(NB: #12783 reached a similar conclusion)

@Veykril
Copy link
Member

Veykril commented Aug 22, 2022

A point to be made is that the feature is intended to convey highlighting information, rather than a full semantic description of the document

I don't really see the point made here, full semanticdescription is highlighting information so what the server does seems like a correct choice to me.

Personally I can't see many use cases in which a user wants to highlight, for example, a pair of parentheses differently to a pair of square brackets.

That I can't really tell you either, I am not using the different punctuation tags, others are as they were requested, see #6152 and #8279.

(NB: #12783 reached a similar conclusion)

I am all in favor of making more parts of this infra configurable, since there are a lot of tags that are not useful for the general userbase and a lot of clients seem to struggle with too many tokens being sent. But overall I would dislike us removing features from our syntax highlighting fully as there are clearly people that want these.

And to reiterate regarding the main part of this issue, I do not think restricting ourselves to the standard token types is a good idea, for one as outlined we can only have up to 32 modifiers and secondly there is no standard token set that describes all the languages (which is why extensibility by the protocol is even a thing).

@frankplow
Copy link
Author

frankplow commented Aug 22, 2022

Sorry maybe I didn't articulate myself too well, all I meant to say is that the specification is quite restrictive and so the server must make compromises.

I don't think that the set of token types needs to be limited to only the standard set, just reduced. As you say, the set of standard types is not a one-size-fits-all solution and there are certain Rust features such as lifetimes which I think warrant their own types. At the moment however, rust-analyzer provides 35 non-standard types. For comparison, clangd provides 2 and pylance provides 4.

A lot of these non-standard types come from the distinct operator and punctuation types which I didn't realise have come from feature requests so yes I agree they should not be removed. I think server configuration options are a good compromise of flexibility and ease-of-use.

To illustrate why I think having so many non-standard token types is somewhat of an issue, here is a snippet of my coc configuration:

vim.api.nvim_command('hi link CocSemAngle CocSemPunctuation')
vim.api.nvim_command('hi link CocSemArithmetic CocSemOperator')
vim.api.nvim_command('hi link CocSemAttribute CocSemDecorator')
vim.api.nvim_command('hi link CocSemAttributeBracket CocSemDecorator')
vim.api.nvim_command('hi link CocSemBitwise CocSemOperator')
vim.api.nvim_command('hi link CocSemBoolean Boolean')
vim.api.nvim_command('hi link CocSemBrace CocSemPunctuation')
vim.api.nvim_command('hi link CocSemBracket CocSemPunctuation')
vim.api.nvim_command('hi link CocSemBuiltinAttribute CocSemAttribute')
vim.api.nvim_command('hi link CocSemBuiltinType CocSemType')
vim.api.nvim_command('hi link CocSemCharacter CocSemString')
vim.api.nvim_command('hi link CocSemColon CocSemPunctuation')
vim.api.nvim_command('hi link CocSemComma CocSemPunctuation')
vim.api.nvim_command('hi link CocSemComparison CocSemOperator')
vim.api.nvim_command('hi link CocSemConstParameter CocSemParameter')
vim.api.nvim_command('hi link CocSemDerive CocSemAttribute')
vim.api.nvim_command('hi link CocSemDeriveHelper CocSemAttribute')
vim.api.nvim_command('hi link CocSemDot CocSemPunctuation')
vim.api.nvim_command('hi link CocSemEscapeSequence CocSemString')
vim.api.nvim_command('hi link CocSemFormatSpecifier CocSemString')
vim.api.nvim_command('hi link CocSemGeneric CocSemUnknown')
vim.api.nvim_command('hi link CocSemLabel CocSemMacro')
vim.api.nvim_command('hi link CocSemLifetime CocSemTypeParameter')
vim.api.nvim_command('hi link CocSemLogical CocSemOperator')
vim.api.nvim_command('hi link CocSemMacroBang CocSemMacro')
vim.api.nvim_command('hi link CocSemParenthesis CocSemPunctuation')
vim.api.nvim_command('hi clear CocSemPunctuation')
vim.api.nvim_command('hi link CocSemSelfKeyword CocSemParameter')
vim.api.nvim_command('hi link CocSemSelfTypeKeyword CocSemType')
vim.api.nvim_command('hi link CocSemSemicolon CocSemPunctuation')
vim.api.nvim_command('hi link CocSemTypeAlias CocSemType')
vim.api.nvim_command('hi link CocSemToolModule CocSemNamespace')
vim.api.nvim_command('hi link CocSemUnion CocSemStruct')
vim.api.nvim_command('hi link CocSemUnresolvedReference Warning')

Without all this, large areas of the document do not get highlighted correctly, as the token types are unrecognised. Defaults similar to this could be provided by coc-rust-analyzer, but having to provide distributions like this for every LSP client is a lot of work.

@Veykril
Copy link
Member

Veykril commented Aug 22, 2022

Ye, I didn't quite express myself either properly I think. I'll look into this a bit more and fix up a PR for settings regarding "unspecializing" punctuations and operators, and making use of decorator for attributes for now. Given from your list, the operators and punctuation ones are the majority of things it seems, and the easiest to fix. Will come back to this afterwards.

@frankplow
Copy link
Author

frankplow commented Aug 22, 2022

@Veykril decorator will need this to be merged first I think: gluon-lang/lsp-types#246, also that repo still hasn't been updated to make 3.17 stable so it's behind a configuration flag.

EDIT: also operator is duplicated between the non-standard types and standard types at the minute. Not sure if there's really an issue with this (actually doing this may be an alternative if lsp-types does not update soon) but it could potentially be removed from the non-standard set here.

@Veykril
Copy link
Member

Veykril commented Aug 22, 2022

Ye, that's not an issue. We can just duplicate decorator for now like its done with operator. While we are at it we can also remove the operator dupe then.

@Veykril
Copy link
Member

Veykril commented Aug 22, 2022

#13084 now adds some config for the mentioned ones (+ the decorator change) with some disabled by default.

@Veykril
Copy link
Member

Veykril commented Aug 22, 2022

Okay, so with those settings in mind, it boils down to the following custom types we have:

ATTRIBUTE_BRACKET - the #[ ] parts of attributes
BUILTIN_ATTRIBUTE - builtin attributes #[cold]
BUILTIN_TYPE - builtin types usize
CHAR - 'c'
CONST_PARAMETER - const param defs, and references
DERIVE - derive attribute references
DERIVE_HELPER - derive helper references
ESCAPE_SEQUENCE - string escapes
FORMAT_SPECIFIER - format! specifiers
GENERIC - things that have no proper tag (this is a curious case, do we even emit these in the end? why would we, this seems like an implementation detail where we shouldn't emit a tag at all maybe?)
LABEL - loop labels
LIFETIME - lifetimes
SELF_KEYWORD - self when used as the module or local (this should actually be 2 things even imo)
SELF_TYPE_KEYWORD - Self
TYPE_ALIAS - type alias
TOOL_MODULE - tool modules, unstable feature. #[rustfmt::foo] is such a tool module
UNION - union references (how is this not a standard token type???)
UNRESOLVED_REFERENCE - unresolved references (this one is simple, we should make this configurable)

the builtin ones I think we can turn around with a modifier as you proposed, having builtin be a modifier (since defaultLibrary has a different meaning).
UNRESOLVED_REFERENCE should just have a config to disable it.
GENERIC I'll have to check what the use of it is, I'm unsure right now but we might use it to explicitly overwrite highlights in macro calls.
ATTRIBUTE_BRACKET should probably be moved into punctuations, in which case it can be disabled (though I think it should have a separate config as well, like I made for the macro call bang).
derive and derive helper could probably be merged (the split doesn't seem to be too useful after all), though I am unsure whether turning that into a modifier is nice, attributes would then be the only thing tagged with this.
Rest I would need to think about a bit more.

Using modifiers to specialize one concrete entity (like the derive case with attributes) seems like the wrong usage of modifiers to me which is why I am so stubborn here :D But we also don't really have a lot of alternatives right now in this regard unfortunately. I'll open an issue on https://github.com/microsoft/language-server-protocol in a bit in regards to the problems with semantic highlighting we have here.

@lnicola
Copy link
Member

lnicola commented Aug 22, 2022

I think like a limited set of custom types (with extra modifiers, in the CSS sense, since LSP is somewhat ambiguous) more. Do we really need both CHAR and STRING? Can't Self be just a TYPE? TOOL_MODULE could be DECORATOR, UNION could be ENUM, and so on.

I'm more surprised that LABEL isn't a standard type.

@Veykril
Copy link
Member

Veykril commented Aug 22, 2022

Can't Self be just a TYPE?

Well, I color Self as a keyword, while I imagine others might wanna color it like a type. The reason why I think having a lot of different tags is beneficial is because some people want to specifically color certain things in various ways (the punctutation and operator variations are a good indicator I think). The modifier approach could work, but going with that route we will easily hit the 32 modifier limit in the near future I imagine.

@frankplow
Copy link
Author

I think of Self as a type and self as a parameter/module depending on context. Maybe a self or special modifier could be added - I think these are important enough to get their own modifier, and it's not quite narrowing down to only a single language construct. It would allow either the user to use either the keyword highlighting or the type/parameter/module highlighting, as well as providing a distinction between self as a parameter and self as a module which is not available at the moment.

The LSP set has some really suprising omissions, even boolean isn't standard.

bors added a commit that referenced this issue Aug 23, 2022
Add some more highlighting configurations

The following can be enabled/disabled now in terms of highlighting:
- doc comment injection (enabled by default)
- punctuation highlighting (disabled by default)
- operator highlighting (enabled by default)
- punctuation specialized highlighting (disabled by default)
- operator specialized highlighting (disabled by default)
- macro call bang highlighting (disabled by default)

This PR also changes our `attribute` semantic token type to the `decorator` type which landed upstream (but not yet in lsp-types).

Specialized highlighting is disabled by default, as all clients will have to ship something to map these back to the standard punctuation/operator token (we do this in VSCode via the inheritance mapping for example). This is a lot of maintenance work, and not something every client wants to do, pushing that need to use the user. As this is a rather niche use in the first place this will just be disabled by default.

Punctuation highlighting is disabled by default, punctuation is usually something that can be done by the native syntactic highlighting of the client, so there is no loss in quality. The main reason for this though is that punctuation adds a lot of extra token data that we sent over, a lot of clients struggle with applying this, so disabling this improves the UX for a lot of people. Note that we still highlight punctuations with special meaning as that special entity, (the never type `!` will still be tagged as a builtin type if it occurs as such)

Separate highlighting of the macro call bang `!` is disabled by default, as I think people actually didn't like that change that much, though at the same time I feel like not many people even noticed that change (I prefer it be separate, but that's not enough reason for it to be enabled by default I believe :^) )

cc #12783 #13066
@Veykril
Copy link
Member

Veykril commented Aug 23, 2022

The standard types in the LSP are less than bare bones tbh yes ...

I guess going with the modifiers is the best we can do as I don't think a viable alternative will appear within the coming years, if we run into the limit at some point then so be it but I assume for now we should be able to fit all the things with some room to spare(given we ignore the variations of punctuation and operators).

I love semantic highlighting but I feel like the protocol vastly limited its own functionality for no apparent reason.

@ghost
Copy link

ghost commented Sep 30, 2022

Hey guys, don't know if I should open up a separate issue for this, but currently, there's no way to distinguish macro!() calls and the macro_rules! bang ("!") at the end of the macro_rules, which itself is of type keyword. This makes for weird highlighting for macro_rules! definitions.

@Veykril
Copy link
Member

Veykril commented Sep 30, 2022

That sounds like it deserves a separate issue, so please open one for that (also please elaborate a bit on what exactly you want there, I am not quite sure I completely understand)

@jhpratt
Copy link
Member

jhpratt commented Nov 23, 2022

Is the limit of 32 modifiers per token or total? Separately, is the limit on r-a's end or LSP's?

In my opinion switching to have lots of modifiers would be the best solution, pending a proper hierarchy on the LSP side of things. If it's possible to implement it for just VS Code, that could work too (though I am biased as I use VS Code).

@oxalica
Copy link
Contributor

oxalica commented Nov 23, 2022

@jhpratt

Is the limit of 32 modifiers per token or total? Separately, is the limit on r-a's end or LSP's?

LSP Spec uses bitset for modifiers but doesn't explicitly mention the max number of modifiers. But LSP's integer is defined to be i32, so 32 (or 31?) is the technical limit. See Semantic Tokens and Base Types - integer.

@Veykril
Copy link
Member

Veykril commented Nov 23, 2022

Note that VSCode shouldn't have an issue with unset token types not being colored, since VSCode natively supports token type hierarchy as a fallback (which we are populating).

@jhpratt
Copy link
Member

jhpratt commented Nov 23, 2022

Hmm...LSP uses TypeScript for its implementation? JS has bigint support natively, so theoretically it could be unbounded if that were used. Not sure if that's a change they'd be interested in making, however.

When looking to see what tokens had what semantics attached yesterday, I did notice that TextMate scopes were quite a bit more detailed in some cases. Perhaps if there was some way to tell the editor to override semantic highlighting for certain TextMate scopes, that could work as well?

And for the token type hierarchy, is there any way to take advantage of that today?

@Veykril
Copy link
Member

Veykril commented Nov 23, 2022

When looking to see what tokens had what semantics attached yesterday, I did notice that TextMate scopes were quite a bit more detailed in some cases.

Can you elaborate on that? In general semantic tokens can be more precise than textmate scopes (as the latter is just regex), if there are some provided by r-a where that is not the case I would consider that a bug.

And for the token type hierarchy, is there any way to take advantage of that today?

If you are using VSCode you are by default taking advantage of it, though we might be missing some hierarchy definitions (the hierarchy can also specifically delegate back to textmate scope fwiw).

@oxalica
Copy link
Contributor

oxalica commented Nov 24, 2022

Hmm...LSP uses TypeScript for its implementation? JS has bigint support natively, so theoretically it could be unbounded if that were used. Not sure if that's a change they'd be interested in making, however.

No. We should follows the protocol definition in text, not any JS or TS implementation. Going beyond the protocol will break support of other clients that strictly following the protocol (like, already using i32 in Rust clients), which is bad for the ecosystem.

IMO, we don't really need 32 modifiers. Semantic tokens are not about distinguish each individual tokens, and themes don't usually provide these many different colors. There are currently too many punctuation kinds.

@Veykril
Copy link
Member

Veykril commented Nov 24, 2022

We should just have a setting that makes r-a map the custom types to the standard ones I think, and then probably make that enabled by default. It is a shortcoming of the LSP that non-standard tokens are this painful to use in clients that don't have this hierarchy concept I think. It's a shame given how useful semantic tokens can be.

@jhpratt
Copy link
Member

jhpratt commented Nov 27, 2022

Can you elaborate on that? In general semantic tokens can be more precise than textmate scopes (as the latter is just regex), if there are some provided by r-a where that is not the case I would consider that a bug.

Even targeting specific keywords is an area where TextMate has more detail. struct is keyword.other.struct.rust for TextMate, while r-a just classifies it as keyword.other.rust. While I didn't intend on targeting this specific keyword, I did want to at least experiment with it for a couple others.

No. We should follows the protocol definition in text

Yeah, I was suggesting that LSP could make a change, as it doesn't appear specified anywhere.

themes don't usually provide these many different colors.

At least in my case, that's not the concern. I'm more interested in changing the color to match something else.

@Veykril
Copy link
Member

Veykril commented Nov 28, 2022

Even targeting specific keywords is an area where TextMate has more detail

Then we are missing those details, we don't necessarily add new token types unless someone needs it, as adding a type for each keyword would certainly be too much eventually.

themes don't usually provide these many different colors

Also regarding this, standard themes don't provide colors for these specific tokens usually. That doesn't mean they aren't useful. The punctuations were requested by someone, because they want to make use of them. Likewise a lot of these more specific token types I personally use in my own theme because I want to leverage the more precise semantic token types. (Nevertheless I see the problem with non-standard token types which is why I think we should just add what I said earlier now, We should just have a setting that makes r-a map the custom types to the standard ones I think, and then probably make that enabled by default.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-highlighting (semantic) token highlighting A-lsp LSP conformance issues and missing features C-enhancement Category: enhancement
Projects
None yet
6 participants