Line length should consider Unicode character width #1197
👍 and I'd be willing to try to implement it if the maintainers agree they want the feature.
Related, @ maintainers: would you also want zero-width characters to not add to line length?
This makes sense to me in general, but I'm not familiar with the Unicode subtleties involved here. I think the relevant question for computing line length is how good, Unicode-aware code editors display these characters. So if zero-width characters are actually normally displayed as zero-width, we should count them as zero-width. I'm also concerned about whether this change could affect performance for the common case of having only normal-width characters.
This is partially a chicken-and-egg problem: in a culture where it would make sense to use lots of fullwidth characters, a lack of tooling support might nudge people toward making their code base more "common" in the sense of "more American".
Still, it doesn't change the fact that even in an Asian Python code base there will be lots of lines with only ASCII characters. We could, as a first mitigation, scan the line for whether it contains any multibyte characters at all. Maybe I can even optimize it so that only the multibyte characters are checked; I'm not sure whether that will really be faster. Another optimization could be to not check each character individually but only larger chunks, so we could build a regexp range. I'll go research Unicode so I can give better feedback on this.
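A minimal sketch of that fast-path idea, assuming `str.isascii()` as the cheap pre-check (the function names here are illustrative, not Black's actual internals):

```python
import unicodedata

def line_width(line: str) -> int:
    # Fast path: pure-ASCII lines (the overwhelmingly common case)
    # need no Unicode property lookups at all.
    if line.isascii():
        return len(line)
    # Slow path: count East Asian Wide/Fullwidth characters as two columns.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in line
    )
```

Whether the branch actually pays off would have to be measured, but `str.isascii()` is implemented in C and should be much cheaper than per-character property lookups.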
(Sorry to anyone annoyed by my multi-replies; I do it so people can vote on them separately.)
That differs widely. As I understand the Black philosophy, I assume the Black answer would be "Anyone who uses such characters should just get editors that can deal with them properly." However, I personally tend to have a more lenient "let them configure stuff" attitude. Sometimes all the editors that manage to get the one feature right are lacking lots of other features. And support for Asian-specific bugs is (in my view) very lacking in lots of free-software projects, so I think it would not be helpful to present users with yet another case of "your problem almost could have been solved, but…".
Thanks for the discussion. Some of my own observations:
Support for EAW (i.e. single vs. double width) is generally quite good. All GUI (including web) editors handle this very well. CLI editors generally rely on the terminal's support, which is also quite universal these days, since it affects emoji as well.

Support varies for characters with ambiguous or unspecified width, with most engines displaying them as single-width (even zero-width characters). But IMO this is not a problem from a practical standpoint. People tend to care less (at least I do) when the line wraps too early (i.e. breaks at 86 instead of 88 characters), and it's unlikely for a source file to have a lot of zero-width characters in its strings[*]. EAW characters, however, make the line wrap too late, producing extremely long lines that cause line overflow or soft wrapping, both very noticeable and with significant usability issues.

[*]: Non-printing characters are not a good idea anyway, since they are inherently hard to spot, which makes maintenance difficult. Using escape codes instead is generally advised.

Combining characters should, in an ideal world, be counted as zero-width when they actually combine, but they are more difficult since the width depends on context. And again, this is less of a problem in practice, since it only causes early wraps, not late ones.
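For reference, the EAW categories discussed here can be inspected directly with the stdlib `unicodedata` module; a small demo:

```python
import unicodedata

# East_Asian_Width categories for a few representative characters:
#   Na = narrow, W = wide, F = fullwidth, H = halfwidth
for ch in "a。Ａｱ글":
    print(repr(ch), unicodedata.east_asian_width(ch))
```

The remaining categories, `A` (ambiguous) and `N` (neutral), are the ones whose rendering varies between engines as described above.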
I understand what you’re trying to say, but this expression is slightly, um, annoying (avoiding the r word here). Hopefully you don’t really mean our native languages are not as “normal” as English? 🙂
If we simplify the problem to the subset relevant for real-life use cases, and…
… simplify further to accept some amount of too-early line breaks, I guess we'll end up with a relatively small set of character-range expressions and an almost trivial implementation as a regexp or filtered list comprehension, whichever turns out more performant. Next we'd need to decide which real-life codebases we want to use for the performance testing.
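A rough illustration of the "small set of character ranges" idea, using a handful of well-known CJK blocks in a regexp character class (the ranges are illustrative, not an exhaustive or authoritative wide-character list):

```python
import re

# Treat a few major CJK blocks as double-width, everything else as
# single-width. Deliberately incomplete; a real implementation would
# derive the ranges from the Unicode East_Asian_Width data file.
WIDE = re.compile(
    "["
    "\u1100-\u115f"   # Hangul Jamo (leading consonants)
    "\u2e80-\u303e"   # CJK radicals and punctuation
    "\u3041-\u33ff"   # kana, CJK symbols
    "\u3400-\u4dbf"   # CJK Extension A
    "\u4e00-\u9fff"   # CJK Unified Ideographs
    "\uac00-\ud7a3"   # Hangul syllables
    "\uf900-\ufaff"   # CJK Compatibility Ideographs
    "\uff01-\uff60"   # Fullwidth forms
    "]"
)

def width(line: str) -> int:
    # Every wide character adds one extra column on top of len().
    return len(line) + len(WIDE.findall(line))
```

A single compiled regexp pass like this is the kind of chunked check mentioned above: no per-character Python-level loop is needed.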
I was just bitten by this issue with combining characters. I assume this happens with languages that have accents without corresponding precomposed accented symbols. In this case, while looking to implement missing symbols from the IBM 3270 terminal, I implemented numbers with dotted overlines that are sometimes displayed in its status line with the "combining four dots above" symbol (U+20DC). In the example below, the line is considered too long because every combined character is counted as two.

```
composed_glyphs = "Composed glyphs: ÀÉI͂o̓N̈́AͅB̊͆Ȍ͇U͈D̈ẢB̊A̋ĎA̍J̎Ȁ1⃜2⃜3⃜4⃜5⃜6⃜7⃜8⃜9⃜0⃜\n"
```

I solved it by breaking the line, but I wouldn't say it's pretty. The change to my code is in commit rbanffy/3270font@710719f. I'd love to help fix this. Is anyone working on it?
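A small sketch of why combining marks like U+20DC should count as zero columns, using the stdlib `unicodedata` category check (`visual_width` is a hypothetical helper, not Black's code):

```python
import unicodedata

def visual_width(text: str) -> int:
    # Combining marks (general category "M*", e.g. U+20DC COMBINING
    # FOUR DOTS ABOVE) render on top of the preceding character, so
    # they add no visual width of their own.
    return sum(
        0 if unicodedata.category(ch).startswith("M") else 1
        for ch in text
    )

s = "1\u20dc2\u20dc3\u20dc"
print(len(s), visual_width(s))  # 6 code points, 3 visual columns
```

Counting by code points, as in the report above, makes such a line look twice as wide as it actually renders.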
In case anyone was hoping I'd pick this up (I was very interested back then): nope, I've since ditched Black.
I tried to address this (partly) in #3445. However, I am not sure whether my approach is appropriate and whether I implemented things properly. I would appreciate it if someone could take a look at my patch. Thanks!
This patch changes the preview style so that string splitters respect the Unicode East Asian Width[^1] property. If you are not familiar with CJK languages, this may not be immediately clear, so let me elaborate with some examples.

Traditionally, East Asian characters (including punctuation) take up twice the space of European letters and stops when rendered in a monospace typeface. Compare the following characters:

```
abcdefg.
글、字。
```

The characters on the first line are half-width, and those on the second line are full-width. (Also note that the last character with a small circle, the East Asian period, is also full-width.) Therefore, if we want to prevent those full-width characters from exceeding the maximum columns per line, we need to count their *width* rather than the number of characters. Again, consider the following characters:

```
글、字。
```

These are just 4 characters, but their total width is 8. Suppose we want to maintain up to 4 columns per line with the following text:

```
abcdefg.
글、字。
```

How should it look then? We want it to look like:

```
abcd
efg.
글、
字。
```

However, Black currently turns it into this:

```
abcd
efg.
글、字。
```

That's because Black currently counts the number of characters in the line instead of measuring their width. So how could we measure the width? How can we tell if a character is full- or half-width? What if half-width and full-width characters are mixed in a line? That's why Unicode defined an attribute named `East_Asian_Width`: Unicode grouped every single character according to its width in fixed-width typesetting.

This partially addresses #1197, but only for string splitters. The other parts need to be fixed in future patches.

This was implemented by copying Rich's approach to handling wide characters: generate a table using wcwidth, check it into source control, and use it to drive helper functions in Black's logic.
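As a rough sketch of the behavior the patch describes (using stdlib `unicodedata` instead of the pregenerated wcwidth-style table, and a hypothetical `wrap` helper), the width-4 example above could be reproduced like this:

```python
import unicodedata

def char_width(ch: str) -> int:
    # East_Asian_Width "W" (wide) and "F" (fullwidth) take two columns.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def wrap(text: str, max_width: int) -> list:
    # Greedy width-aware wrapping: start a new line whenever adding the
    # next character would exceed max_width columns.
    lines, cur, cur_w = [], "", 0
    for ch in text:
        w = char_width(ch)
        if cur and cur_w + w > max_width:
            lines.append(cur)
            cur, cur_w = "", 0
        cur += ch
        cur_w += w
    if cur:
        lines.append(cur)
    return lines
```

With `max_width=4`, `"abcdefg."` wraps into two 4-character lines, while `"글、字。"` wraps into two 2-character (4-column) lines, matching the desired output shown above.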
This gets us the best of both worlds: accuracy and performance (and lets us update as per our stability policy too!). Co-authored-by: Jelle Zijlstra <[email protected]>
Is your feature request related to a problem? Please describe.
In some cases one would want to put non-ASCII characters in source. This is most common
for strings, but in Python 3 you can use non-ASCII identifiers as well.
But consider the following example:
The `default` line is visually long, and I’d like to put it on its own line. But Black would insist on formatting it into a single line, since it thinks there is enough space (indeed, the formatted line is only 73 characters long).
Describe the solution you'd like
When calculating line lengths, Black should consult the `East_Asian_Width` attribute of each character, and count double width for characters with EAW `F` or `W`.
Describe alternatives you've considered
Having strictly single-width characters in source, and pulling in double-width strings at runtime (via i18n or `\u` escape sequences). The translation system is usually overkill in many cases (e.g. developing an in-house CLI where all users speak a common East Asian language), and replacing all strings with `\u` sequences makes the source illegible to humans. Using `fmt: on/off` quickly becomes tedious and hard on the eyes when you have a lot of strings scattered around (e.g. for logging purposes).