Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

combining emoji aren't correctly sized #1329

Open
dankamongmen opened this issue Feb 5, 2021 · 7 comments
Open

combining emoji aren't correctly sized #1329

dankamongmen opened this issue Feb 5, 2021 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@dankamongmen
Copy link
Owner

We don't seem to get correct sizing for combining emoji. Take, for instance, "dark-toned woman mechanic" aka πŸ‘©πŸΎβ€πŸ”§ aka U+1F469 WOMAN + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 + U+200D ZERO WIDTH JOINER + U+1F527 WRENCH aka πŸ‘©+πŸΎβ€+πŸ”§. This shows up as three wide glyphs in xfce4-terminal and st, and a single wide glyph in kitty, so that's one thing to deal with (kitty is the correct one here). ncwidth follows dumbly along with classic wcwidth():

[schwarzgerat](0) $ ./ncwidth πŸ‘©πŸΎβ€πŸ”§
0x1f469: 2 πŸ‘©	0x1f3fe: 2 🏾	0x0200d: 0 ‍	0x1f527: 2 πŸ”§	

 total width: 6  total bytes: 15

πŸ‘©πŸΎβ€πŸ”§
012345
[schwarzgerat](0) $ 

so we're going to count this as six columns wide in all cases. This is probably related to why we get the clear area of spillover in mojibake when running under kitty:

2021-02-04-212711_802x1417_scrot

so (assuming wcswidth() gets this wrong, also -- check that out), we need to either (a) start parsing up a bunch of extra Unicode data files ourselves (as the jquast/wcwidth project does, see jquast/wcwidth#39), or (b) send patches to the relevant libcs (if this behavior is even compatible with the ANSI C wcswidth() definition), or (c) find a suitably fast wcswidth() alternative (ideally with a native C interface).

@dankamongmen dankamongmen added the bug Something isn't working label Feb 5, 2021
@dankamongmen dankamongmen added this to the 2.3.0 milestone Feb 5, 2021
@dankamongmen dankamongmen self-assigned this Feb 5, 2021
@dankamongmen
Copy link
Owner Author

I reached out to @jquast to see if he would find such a thing to be of use. I also reached out to Daniel Lemire, author of simdjson, to see whether he's done anything in this space -- I would likely use simdjson-pioneered techniques to implement this, if it comes down to me.

@kovidgoyal
Copy link

FYI kitty already has code to automatically generate wcswidth from the unicode standard. It's been on my todo list to factor that out into its own C library, but never been motivated enough. I cant say I think wcswidth is enough of a bottleneck to bother with SIMD though. But hey if you want to do that, I wont stop you :)

@dankamongmen
Copy link
Owner Author

i have been rudely surprised by the discovery that wcwidth() is a POSIX function, not an ANSI C one, which makes sense when you think about it, but is rather inconvenient. I'm not much impacted by its absence on GNU Hurd, but the lack of wcwidth() on Windows is going to be an official Problem.

on the other hand, that opens the gates for a portable, high-quality implementation. and this, unlike notcurses proper, ought feel free to reach into font tables and other grime if/as necessary.

additionally, this could carry information about glyphs which have different behavior in different terminals, and thus encourage use of a cursor location report. just throwing out ideas. either way, we're gonna need to do better on Windows than our current

#define wcwidth(x) 1

heh

@jquast
Copy link

jquast commented Aug 5, 2021

I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

I’d really like to try to make wcwidth generate C code by end of year and be a drop-in replacement for existing uses of the library. also, I’d like to introduce a new easier API function that doesn’t return -1, yikes.

@dankamongmen
Copy link
Owner Author

I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

so i did some basic calculations assuming a straight up flat uint32_t-indexed O(1) data structure, the fastest thing possible (ignoring cache effects for now).

  • 17 * 2^16 == 1114112 (1Mi + 64Ki), so less than 2MiB at a byte per codepoint
  • it would be nice to have some properties encoded along with the width
  • there are six bidi properties, but bidi's rare enough that maybe it's better to just have 1 bit devoted to "bidi-affected", and go look that up in another structure. these are otherwise independent, so you need at least 6 bits.
  • 2 bits for numeric type (4 classifications)
  • 4 types of boundary, not sure about the relation between them but probably somewhat independent
  • whitespace is a single bit
  • 4 bits for punctuation classes

assuming an arbitrarily-aligned lookup table and 64-byte cachelines, you're gonna be able to load up all of ascii in 2 lines iff you do a byte per codepoint. the BMP alone would be 1Ki cachelines, probably blowing out a single core's L1. i don't think you can reasonably go below 3 bits for width, and i really think 4 is a better idea. so can you even get to a byte? not unless you really want to cut out other properties. so figure at least a byte, maybe 2 per codepoint.

but, we only actually use 5 of 17 planes, so with a O(1) int->int map there we get an offset for our page within a structure leaving out the 12 unused planes. now we're talking 640KiB at a byte per. now we're talking 5120 cache lines for all of unicode rather than 17408.

it would be best if we were not computing any of this table at startup, so that everything can be demand-paged, and ideally we never use much of the table at all.

@kovidgoyal
Copy link

Dont use a table. At least for width there are vast ranges of the space
that all have the same width value. And the overwhelmingly common case
is using simple ascii chars with width 1. So a switch with with a if for
the ascii case is the most efficient implementation, given that branch
prediction will rarely miss. This is actually true of most unicode
properties so in kitty I just use switch with if for common cases for
pretty much all unicode properties I care about.

@dankamongmen
Copy link
Owner Author

whatever is done, i want an API that i can feed a sequence of utf-8 or utf-32 and have segmentation occur along with column approximation, and furthermore it needs be reentrant in the sense that

  • i get a bit back telling me whether this is absolutely the end of the EGC (implying if not set that it matches the prefix of some longer EGC)
  • i can call with such a prefix plus new data (or ideally just the new data) and get the new combined EGC, or a new EGC based off the new data only, so another bit here (or api design around it, etc)

@dankamongmen dankamongmen removed this from the 3.0.0 milestone Oct 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants