combining emoji aren't correctly sized #1329

dankamongmen · 2021-02-05T02:32:03Z

We don't seem to get correct sizing for combining emoji. Take, for instance, "dark-toned woman mechanic" aka 👩🏾‍🔧 aka U+1F469 WOMAN + U+1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5 + U+200D ZERO WIDTH JOINER + U+1F527 WRENCH aka 👩+🏾‍+🔧. This shows up as three wide glyphs in xfce4-terminal and st, and a single wide glyph in kitty, so that's one thing to deal with (kitty is the correct one here). ncwidth follows dumbly along with classic wcwidth():

[schwarzgerat](0) $ ./ncwidth 👩🏾‍🔧
0x1f469: 2 👩	0x1f3fe: 2 🏾	0x0200d: 0 ‍	0x1f527: 2 🔧	

 total width: 6  total bytes: 15

👩🏾‍🔧
012345
[schwarzgerat](0) $

so we're going to count this as six columns wide in all cases. This is probably related to why we get the clear area of spillover in mojibake when running under kitty.

so (assuming wcswidth() gets this wrong, also -- check that out), we need to either (a) start parsing up a bunch of extra Unicode data files ourselves (as the jquast/wcwidth project does, see jquast/wcwidth#39), or (b) send patches to the relevant libcs (if this behavior is even compatible with the ANSI C wcswidth() definition), or (c) find a suitably fast wcswidth() alternative (ideally with a native C interface).
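To make the gap concrete, here is a minimal sketch of what option (a) would have to do for the sequence above. It is not ncwidth's or kitty's code: `cp_width()`, `naive_width()` and `cluster_width()` are hypothetical names, and the ranges are a small hand-picked subset rather than anything generated from the Unicode data files. The naive sum reproduces the six columns reported by ncwidth; the cluster-aware version counts the emoji modifier, the ZWJ, and whatever is joined by the ZWJ as zero columns, giving the two columns that kitty draws.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-codepoint width in the style of wcwidth(), for a tiny hand-picked
 * subset of Unicode (assumption: real code would generate these ranges). */
static int cp_width(uint32_t cp) {
    if (cp == 0x200D) return 0;                    /* ZERO WIDTH JOINER */
    if (cp >= 0xFE00 && cp <= 0xFE0F) return 0;    /* variation selectors */
    if (cp >= 0x1F300 && cp <= 0x1FAFF) return 2;  /* emoji blocks, approximated as wide */
    return 1;
}

/* U+1F3FB..U+1F3FF are the Fitzpatrick emoji modifiers. */
static int is_emoji_modifier(uint32_t cp) {
    return cp >= 0x1F3FB && cp <= 0x1F3FF;
}

/* Sum of per-codepoint widths: the behaviour shown in the ncwidth output. */
int naive_width(const uint32_t *cps, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += cp_width(cps[i]);
    }
    return total;
}

/* Cluster-aware width: a modifier following another codepoint, a ZWJ, and
 * the codepoint joined by a ZWJ all contribute zero columns. */
int cluster_width(const uint32_t *cps, size_t n) {
    int total = 0;
    int joined = 0;  /* set when the previous codepoint was a ZWJ */
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = cps[i];
        if (cp == 0x200D) {
            joined = 1;
            continue;
        }
        if (joined || (i > 0 && is_emoji_modifier(cp))) {
            joined = 0;
            continue;
        }
        total += cp_width(cp);
    }
    return total;
}
```

Calling both functions on `{0x1F469, 0x1F3FE, 0x200D, 0x1F527}` shows the disagreement between the wcwidth()-style sum and the single-glyph rendering.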


dankamongmen · 2021-02-05T03:00:10Z

I reached out to @jquast to see if he would find such a thing to be of use. I also reached out to Daniel Lemire, author of simdjson, to see whether he's done anything in this space -- I would likely use simdjson-pioneered techniques to implement this, if it comes down to me.

kovidgoyal · 2021-02-05T07:39:12Z

FYI kitty already has code to automatically generate wcswidth from the unicode standard. It's been on my todo list to factor that out into its own C library, but I've never been motivated enough. I can't say I think wcswidth is enough of a bottleneck to bother with SIMD though. But hey if you want to do that, I won't stop you :)

dankamongmen · 2021-08-05T04:11:05Z

i have been rudely surprised by the discovery that wcwidth() is a POSIX function, not an ANSI C one, which makes sense when you think about it, but is rather inconvenient. I'm not much impacted by its absence on GNU Hurd, but the lack of wcwidth() on Windows is going to be an official Problem.

on the other hand, that opens the gates for a portable, high-quality implementation. and this, unlike notcurses proper, ought feel free to reach into font tables and other grime if/as necessary.

additionally, this could carry information about glyphs which have different behavior in different terminals, and thus encourage use of a cursor location report. just throwing out ideas. either way, we're gonna need to do better on Windows than our current

#define wcwidth(x) 1

heh

jquast · 2021-08-05T13:35:34Z

I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

I’d really like to try to make wcwidth generate C code by end of year and be a drop-in replacement for existing uses of the library. also, I’d like to introduce a new easier API function that doesn’t return -1, yikes.

dankamongmen · 2021-08-06T21:52:39Z

> I think we should do this. I saw a C-bindings alternative release this year and it makes my heart sink a little, the posix C bindings are so awful, but not so slow! The readme there shows timeit tests of 20x improvement. https://github.com/sebastinas/cwcwidth

so i did some basic calculations assuming a straight up flat uint32_t-indexed O(1) data structure, the fastest thing possible (ignoring cache effects for now).

17 * 2^16 == 1114112 (1Mi + 64Ki), so less than 2MiB at a byte per codepoint
it would be nice to have some properties encoded along with the width
there are six bidi properties, but bidi's rare enough that maybe it's better to just have 1 bit devoted to "bidi-affected", and go look that up in another structure. these are otherwise independent, so you need at least 6 bits.
2 bits for numeric type (4 classifications)
4 types of boundary, not sure about the relation between them but probably somewhat independent
whitespace is a single bit
4 bits for punctuation classes

assuming an arbitrarily-aligned lookup table and 64-byte cachelines, you're gonna be able to load up all of ascii in 2 lines iff you do a byte per codepoint. the BMP alone would be 1Ki cachelines, probably blowing out a single core's L1. i don't think you can reasonably go below 3 bits for width, and i really think 4 is a better idea. so can you even get to a byte? not unless you really want to cut out other properties. so figure at least a byte, maybe 2 per codepoint.

but, we only actually use 5 of 17 planes, so with an O(1) int->int map there we get an offset for our page within a structure leaving out the 12 unused planes. now we're talking 320KiB at a byte per (640KiB at two). now we're talking 5120 cache lines for all of unicode rather than 17408.

it would be best if we were not computing any of this table at startup, so that everything can be demand-paged, and ideally we never use much of the table at all.
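The layout described above can be sketched roughly as follows. Everything here is hypothetical naming (`plane_slot`, `props`, `prop_offset()`, `prop_lookup()`), and the choice of planes 0 through 3 plus 14 as the five "used" planes is an assumption, since the comment does not list them. The table is left zeroed; a real build would presumably emit it as a constant initializer generated from the Unicode data, so nothing is computed at startup and untouched pages are never faulted in.

```c
#include <stdint.h>

#define PLANES      17
#define PLANE_SIZE  65536u   /* 2^16 codepoints per plane */
#define USED_PLANES 5
#define CACHELINE   64u

/* Plane -> slot map; -1 marks a plane left out of the table.
 * Assumption: planes 0-3 and 14 are the five planes in use. */
static const int8_t plane_slot[PLANES] = {
    0, 1, 2, 3, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 4, -1, -1
};

/* One byte per codepoint, e.g. low 4 bits width, high 4 bits other flags.
 * Static storage is zero-initialized; placeholder for generated data. */
static uint8_t props[USED_PLANES * PLANE_SIZE];

/* Bytes and cache lines needed for a byte-per-codepoint table. */
uint32_t table_bytes(uint32_t planes) { return planes * PLANE_SIZE; }
uint32_t table_cachelines(uint32_t planes) { return table_bytes(planes) / CACHELINE; }

/* Offset of cp within props, or -1 if cp falls in an omitted plane
 * or beyond U+10FFFF. */
int32_t prop_offset(uint32_t cp) {
    if (cp > 0x10FFFF) return -1;
    int slot = plane_slot[cp >> 16];
    if (slot < 0) return -1;
    return (int32_t)(((uint32_t)slot << 16) | (cp & 0xFFFF));
}

/* O(1) lookup; codepoints outside the table get a default of 0. */
uint8_t prop_lookup(uint32_t cp) {
    int32_t off = prop_offset(cp);
    return off < 0 ? 0 : props[off];
}
```

With these definitions `table_bytes(17)` is 1114112 and `table_bytes(5)` is 327680, matching the 17408 versus 5120 cache lines quoted above.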

kovidgoyal · 2021-08-07T03:26:45Z

Don't use a table. At least for width there are vast ranges of the space
that all have the same width value. And the overwhelmingly common case
is using simple ascii chars with width 1. So a switch with an if for
the ascii case is the most efficient implementation, given that branch
prediction will rarely miss. This is actually true of most unicode
properties so in kitty I just use switch with if for common cases for
pretty much all unicode properties I care about.
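As a rough illustration of that shape (not kitty's generated code), a width function with an ASCII fast path followed by a range switch might look like this. `cp_width_switch()` is a hypothetical name, the ranges are a small hand-written sample, and the `case A ... B:` syntax is a GCC/Clang extension rather than standard C.

```c
#include <stdint.h>

/* Width of a single codepoint: `if` for the common case, `switch` over
 * ranges for everything else. Ranges are illustrative, not complete. */
int cp_width_switch(uint32_t cp) {
    if (cp >= 0x20 && cp < 0x7F) {
        return 1;                     /* printable ASCII fast path */
    }
    switch (cp) {
        case 0x00:
            return 0;                 /* NUL */
        case 0x01 ... 0x1F:           /* C0 controls */
        case 0x7F ... 0x9F:           /* DEL and C1 controls */
            return -1;                /* nonprintable, as wcwidth() reports */
        case 0x0300 ... 0x036F:       /* combining diacritical marks */
        case 0x200B ... 0x200D:       /* ZWSP, ZWNJ, ZWJ */
        case 0xFE00 ... 0xFE0F:       /* variation selectors */
            return 0;
        case 0x1100 ... 0x115F:       /* Hangul Jamo initial consonants */
        case 0x4E00 ... 0x9FFF:       /* CJK Unified Ideographs */
        case 0xAC00 ... 0xD7A3:       /* Hangul syllables */
        case 0x1F300 ... 0x1F64F:     /* emoji pictographs (approximation) */
            return 2;
        default:
            return 1;
    }
}
```

The compiler is free to turn the switch into a binary search or jump table, so no data table has to be resident in cache for the ASCII case.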

dankamongmen · 2021-08-13T08:00:12Z

whatever is done, i want an API that i can feed a sequence of utf-8 or utf-32 and have segmentation occur along with column approximation, and furthermore it needs be reentrant in the sense that

i get a bit back telling me whether this is absolutely the end of the EGC (implying if not set that it matches the prefix of some longer EGC)
i can call with such a prefix plus new data (or ideally just the new data) and get the new combined EGC, or a new EGC based off the new data only, so another bit here (or api design around it, etc)
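One possible shape for such an API, as a sketch only: the caller feeds one UTF-32 codepoint at a time and gets back flag bits. All names (`egc_state`, `egc_feed()`, `EGC_NEW`, `EGC_FINAL`) are hypothetical, and the segmentation rules are a drastic simplification of UAX #29: combining marks, variation selectors, emoji modifiers and ZWJ extend the cluster, anything after a ZWJ joins it, control characters stand alone, and CR LF is not treated as a unit.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int len;         /* codepoints in the current cluster (0 = none yet) */
    int cols;        /* column estimate for the current cluster */
    bool after_zwj;  /* last codepoint fed was U+200D */
    bool final;      /* current cluster cannot be extended */
} egc_state;

enum {
    EGC_NEW   = 1,   /* cp started a new cluster; the previous one is complete */
    EGC_FINAL = 2    /* the current cluster is definitely complete */
};

static bool is_control(uint32_t cp) {
    return cp < 0x20 || (cp >= 0x7F && cp <= 0x9F);
}

static bool extends(uint32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F)     /* combining diacritics */
        || cp == 0x200D                       /* ZWJ */
        || (cp >= 0xFE00 && cp <= 0xFE0F)     /* variation selectors */
        || (cp >= 0x1F3FB && cp <= 0x1F3FF);  /* emoji modifiers */
}

/* Feed one codepoint; state carries over between calls, so input may
 * arrive in arbitrary chunks. Returns a mask of EGC_* flags. */
int egc_feed(egc_state *st, uint32_t cp) {
    int flags = 0;
    bool ctrl = is_control(cp);
    bool joins = st->len > 0 && !st->final && !ctrl
              && (extends(cp) || st->after_zwj);
    if (joins) {
        st->len++;
    } else {
        flags |= EGC_NEW;
        st->len = 1;
        /* column approximation comes from the first codepoint only */
        st->cols = ctrl ? 0 : ((cp >= 0x1F300 && cp <= 0x1FAFF) ? 2 : 1);
    }
    st->after_zwj = (cp == 0x200D);
    st->final = ctrl;
    if (st->final) {
        flags |= EGC_FINAL;
    }
    return flags;
}
```

A caller would zero-initialize an `egc_state`, feed codepoints as they arrive, and treat a cluster as finished either when `EGC_FINAL` is set or when the next call returns `EGC_NEW`. The first call always reports `EGC_NEW`.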

dankamongmen added the bug label Feb 5, 2021

dankamongmen added this to the 2.3.0 milestone Feb 5, 2021

dankamongmen self-assigned this Feb 5, 2021

dankamongmen added a commit that referenced this issue Feb 5, 2021

PoC ncwidth: do a wcswidth() verify pass #1329

db7281a

dankamongmen mentioned this issue Jul 3, 2021

Why ncplane_putegc() needs termination character? #1871

Closed

dankamongmen mentioned this issue Aug 2, 2021

notcurses-info should print a few multipoint EGCs #2005

Closed

dankamongmen removed this from the 3.0.0 milestone Oct 3, 2021
