Why ncplane_putegc() needs termination character? #1871
In this case, the wiki is talking about a sequence of Unicode code points that form a grapheme cluster. single UTF-8 codepoint = byte + byte + ... + byte grapheme cluster = EGC = codepoint + ... + codepoint + string_terminator in expanded form: UTF-8: wchar_t: |
Okay, my bad, I mixed up two things in here. Let me put it this way:
I have a byte stream that I'm trying to parse, including UTF-8 support. It seems I cannot just point putegc() to the beginning of a UTF-8 codepoint in a buffer. Instead I'll have to copy it first, terminate it with \0, and then call putegc() on it. Is that correct? |
You can send the entire UTF-8 string; these functions will bite off exactly as many bytes from the beginning of the string as the grapheme cluster is long. These functions (lines 1632 to 1640 in 1287d8b) return, in the sbytes variable, how many bytes they have bitten off from the beginning of the string (lines 63 to 78 in 1287d8b).
|
The following check (lines 83 to 85 in 1287d8b) is responsible for stopping at the end of the grapheme cluster.
It turns out that the string terminator is needed only at the end of the entire UTF-8 line. |
gclust = UTF-8 line = (EGC1) + ... + (EGCn) + string_terminator where after calling after that you can gclust = (EGC2) + ... + (EGCn) + string_terminator |
I have an SSH stream (or any other terminal stream) that is a mixture of plain text, terminal escape codes and possibly UTF-8 EGCs here and there. I'm making a parser for whatever I may stumble across. What I want to do is to recognize UTF-8 codepoints, so these are properly rendered, and properly handled if they happen to be split across buffer iterations. So, no, I'm not dealing with well defined UTF-8 string or anything like that. I need an answer to a question exactly like it has been asked. If my parser detects UTF-8 EGC, I have to put EGC into the ncplane, and I need to know the most efficient way for the sake of |
i was unaware that the talented @o-sdn-o was reading our bugs, but welcome! and he is correct. an individual encoded Unicode codepoint can be lexed without explicit termination, but this is not true for EGCs. |
The value of the first byte in the string determines what to retrieve first. If the first byte is a control byte, treat it and the necessary subsequent bytes as an escape sequence. Otherwise, this is the beginning of a grapheme cluster. |
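The first-byte dispatch described above could be sketched roughly as follows. This is only an illustration, not Notcurses code: the `lead_kind` enum and `classify_lead()` are hypothetical names. The sketch looks only at C0 controls and DEL; C1 controls encoded as 0xc2 followed by 0x80..0x9f would need a look at the second byte as well.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical categories for the first byte of the next token in the stream.
enum lead_kind {
  LEAD_CONTROL,  // C0 control (0x00-0x1f) or DEL (0x7f): escape/control handling
  LEAD_TEXT,     // printable ASCII or a valid UTF-8 lead byte: start of an EGC
  LEAD_INVALID,  // continuation byte, or a byte that never starts a sequence
};

// Decide what to parse next by looking at buf[0] only.
static enum lead_kind
classify_lead(const unsigned char* buf, size_t len){
  assert(buf != NULL && len > 0); // caller must supply at least one byte
  unsigned char b = buf[0];
  if(b < 0x20 || b == 0x7f){
    return LEAD_CONTROL;
  }
  if(b < 0x80){
    return LEAD_TEXT;             // printable ASCII
  }
  if(b >= 0xc2 && b <= 0xf4){
    return LEAD_TEXT;             // lead byte of a 2-, 3- or 4-byte sequence
  }
  return LEAD_INVALID;            // 0x80-0xc1 and 0xf5-0xff
}
```

A parser loop would call this on the first unconsumed byte, hand `LEAD_CONTROL` to its escape-sequence code, and pass `LEAD_TEXT` positions on to the EGC functions.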
It is important to keep in mind the following points:
|
@tomek-szczesny we've got functions for most of this kind of thing. what you almost certainly want for this is to use |
the following will take an input stream and spit it out as EGCs to a plane (not tested): // returns columns consumed, or -1 on invalid EGC / out of output space
int spray_utf8_egcs(const char* utf8text, struct ncplane* n, int* sbytes){
int cols = 0;
*sbytes = 0;
while(*utf8text){
int b, c;
if((c = ncplane_putegc(n, utf8text, &b)) <= 0){
return -1;
}
utf8text += b;
*sbytes += b;
cols += c;
}
return cols;
} |
@tomek-szczesny , this is probably the easiest and fastest way to do what you want, if i understand you correctly. most of what you're doing could pretty much be this function plus an escape check prior to the |
a good point is raised here -- all the // is it a control character? check C0 and C1, but don't count empty strings,
// nor single-byte strings containing only a NUL character.
static inline bool
is_control_egc(const unsigned char* egc, int bytes){
if(bytes == 1){
if(*egc && iscntrl(*egc)){
return true;
}
}else if(bytes == 2){
// 0xc2 followed by 0x80--0x9f are controls. 0xc2 followed by <0x80 is
// simply invalid utf8.
if(egc[0] == 0xc2){
if(egc[1] < 0xa0){
return true;
}
}
}
return false;
} |
// Eat an EGC from the UTF-8 string input, counting bytes and columns. We use
// libunistring's uc_is_grapheme_break() to segment EGCs. Writes the number of
// columns to '*colcount'. Returns the number of bytes consumed, not including
// any NUL terminator. Neither the number of bytes nor columns is necessarily
// equal to the number of decoded code points. Such are the ways of Unicode.
// uc_is_grapheme_break() wants UTF-32, which is fine, because we need wchar_t
// to use wcwidth() anyway FIXME except this doesn't work with 16-bit wchar_t!
static inline int
utf8_egc_len(const char* gcluster, int* colcount){
size_t ret = 0;
*colcount = 0;
int r;
mbstate_t mbt;
memset(&mbt, 0, sizeof(mbt));
wchar_t wc, prevw = 0;
do{
r = mbrtowc(&wc, gcluster, MB_CUR_MAX, &mbt);
if(r < 0){
return -1;
}
if(prevw && uc_is_grapheme_break(prevw, wc)){
break; // starts a new EGC, exit and do not claim
}
int cols = wcwidth(wc);
if(cols < 0){
if(iswspace(wc)){ // newline or tab
return ret + 1;
}
return -1;
}
*colcount += cols;
ret += r;
gcluster += r;
prevw = wc;
}while(r);
return ret;
} |
note that this algorithm is imperfect, because (a) |
i think this is about everything needed to be said? closing this up. good discussion. |
Well, that's more of a mess than I was hoping for. No wonder why I ended up as an electronics engineer, where no fucked up heritage of a dozen of character encodings clogs up efficient development. |
i mean, i would hope your code can faithfully reproduce your own last name =]. i'm happy to look over ASCII-only code, but i'm not going to merge it in that condition. full unicode support is a fundamental feature of Notcurses. |
and while having to deal with character encodings is indeed one of the less pleasant elements of computer science, we make that back through those old watchwords, modularity and encapsulation. by using the functions mentioned, your code oughtn't need know anything about unicode other than "i need to use these functions to segment EGCs". if there's anything missing, let me know, but i think i've fleshed out the whole unicode/EGC thing pretty thoroughly, and unit tested the hell out of it. so just make sure you're using the functionality available, and it ought not be much more difficult than the ASCII-only equivalent. =] |
Ideally my code should play notcurses-demo inside ncplane, including the unicode orgy. No worries, I'm not giving up on a functionality just because I have a hard time understanding it. The sole reason why I need to deal with UTF-8 stuff is to protect the continuity of the byte stream. My code must be aware of any multi-byte chunk (be it UTF-8 or VTspeak) and carry over the unresolved stub in front of the next buffer content. I did that with a few SGRs and it works pretty well even with 16-byte buffer. |
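For the carry-over described above, one possible sketch is to ask, before parsing a buffer, how many bytes at its end belong to a UTF-8 sequence that has not fully arrived. Both function names are made up for illustration and are not part of Notcurses. Only UTF-8 is covered; unfinished escape sequences would need their own check.

```c
#include <assert.h>
#include <stddef.h>

// Length a UTF-8 sequence should have, judged by its lead byte alone.
// Returns 0 for bytes that cannot start a sequence.
static size_t
utf8_expected_len(unsigned char b){
  if(b < 0x80) return 1;
  if(b >= 0xc2 && b <= 0xdf) return 2;
  if(b >= 0xe0 && b <= 0xef) return 3;
  if(b >= 0xf0 && b <= 0xf4) return 4;
  return 0;
}

// Returns how many bytes at the end of buf[0..len) belong to a UTF-8 sequence
// whose remaining bytes have not arrived yet (0 if the buffer ends cleanly).
static size_t
utf8_incomplete_tail(const unsigned char* buf, size_t len){
  assert(buf != NULL || len == 0);
  // A lead byte is followed by at most 3 continuation bytes, so an unfinished
  // sequence can only begin within the last 3 bytes of the buffer.
  for(size_t back = 1 ; back <= 3 && back <= len ; ++back){
    unsigned char b = buf[len - back];
    if((b & 0xc0) == 0x80){
      continue; // continuation byte, keep looking for its lead byte
    }
    size_t want = utf8_expected_len(b);
    return want > back ? back : 0;
  }
  return 0;
}
```

The caller would parse `len - tail` bytes now and move the `tail` bytes to the front of the next buffer before reading more input.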
yeah this makes total sense, let me whip something up. done. |
Awesome, thanks! |
Invalid ranges for UTF-8 first bytes:
These bytes cannot appear in valid UTF-8. For invalid bytes, the length must be either 0 or 1. |
Consider a table lookup // utf: First byte based UTF-8 codepoint lengths.
int utf8lengths[] =
{ // 0 1 2 3 4 5 6 7 8 9 A B C D E F
/* 0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 1 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 2 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 3 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 4 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 5 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 6 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 7 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 8 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* 9 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* A */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* B */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
/* C */ 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* D */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* E */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
/* F */ 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
}; |
ooooh indeed, fixed up, thanks! https://unicode.org/versions/corrigendum1.html |
i would want to microbenchmark before doing any such thing. i personally doubt that even three branch mispredictions (for the four byte case) is going to come anywhere close to a cacheline fill for this 2KB (32 64B cache lines). i'd further suspect that at least one compiler out of gcc, clang, and icc can generate this lookup table from the if statements if it's indeed the optimal way to do things =]. a partitioning conditional of the kind i have looks just like this to compiler internals anyway. and this is far less debuggable =] |
to be clear, i appreciate the suggestion, and your verve for lookup tables. i just personally bet that (in a benchmark that evicts cachelines regularly, so the entire 2KB doesn't sit untrammeled in L1) the few branch mispredictions are at most 3xpipeline cycles lost (call it 60 cycles), whereas pulling a cacheline in from DRAM is going to be hundreds of cycles minimum. assume that this structure gets wholly evicted between calls (quite plausible). there's a single compulsory cache miss, so unless you think this table is not going to be evicted, you're betting AMAT vs pipeline length. i'll take pipeline length every time, or at least with every processor released since 2000. |
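To make the two options in this exchange concrete, here is a rough sketch of a branch-based length function with the same values as the table posted above, plus a helper that derives a table from it so the two can be checked against each other or benchmarked. The names are hypothetical, and this is not the code that was committed to Notcurses. Using `unsigned char` entries is an assumption of this sketch; it keeps the table at 256 bytes.

```c
#include <assert.h>
#include <stddef.h>

// Partitioning conditional: sequence length from the first byte, 0 if the
// byte cannot start a UTF-8 sequence.
static inline int
utf8_len_branch(unsigned char b){
  if(b < 0x80) return 1;  // ASCII
  if(b < 0xc2) return 0;  // continuation bytes, and overlong leads 0xc0/0xc1
  if(b < 0xe0) return 2;
  if(b < 0xf0) return 3;
  if(b < 0xf5) return 4;
  return 0;               // 0xf5-0xff never appear in UTF-8
}

// Fill a 256-entry lookup table from the branch version, so both variants
// are guaranteed to agree before any timing comparison is made.
static void
utf8_len_table_init(unsigned char table[256]){
  assert(table != NULL);
  for(int i = 0 ; i < 256 ; ++i){
    table[i] = (unsigned char)utf8_len_branch((unsigned char)i);
  }
}
```

Generating the table from the conditional keeps a single source of truth; which variant is faster still has to be measured in the actual call pattern, as both sides of the discussion agree.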
It seems to me that the function call is too expensive and, if used, it must be explicitly inlined. Perhaps the compiler will do it, but it probably depends on where the function is used. It also seems to me that there are a lot of other calls between successive calls to check the UTF-8 length. This overrides any optimization. Here you need to profile each specific case. It is not known how the card will fall. 🙂 |
Would that help if I promised not to call it if i'm >5 bytes away from the end of the buffer? :D |
|
Not if I'm away from the end of the buffer - in that case I may safely call putegc(). |
it's
of course =] |
Yay, I implemented UTF-8 yesterday in my proto-VT. |
I was wrong. I missed this detail. |
@tomek-szczesny Here you can find cool samples to test your vt-parser https://16colo.rs/ |
love it |
That's a nice collection, but I'm pretty sure ASCII chars and 16 colors are well tested by now. ;) If you guys stumble across any terminal program that generates static 24-bit color output, let me know. For now this is the unchallenged feature that in theory is implemented. :) |
You can use GIMP. It can export a truecolor image to a C struct. Menu Replace the exported C struct at the beginning of the following code and profit. 😀 static const struct /* GIMP RGB C-Source image dump (Untitled2.c) */
{
int width;
int height;
int bytes_per_pixel; /* 2:RGB16, 3:RGB, 4:RGBA */
unsigned char pixel_data[10 * 10 * 3 + 1];
} gimp_image = {
10, 10, 3,
"\377\374\374\377\371\371\377\317\317\377\305\305\377\250\250\377\247\247"
"\377\250\250\377\311\311\377\345\345\377\376\376\377\307\307\377\300\300"
"\377YY\377$%\377\036\036\377\036\036\377\037\037\377\064\064\377\226\226\377\372"
"\372\377\374\374\377\374\374\374\356\361\347y\245\361Ae\377@A\377[[\371\315"
"\322\370\362\362\377\377\377\377\377\377\376\376\376\350\350\370Z:\361\276"
"L\236\373>@\366\251\263m_\363\232\232\245\370\370\370\376\376\376\345\345"
"\354ff\354eS\371\344Yw\366\201\207\213u\350{{\362\341\341\350\376\376\376"
"\374\374\374\276\276\314\\\\\372\216\216\375\236\224\242\325\304\343\242"
"Q\310\354\354\370\376\376\376\377\377\377\374\374\377\316\316\375\261\261"
"\376\257\257\365\253\231\262\353\035o\377\003\003\377\334\334\377\377\377\377"
"\377\377\374\374\377\316\316\377qq\377\202\202\372YR\363\317\235\257\375"
"ss\377\366\366\377\377\377\377\377\377\377\377\377\374\374\377\354\354\377"
"II\377\351\351\374\373\373\373\377\366\366\377\377\377\377\377\377\377\377"
"\377\377\377\377\377\377\377\377\377\377\363\363\377\377\377\377\377\377"
"\377\377\377\377\377\377\377\377\377\377\377\377\377",
};
#include <iostream>
#include <string>
struct rgb
{
unsigned char r;
unsigned char g;
unsigned char b;
bool operator ==(rgb const& c) const { return c.r == r && c.g == g && c.b == b; }
bool operator !=(rgb const& c) const { return !operator==(c); }
};
std::string fgc(rgb c)
{
return "\033[38;2;" + std::to_string(c.r) + ';'
+ std::to_string(c.g) + ';'
+ std::to_string(c.b) + 'm';
}
std::string bgc(rgb c)
{
return "\033[48;2;" + std::to_string(c.r) + ';'
+ std::to_string(c.g) + ';'
+ std::to_string(c.b) + 'm';
}
int main()
{
int w = gimp_image.width;
int h = gimp_image.height;
int s = gimp_image.bytes_per_pixel;
int i = 0; // upper line
int j = w * s; // lower line
std::string result;
rgb old_bg = {};
rgb old_fg = {};
result += bgc(old_bg) + fgc(old_fg);
for (int y = 0; y < h; y += 2)
{
for (int x = 0; x < w; x++)
{
rgb bg = rgb{ gimp_image.pixel_data[i + 0],
gimp_image.pixel_data[i + 1],
gimp_image.pixel_data[i + 2] };
rgb fg = rgb{ gimp_image.pixel_data[j + 0],
gimp_image.pixel_data[j + 1],
gimp_image.pixel_data[j + 2] };
if (bg == fg)
{
if (bg != old_bg)
{
old_bg = bg;
result += bgc(bg);
}
result += " ";
}
else
{
if (bg != old_bg)
{
old_bg = bg;
result += bgc(bg);
if (fg != old_fg)
{
old_fg = fg;
result += fgc(fg);
}
}
else
{
if (fg != old_fg)
{
old_fg = fg;
result += fgc(fg);
}
}
result += "▄";
}
i += s;
j += s;
}
result += "\033[m\n" + bgc(old_bg) + fgc(old_fg);
i += w * s;
j += w * s;
}
result += "\033[m";
std::cout << result;
} Ooops, a small mistake, the height of the picture must be an even number. There are some samples here https://gist.github.com/XVilka/8346728 |
That is all true. When moving cursor around is supported, I think I'll finally be able to run htop LIVE in ncplane! ^^ |
htop's first frame repertoire:
Essential: |
I guess you are very fond of that VTM toy of yours. :) |
VTM is awesome! |
Thank you! I hope this multiplexer will be useful to someone for pair programming. Its main functionality is session life sharing (via SSH or somehow). |
As far as I know, these sequences change the mode/format of the keystrokes that are sent by the terminal to the application. |
@o-sdn-o , I invite you to share your VTspeak knowledge in my dedicated repo :) |
@tomek-szczesny Two more useful links Terminal developers rallying point Terminal capabilities to applications |
As far as the Wiki is concerned, UTF-8 can be unambiguously parsed without any sort of termination. Why must there be a termination character at the end of a wchar_t*?