Why ncplane_putegc() needs termination character? #1871

tomek-szczesny · 2021-07-03T13:57:11Z

As far as Wiki's concerned, UTF-8 can be unambiguously parsed without any sort of termination. Why must there be a termination character at the end of a wchar_t*?

o-sdn-o · 2021-07-03T15:07:58Z

UTF-8 can be unambiguously parsed without any sort of termination to get a single codepoint.

In this case, the wiki is talking about a sequence of Unicode code points that form a grapheme cluster.

single UTF-8 codepoint = byte + byte + ... + byte
single wchar_t codepoint ~ = wchar_t or (wchar_t + wchar_t for surrogate pair)

grapheme cluster = EGC = codepoint + ... + codepoint + string_terminator

in expanded form:

UTF-8:
grapheme cluster = EGC = (byte + ... + byte) + ... + (byte + ... + byte) + string_terminator

wchar_t:
grapheme cluster = EGC = (wchar_t) + ... + (wchar_t) + string_terminator
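
As a concrete illustration of the layers above, here is a minimal C sketch. The helper name count_codepoints is hypothetical and not part of Notcurses. It counts code points in a UTF-8 string by skipping continuation bytes. For "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 0x65 0xcc 0x81) it reports two code points in three bytes, even though the pair forms a single grapheme cluster under Unicode segmentation rules.

```c
#include <stddef.h>

// Hypothetical helper (not a Notcurses function): count code points in a
// NUL-terminated UTF-8 string by counting every byte that is not a
// continuation byte (bit pattern 10xxxxxx). Assumes well-formed input.
static size_t count_codepoints(const char* s){
  size_t n = 0;
  for( ; *s ; ++s){
    if(((unsigned char)*s & 0xc0) != 0x80){
      ++n; // ASCII byte or lead byte of a multibyte sequence
    }
  }
  return n;
}
```

count_codepoints("e\xcc\x81") returns 2, while strlen() on the same string returns 3. Neither number tells you where the grapheme cluster ends; that requires a segmentation algorithm.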

tomek-szczesny · 2021-07-03T15:37:40Z

Okay, my bad, I mixed up two things in here. Let me put it this way:

ncplane_putwegc(): writes a single EGC from an array of wchar_t
ncplane_putegc(): writes a single EGC from an array of UTF-8

static inline int ncplane_putegc(struct ncplane* n, const char* gclust, int* sbytes);

Functions accepting a single EGC expect (...) a series of UTF-8 char terminated by '\0'.

I have a byte stream that I'm trying to parse, including UTF-8 support. It seems I cannot just point putegc() at the beginning of a UTF-8 codepoint in a buffer. Instead I'll have to copy it first, terminate it with '\0', and then call putegc() on it. Is that correct?

o-sdn-o · 2021-07-03T16:16:26Z

You can pass the entire UTF-8 string; these functions bite off exactly as many bytes from the beginning of the string as the first grapheme cluster occupies. ncplane_putegc() and friends return, in the sbytes variable, how many bytes they have bitten off from the beginning of the string.

notcurses/src/lib/notcurses.c

Lines 1632 to 1640 in 1287d8b

int ncplane_putegc_yx(ncplane* n, int y, int x, const char* gclust, int* sbytes){
  int cols;
  int bytes = utf8_egc_len(gclust, &cols);
  if(bytes < 0){
    return -1;
  }
  if(sbytes){
    *sbytes = bytes;
  }

notcurses/src/lib/egcpool.h

Lines 63 to 78 in 1287d8b

// Eat an EGC from the UTF-8 string input, counting bytes and columns. We use
// libunistring's uc_is_grapheme_break() to segment EGCs. Writes the number of
// columns to '*colcount'. Returns the number of bytes consumed, not including
// any NUL terminator. Neither the number of bytes nor columns is necessarily
// equal to the number of decoded code points. Such are the ways of Unicode.
// uc_is_grapheme_break() wants UTF-32, which is fine, because we need wchar_t
// to use wcwidth() anyway FIXME except this doesn't work with 16-bit wchar_t!
static inline int
utf8_egc_len(const char* gcluster, int* colcount){
  size_t ret = 0;
  *colcount = 0;
  int r;
  mbstate_t mbt;
  memset(&mbt, 0, sizeof(mbt));
  wchar_t wc, prevw = 0;
  do{

o-sdn-o · 2021-07-03T16:28:03Z

The following check is responsible for stopping at the end of the grapheme cluster:

notcurses/src/lib/egcpool.h

Lines 83 to 85 in 1287d8b

if(prevw && uc_is_grapheme_break(prevw, wc)){
  break; // starts a new EGC, exit and do not claim
}

It turns out that the string terminator is needed only at the end of the entire UTF-8 line.

o-sdn-o · 2021-07-03T16:39:48Z

gclust = UTF-8 line = (EGC1) + ... + (EGCn) + string_terminator

where
EGC = codepoint + ... + codepoint
codepoint = byte + ... + byte

after calling ncplane_putegc( ... const char* gclust, int* sbytes), *sbytes contains the length of EGC1 in bytes

after that you can
gclust += *sbytes;

gclust = (EGC2) + ... + (EGCn) + string_terminator

tomek-szczesny · 2021-07-03T21:24:58Z

I have an SSH stream (or any other terminal stream) that is a mixture of plain text, terminal escape codes and possibly UTF-8 EGCs here and there. I'm making a parser for whatever I may stumble across. What I want to do is to recognize UTF-8 codepoints, so these are properly rendered, and properly handled if they happen to be split across buffer iterations.

So, no, I'm not dealing with a well-defined UTF-8 string or anything like that. I need an answer to the question exactly as it was asked. If my parser detects a UTF-8 EGC, I have to put that EGC into the ncplane, and I need to know the most efficient way to do so, for the sake of the putvt() functionality I'm working on. I'll try just shoving it into putegc() and see if it works, but even if it does, the documentation should get an update, I guess?

dankamongmen · 2021-07-03T21:59:48Z

UTF-8 can be unambiguously parsed without any sort of termination to get a single codepoint.
In this case, the wiki is talking about a sequence of Unicode code points that form a grapheme cluster.

i was unaware that the talented @o-sdn-o was reading our bugs, but welcome! and he is correct. an individual encoded Unicode codepoint can be lexed without explicit termination, but this is not true for EGCs.

o-sdn-o · 2021-07-03T22:08:48Z

I'm not dealing with well defined UTF-8 string or anything like that. I need an answer to a question exactly like it has been asked. If my parser detects UTF-8 EGC, I have to put EGC into the ncplane, and I need to know the most efficient way for the sake of putvt() functionality I'm working at.

The value of the first byte in the string determines what to retrieve first. If the first byte is a control byte, treat it and the necessary subsequent bytes as an escape sequence. Otherwise, this is the beginning of a grapheme cluster.

o-sdn-o · 2021-07-03T22:18:35Z

It is important to keep in mind the following points:

a byte can be an ASCII control character C0
can be the first byte of a UTF-8-encoded control code point
or a C1 control character
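
The first-byte dispatch described above could look roughly like the following C sketch. Everything here (the enum, the function name classify_lead) is hypothetical and not part of Notcurses. It assumes the stream is UTF-8, so C1 controls arrive encoded as 0xc2 followed by 0x80..0x9f; raw 8-bit C1 bytes are not handled.

```c
#include <stddef.h>

// Hypothetical result type for a stream parser's first-byte dispatch.
enum lead_kind {
  LEAD_C0,        // ASCII control (0x00..0x1f, including ESC) or DEL
  LEAD_C1,        // UTF-8 encoded C1 control, U+0080..U+009F
  LEAD_TEXT,      // anything else: treat as the start of an EGC
  LEAD_NEED_MORE  // cannot decide yet, wait for more input
};

// Decide from the first byte(s) whether to divert into control-sequence
// handling or hand the data to an EGC-aware output function.
static enum lead_kind classify_lead(const unsigned char* buf, size_t len){
  if(len == 0){
    return LEAD_NEED_MORE;
  }
  unsigned char c = buf[0];
  if(c < 0x20 || c == 0x7f){
    return LEAD_C0;
  }
  if(c == 0xc2){ // may introduce a two-byte C1 control
    if(len < 2){
      return LEAD_NEED_MORE;
    }
    if(buf[1] >= 0x80 && buf[1] <= 0x9f){
      return LEAD_C1;
    }
  }
  return LEAD_TEXT;
}
```

A parser would call this at the current read position and only pass LEAD_TEXT data on for EGC output; LEAD_NEED_MORE means the byte should be carried over to the next buffer.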

dankamongmen · 2021-07-03T22:32:15Z

@tomek-szczesny we've got functions for most of this kind of thing. what you almost certainly want for this is to use nccell_load() on the input stream, breaking it up into nccells at EGC boundaries, and then use ncplane_putc() to put those nccells down onto the output plane. this would require keeping your own idea of current color around, since nccell encompasses color. if you didn't want to do that, use ncplane_putegc() on the input stream directly. you'll get a return value indicating the number of columns consumed, and the sbytes parameter will hold the number of bytes consumed.

dankamongmen · 2021-07-03T22:35:00Z

the following will take an input stream and spit it out as EGCs to a plane (not tested):

// returns columns consumed, or -1 on invalid EGC / out of output space
int spray_utf8_egcs(const char* utf8text, struct ncplane* n, int* sbytes){
  int cols = 0;
  *sbytes = 0;
  while(*utf8text){
    int b, c;
    if((c = ncplane_putegc(n, utf8text, &b)) <= 0){
      return -1;
    }
    utf8text += b;
    *sbytes += b;
    cols += c;
  }
  return cols;
}

dankamongmen · 2021-07-03T22:37:37Z

@tomek-szczesny , this is probably the easiest and fastest way to do what you want, if i understand you correctly. most of what you're doing could pretty much be this function plus an escape check prior to the ncplane_putegc(), diverting into extraction and dispatch of said control sequence. in fact...yeah, i hope that's exactly how you're writing this (ideally in a way that can be fed streaming data, which is a bit more difficult, and you can hold off on if you'd like).

dankamongmen · 2021-07-03T22:40:06Z

It is important to keep in mind the following points:

a byte can be an ASCII control character C0

can be the first byte of a UTF-8-encoded control code point

or a C1 control character

a good point is raised here -- all the ncplane_*() output functions will reject a C0 character (except for '\n' iff the plane is set to scroll, and horizontal tab in the near future). see is_control_egc():

// is it a control character? check C0 and C1, but don't count empty strings,                              
// nor single-byte strings containing only a NUL character.                                                
static inline bool                                                                                         
is_control_egc(const unsigned char* egc, int bytes){                                                       
  if(bytes == 1){                                                                                          
    if(*egc && iscntrl(*egc)){                                                                             
      return true;                                                                                         
    }                                                                                                      
  }else if(bytes == 2){                                                                                    
    // 0xc2 followed by 0x80--0x9f are controls. 0xc2 followed by <0x80 is                                 
    // simply invalid utf8.                                                                                
    if(egc[0] == 0xc2){                                                                                    
      if(egc[1] < 0xa0){                                                                                   
        return true;                                                                                       
      }                                                                                                    
    }                                                                                                      
  }                                                                                                        
  return false;                                                                                            
}

dankamongmen · 2021-07-03T22:43:28Z

utf8_egc_len() is also relevant, as it implements our segmentation algorithm:

// Eat an EGC from the UTF-8 string input, counting bytes and columns. We use                              
// libunistring's uc_is_grapheme_break() to segment EGCs. Writes the number of                             
// columns to '*colcount'. Returns the number of bytes consumed, not including                             
// any NUL terminator. Neither the number of bytes nor columns is necessarily                              
// equal to the number of decoded code points. Such are the ways of Unicode.                               
// uc_is_grapheme_break() wants UTF-32, which is fine, because we need wchar_t                             
// to use wcwidth() anyway FIXME except this doesn't work with 16-bit wchar_t!                             
static inline int                                                                                          
utf8_egc_len(const char* gcluster, int* colcount){                                                         
  size_t ret = 0;                                                                                          
  *colcount = 0;                                                                                           
  int r;                                                                                                   
  mbstate_t mbt;                                                                                           
  memset(&mbt, 0, sizeof(mbt));                                                                            
  wchar_t wc, prevw = 0;                                                                                   
  do{                                                                                                      
    r = mbrtowc(&wc, gcluster, MB_CUR_MAX, &mbt);                                                          
    if(r < 0){                                                                                             
      return -1;                                                                                           
    }                                                                                                      
    if(prevw && uc_is_grapheme_break(prevw, wc)){                                                          
      break; // starts a new EGC, exit and do not claim                                                    
    }                                                                                                      
    int cols = wcwidth(wc);                                                                                
    if(cols < 0){                                                                                          
      if(iswspace(wc)){ // newline or tab                                                                  
        return ret + 1;                                                                                    
      }                                                                                                    
      return -1;                                                                                           
    }                                                                                                      
    *colcount += cols;                                                                                     
    ret += r;                                                                                              
    gcluster += r;                                                                                         
    prevw = wc;                                                                                            
  }while(r);                                                                                               
  return ret;                                                                                              
}

dankamongmen · 2021-07-03T22:51:22Z

note that this algorithm is imperfect, because (a) wcwidth() is imperfect and (b) i don't think this accounts for e.g. ZWJ composed emoji (maybe it does? not for width, though). i wouldn't worry about that. see #1329. @joseluis gets mad at me about this every so often, but doing it perfectly would require knowledge of the (a) font (b) font rendering engine and (c) terminal font system, and fuck all that noise.

dankamongmen · 2021-07-03T22:52:58Z

i think this is about everything needed to be said? closing this up. good discussion.

tomek-szczesny · 2021-07-03T23:50:09Z

Well, that's more of a mess than I was hoping for. No wonder why I ended up as an electronics engineer, where no fucked up heritage of a dozen of character encodings clogs up efficient development.
I'll look into that tomorrow, and clean up my code before making it public, because I see expectations lurking here and there.. ;)
Thanks for the tips, most interesting indeed. I'll try to make use of that later on. Just need a sample output that actually uses anything besides ASCII, for testing. Hm, would be far easier if I knew Russian or something ;)

dankamongmen · 2021-07-04T00:22:07Z

Well, that's more of a mess than I was hoping for. No wonder why I ended up as an electronics engineer, where no fucked up heritage of a dozen of character encodings clogs up efficient development.
I'll look into that tomorrow, and clean up my code before making it public, because I see expectations lurking here and there.. ;)
Thanks for the tips, most interesting indeed. I'll try to make use of that later on. Just need a sample output that actually uses anything besides ASCII, for testing. Hm, would be far easier if I knew Russian or something ;)

i mean, i would hope your code can faithfully reproduce your own last name =]. i'm happy to look over ASCII-only code, but i'm not going to merge it in that condition. full unicode support is a fundamental feature of Notcurses.

dankamongmen · 2021-07-04T00:27:34Z

and while having to deal with character encodings is indeed one of the less pleasant elements of computer science, we make that back through those old watchwords, modularity and encapsulation. by using the functions mentioned, your code oughtn't need know anything about unicode other than "i need to use these functions to segment EGCs". if there's anything missing, let me know, but i think i've fleshed out the whole unicode/EGC thing pretty thoroughly, and unit tested the hell out of it. so just make sure you're using the functionality available, and it ought not be much more difficult than the ASCII-only equivalent. =]

tomek-szczesny · 2021-07-04T10:57:36Z

i mean, i would hope your code can faithfully reproduce your own last name =]. i'm happy to look over ASCII-only code, but i'm not going to merge it in that condition. full unicode support is a fundamental feature of Notcurses.

Ideally my code should play notcurses-demo inside ncplane, including the unicode orgy. No worries, I'm not giving up on a functionality just because I have a hard time understanding it.
I guess I may benefit from reading that chapter from your book again.

The sole reason why I need to deal with UTF-8 stuff is to protect the continuity of the byte stream. My code must be aware of any multi-byte chunk (be it UTF-8 or VTspeak) and carry over the unresolved stub in front of the next buffer content. I did that with a few SGRs and it works pretty well even with 16-byte buffer.
The function returning a byte length of a codepoint would be most helpful, if it can draw conclusions just by looking at the first byte. I cannot guarantee the whole codepoint is in the buffer.
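
One possible shape for the carry-over logic described above, as a C sketch under stated assumptions: both function names are hypothetical (this is not the function later added to Notcurses), and the buffer is assumed to hold UTF-8. It inspects at most the last few bytes of a buffer and reports how many of them belong to a multibyte sequence that has started but not finished, so that the caller can hold them back and prepend them to the next read.

```c
#include <stddef.h>

// Hypothetical helper: sequence length implied by a first byte, or 0 if the
// byte cannot start a UTF-8 sequence.
static int seq_len(unsigned char c){
  if(c < 0x80){ return 1; }
  if(c >= 0xc2 && c <= 0xdf){ return 2; }
  if(c >= 0xe0 && c <= 0xef){ return 3; }
  if(c >= 0xf0 && c <= 0xf4){ return 4; }
  return 0;
}

// Returns the number of bytes at the end of buf[0..len) that form the
// beginning of an unfinished UTF-8 sequence (0 if the buffer ends cleanly).
static size_t incomplete_tail(const unsigned char* buf, size_t len){
  size_t back = 0;
  while(back < len && back < 4){
    unsigned char c = buf[len - 1 - back];
    ++back;
    if((c & 0xc0) != 0x80){ // found an ASCII or lead byte
      int need = seq_len(c);
      return (need > (int)back) ? back : 0;
    }
  }
  return 0; // only continuation bytes seen: malformed, leave it to the decoder
}
```

With a buffer ending in 0xe2 0x82 (the first two bytes of a three-byte sequence), incomplete_tail() returns 2, and the caller would process len - 2 bytes now and move the remaining two to the front of the next buffer.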

dankamongmen · 2021-07-04T13:11:33Z

The function returning a byte length of a codepoint would be most helpful, if it can draw conclusions just by looking at the first byte. I cannot guarantee the whole codepoint is in the buffer.

yeah this makes total sense, let me whip something up. done.

tomek-szczesny · 2021-07-04T13:41:56Z

Awesome, thanks!

o-sdn-o · 2021-07-04T19:32:05Z

add utf8_codepoint_length() #1871

Invalid ranges for UTF-8 first bytes:

[0x80, 0xc1]
[0xf5, 0xff]

These bytes cannot appear in valid UTF-8. For invalid bytes, the length must be either 0 or 1.

o-sdn-o · 2021-07-04T19:42:27Z

Consider a table lookup

    // utf: First byte based UTF-8 codepoint lengths.
    int utf8lengths[] =
    {	//      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
        /* 0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 1 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 2 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 3 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 4 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 5 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 6 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 7 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 8 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 9 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* A */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* B */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* C */ 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        /* D */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        /* E */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        /* F */ 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    };
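
For comparison, the same mapping can be written as a handful of range checks, and a table can be filled from it. This is only a sketch with hypothetical names (lead_length, fill_lengths, count_invalid_leads), not the Notcurses implementation. Using uint8_t entries keeps such a table at 256 bytes rather than 256 * sizeof(int).

```c
#include <stdint.h>

// Hypothetical range-check version of the table above: sequence length
// implied by a first byte, 0 for bytes that cannot start a sequence.
static int lead_length(unsigned char c){
  if(c <= 0x7f){ return 1; }
  if(c <= 0xc1){ return 0; } // continuation bytes, plus overlong 0xc0/0xc1
  if(c <= 0xdf){ return 2; }
  if(c <= 0xef){ return 3; }
  if(c <= 0xf4){ return 4; }
  return 0;                  // 0xf5..0xff never appear in valid UTF-8
}

// Build a 256-entry lookup table from the branching version.
static void fill_lengths(uint8_t table[256]){
  for(int i = 0 ; i < 256 ; ++i){
    table[i] = (uint8_t)lead_length((unsigned char)i);
  }
}

// Count the first bytes that map to 0, as a sanity check against the
// invalid ranges [0x80, 0xc1] and [0xf5, 0xff].
static int count_invalid_leads(void){
  uint8_t table[256];
  fill_lengths(table);
  int zeros = 0;
  for(int i = 0 ; i < 256 ; ++i){
    zeros += (table[i] == 0);
  }
  return zeros;
}
```

The two invalid ranges cover 66 and 11 byte values, so count_invalid_leads() returns 77, matching the number of zero entries in the table above.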

dankamongmen · 2021-07-04T19:58:28Z

add utf8_codepoint_length() #1871

Invalid ranges for UTF-8 first bytes:

[0x80, 0xc1]

[0xf5, 0xff]

These bytes cannot appear in valid UTF-8. For invalid bytes, the length must be either 0 or 1.

ooooh indeed, fixed up, thanks! https://unicode.org/versions/corrigendum1.html

dankamongmen · 2021-07-04T20:01:12Z

Consider a table lookup

i would want to microbenchmark before doing any such thing. i personally doubt that even three branch mispredictions (for the four byte case) is going to come anywhere close to a cacheline fill for this 1KB (16 64B cache lines, given 4-byte ints).

i'd further suspect that at least one compiler out of gcc, clang, and icc can generate this lookup table from the if statements if it's indeed the optimal way to do things =]. a partitioning conditional of the kind i have looks just like this to compiler internals anyway.

and this is far less debuggable =]

dankamongmen · 2021-07-04T20:04:50Z

to be clear, i appreciate the suggestion, and your verve for lookup tables. i just personally bet that (in a benchmark that evicts cachelines regularly, so the entire table doesn't sit untrammeled in L1) the few branch mispredictions cost at most 3x the pipeline length in cycles (call it 60 cycles), whereas pulling a cacheline in from DRAM is going to be hundreds of cycles minimum. assume that this structure gets wholly evicted between calls (quite plausible). there's a single compulsory cache miss, so unless you think this table is not going to be evicted, you're betting AMAT vs pipeline length. i'll take pipeline length every time, or at least with every processor released since 2000.

o-sdn-o · 2021-07-04T20:44:44Z

i would want to microbenchmark before doing any such thing. i personally doubt that even three branch mispredictions (for the four byte case) is going to come anywhere close to a cacheline fill for this 1KB (16 64B cache lines, given 4-byte ints).

It seems to me that the function call is too expensive and, if used, it must be explicitly inlined. Perhaps the compiler will do it, but it probably depends on where the function is used.

It also seems to me that there are a lot of other calls between successive calls to check the UTF-8 length. This overrides any optimization.

Here you need to profile each specific case. It is not known how the card will fall. 🙂

tomek-szczesny · 2021-07-04T21:05:10Z

Would that help if I promised not to call it if i'm >5 bytes away from the end of the buffer? :D

o-sdn-o · 2021-07-04T21:07:06Z

Would that help if I promised not to call it if i'm >5 bytes away from the end of the buffer? :D

~~You have to find out the length anyway.~~

tomek-szczesny · 2021-07-04T21:15:19Z

Not if I'm away from the end of the buffer - in that case I may safely call putegc().

dankamongmen · 2021-07-04T22:43:03Z

It seems to me that the function call is too expensive and, if used, it must be explicitly inlined. Perhaps the compiler will do it, but it probably depends on where the function is used.

it's static inline! i do not expect to see this function ever show up as an identifier

It also seems to me that there are a lot of other calls between successive calls to check the UTF-8 length. This overrides any optimization.
Here you need to profile each specific case. It is not known how the card will fall.

of course =]

tomek-szczesny · 2021-07-06T12:27:55Z

Yay, I implemented UTF-8 yesterday in my proto-VT.
Indeed I successfully used putegc() pointed at specific places in my buffer and it parsed the underlying EGCs no problem, EXCEPT for one case.
For whatever reason, putegc() always fails if pointed at the last complete codepoint in the buffer, if the following codepoint is incomplete, as far as I can tell.
I didn't find the underlying cause, I just assumed that if putegc() fails, it's time to wrap up the state machine party and deal with the rest of the buffer on next iteration.

o-sdn-o · 2021-07-07T11:09:10Z

it's static inline! i do not expect to see this function ever show up as an identifier

I was wrong. I missed this detail.

o-sdn-o · 2021-07-07T11:12:45Z

Just need a sample output that actually uses anything besides ASCII, for testing. Hm, would be far easier if I knew Russian or something ;)

@tomek-szczesny Here you can find cool samples to test your vt-parser https://16colo.rs/

dankamongmen · 2021-07-07T12:26:24Z

@tomek-szczesny Here you can find cool samples to test your vt-parser https://16colo.rs/

love it

tomek-szczesny · 2021-07-07T12:30:42Z

@tomek-szczesny Here you can find cool samples to test your vt-parser https://16colo.rs/

That's a nice collection, but I'm pretty sure ASCII chars and 16 colors are well tested by now. ;)

If you guys stumble across any terminal program that generates static 24-bit color output, let me know. For now this is the unchallenged feature that in theory is implemented. :)

o-sdn-o · 2021-07-07T13:49:37Z

If you guys stumble across any terminal program that generates static 24-bit color output, let me know.

You can use GIMP. It can export a truecolor image to a C struct.

Menu File->Export As... -> Select File Type C source code -> Export

Replace the C struct at the beginning of the following code with the exported one, and profit. 😀

static const struct /* GIMP RGB C-Source image dump (Untitled2.c) */
{
    int width;
    int height;
    int bytes_per_pixel; /* 2:RGB16, 3:RGB, 4:RGBA */
    unsigned char pixel_data[10 * 10 * 3 + 1];
} gimp_image = {
  10, 10, 3,
  "\377\374\374\377\371\371\377\317\317\377\305\305\377\250\250\377\247\247"
  "\377\250\250\377\311\311\377\345\345\377\376\376\377\307\307\377\300\300"
  "\377YY\377$%\377\036\036\377\036\036\377\037\037\377\064\064\377\226\226\377\372"
  "\372\377\374\374\377\374\374\374\356\361\347y\245\361Ae\377@A\377[[\371\315"
  "\322\370\362\362\377\377\377\377\377\377\376\376\376\350\350\370Z:\361\276"
  "L\236\373>@\366\251\263m_\363\232\232\245\370\370\370\376\376\376\345\345"
  "\354ff\354eS\371\344Yw\366\201\207\213u\350{{\362\341\341\350\376\376\376"
  "\374\374\374\276\276\314\\\\\372\216\216\375\236\224\242\325\304\343\242"
  "Q\310\354\354\370\376\376\376\377\377\377\374\374\377\316\316\375\261\261"
  "\376\257\257\365\253\231\262\353\035o\377\003\003\377\334\334\377\377\377\377"
  "\377\377\374\374\377\316\316\377qq\377\202\202\372YR\363\317\235\257\375"
  "ss\377\366\366\377\377\377\377\377\377\377\377\377\374\374\377\354\354\377"
  "II\377\351\351\374\373\373\373\377\366\366\377\377\377\377\377\377\377\377"
  "\377\377\377\377\377\377\377\377\377\377\363\363\377\377\377\377\377\377"
  "\377\377\377\377\377\377\377\377\377\377\377\377\377",
};

#include <iostream>
#include <string>

struct rgb
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
    bool operator ==(rgb const& c) const { return c.r == r && c.g == g && c.b == b; }
    bool operator !=(rgb const& c) const { return !operator==(c); }
};
std::string fgc(rgb c)
{
    return "\033[38;2;" + std::to_string(c.r) + ';'
                        + std::to_string(c.g) + ';'
                        + std::to_string(c.b) + 'm';
}
std::string bgc(rgb c)
{
    return "\033[48;2;" + std::to_string(c.r) + ';'
                        + std::to_string(c.g) + ';'
                        + std::to_string(c.b) + 'm';
}
int main()
{
    int w = gimp_image.width;
    int h = gimp_image.height;
    int s = gimp_image.bytes_per_pixel;
    int i = 0;     // upper line
    int j = w * s; // lower line
    std::string result;
    rgb old_bg = {};
    rgb old_fg = {};
    result += bgc(old_bg) + fgc(old_fg);
    for (int y = 0; y < h; y += 2)
    {
        for (int x = 0; x < w; x++)
        {
            rgb bg = rgb{ gimp_image.pixel_data[i + 0],
                          gimp_image.pixel_data[i + 1],
                          gimp_image.pixel_data[i + 2] };
            rgb fg = rgb{ gimp_image.pixel_data[j + 0],
                          gimp_image.pixel_data[j + 1],
                          gimp_image.pixel_data[j + 2] };
            if (bg == fg) 
            {
                if (bg != old_bg)
                {
                    old_bg = bg;
                    result += bgc(bg);
                }
                result += " ";
            }
            else 
            {
                if (bg != old_bg)
                {
                    old_bg = bg;
                    result += bgc(bg);
                    if (fg != old_fg)
                    {
                        old_fg = fg;
                        result += fgc(fg);
                    }
                }
                else
                {
                    if (fg != old_fg)
                    {
                        old_fg = fg;
                        result += fgc(fg);
                    }
                }
                result += "▄";
            }
            i += s;
            j += s;
        }
        result += "\033[m\n" + bgc(old_bg) + fgc(old_fg);
        i += w * s;
        j += w * s;
    }
    result += "\033[m";
    std::cout << result;
}

Ooops, a small mistake, the height of the picture must be an even number.

There are some samples here https://gist.github.com/XVilka/8346728

tomek-szczesny · 2021-07-07T15:10:48Z

or just dissect a part of notcurses-info :)

24-bit colors and Unicode seem to work fine, now this shows there's plenty more missing. ;)

o-sdn-o · 2021-07-07T15:19:42Z

Parsing of cursor positioning commands does not work

ESC [ row ; column H

DECSET/DECRST - ESC [ ? n h/ESC [ ? n l

and a couple of others

See
https://invisible-island.net/xterm/ctlseqs/ctlseqs.html
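
A minimal C sketch of parsing sequences of the two shapes listed above (ESC [ row ; column H and ESC [ ? n h/l). The struct, the function names and the return convention are all hypothetical. It only handles the '?' private marker and numeric parameters separated by ';', and it returns 0 when the final byte has not arrived yet so that the caller can carry the partial sequence over to the next buffer.

```c
#include <stddef.h>

#define CSI_MAX_PARAMS 8

// Hypothetical parse result for one CSI sequence.
struct csi {
  int priv;                   // 1 if a '?' follows "ESC ["
  int nparams;                // number of parameter slots recorded
  int params[CSI_MAX_PARAMS]; // -1 marks an omitted parameter
  char final;                 // final byte, e.g. 'H', 'h', 'l', 'm'
};

// Returns bytes consumed, 0 if the sequence is incomplete, -1 if the input
// is not a CSI sequence this sketch understands.
static int parse_csi(const char* buf, size_t len, struct csi* out){
  if(len < 1){ return 0; }
  if(buf[0] != '\x1b'){ return -1; }
  if(len < 2){ return 0; }
  if(buf[1] != '['){ return -1; }
  size_t i = 2;
  out->priv = 0;
  out->nparams = 0;
  out->final = 0;
  if(i < len && buf[i] == '?'){
    out->priv = 1;
    ++i;
  }
  int cur = -1; // -1 until a digit is seen for the current parameter
  for( ; i < len ; ++i){
    unsigned char c = (unsigned char)buf[i];
    if(c >= '0' && c <= '9'){
      cur = (cur < 0 ? 0 : cur) * 10 + (c - '0');
      if(cur > 9999){ cur = 9999; } // clamp to avoid overflow
    }else if(c == ';' || (c >= 0x40 && c <= 0x7e)){
      if(out->nparams < CSI_MAX_PARAMS){
        out->params[out->nparams++] = cur;
      }
      cur = -1;
      if(c != ';'){ // final byte ends the sequence
        out->final = (char)c;
        return (int)(i + 1);
      }
    }else{
      return -1; // intermediate bytes and other markers are not handled
    }
  }
  return 0;
}

// Fetch parameter idx, falling back to def when it was omitted.
static int csi_param(const struct csi* c, int idx, int def){
  if(idx >= c->nparams || c->params[idx] < 0){ return def; }
  return c->params[idx];
}
```

For cursor positioning, a caller would read csi_param(&c, 0, 1) and csi_param(&c, 1, 1) as row and column when c.final is 'H', so that a bare ESC [ H falls back to 1;1.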

tomek-szczesny · 2021-07-07T15:24:34Z

That is all true. When moving cursor around is supported, I think I'll finally be able to run htop LIVE in ncplane! ^^

o-sdn-o · 2021-07-07T15:48:38Z

htop's first frame repertoire:

\e]0; ... \007
\e[?1049h                                                                                
\e[22;0;0t
\e[1;27r
\e(B
\e[m
\e[4l
\e[?7h
\e[?1h
\e=
\e[?25l 
\e[?1000h 
\e[2J
\e[J
\e[2d
\e[30X    
\e[K
\e[?12l
\e[?25h
\e[?1000l
\e[?1049l
\e[?1l

Essential:
\e[m \e[2J \e[J \e[2d \e[30X \e[K \e[H
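
The first entry in the list, `\e]0; ... \007`, is an OSC string (here, the window title). Even a parser that ignores titles has to find where the string ends so that the payload is not printed as text. A small sketch in C, assuming the string ends with either BEL or ST (`ESC \`); `osc_length` is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Return the total length of the OSC sequence starting at buf
 * ("ESC ]" ... terminator), or 0 if buf does not start with OSC or
 * the terminator has not arrived yet. */
static size_t osc_length(const char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 2 || buf[0] != '\x1b' || buf[1] != ']')
        return 0;
    for (size_t i = 2; i < len; i++) {
        if (buf[i] == '\x07')                 /* BEL terminator */
            return i + 1;
        if (buf[i] == '\x1b' && i + 1 < len && buf[i + 1] == '\\')
            return i + 2;                     /* ST = ESC \ */
    }
    return 0;
}
```

A caller can skip `osc_length()` bytes and continue parsing at the next sequence.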

tomek-szczesny · 2021-07-07T16:13:02Z

I guess you are very fond of that VTM toy of yours. :)
Thanks a lot, that's a pretty handy checklist.

dankamongmen · 2021-07-07T16:15:42Z

VTM is awesome!

o-sdn-o · 2021-07-07T16:37:44Z

VTM is awesome!

Thank you! I hope this multiplexer will be useful to someone for pair programming. Its main functionality is live session sharing (over SSH or otherwise).

tomek-szczesny · 2021-07-07T16:38:59Z

I have just issued the most fucked up vim command ever. And it worked as expected!

// TODO:
//
// \e[m                 // SGR (TODO: Default argument)
// \e[2J                // Erase in display (args 0-3) 
// \e[J                 // Erase in display 0
// \e[2d                // Line Position Absolute (Default 1)
// \e[30X               // Erase 30 characters (Default 1)
// \e[K                 // Erase in line, args 0-2 (default 0)
// \e[y;xH              // Move cursor to y,x
// \e]0; ... \007       // ESC ] = OSC, terminated with BEL (0x07) or ST (0x1b \), or nothing
// \e[?1049h            // Alternative screen buffer
// \e[?1049l            // Disable alternative screen buffer
// \e[1;27r             // Set scrolling region (from, to) (default top, bottom)
// \e[4h                // Set Mode (12 = Send/Receive; 20 = automatic newline; 4 = insert mode; +1)
// \e[4l                // Reset Mode (2 = Keyboard Action Mode, 4 = Replace mode; +2)
// \e[?7h               // Auto wrap mode (DECAWM)
// \e[?25h              // Show cursor
// \e[?25l              // Hide cursor
// \e[?1000h            // Send Mouse X & Y on button press and release. This is the X11 xterm mouse protocol.
// \e[?1000l            // Don't send...
//
// Essential:
// \e[m \e[2J \e[J \e[2d \e[30X \e[K \e[H
//
// WTF SEQUENCES:
// \e=                  // Application Keypad (DECKPAM)
// \e[?1h               // Application cursor keys (DECCKM)
// \e[?1l               // Normal Cursor Keys
//
// WON'T IMPLEMENT:
// \e(B                 // G0 character set -> USASCII 
// \e[22;0;0t           // Window Manipulation (XTWINOPS)
// \e[?12l              // Stop blinking cursor (\e[?12h starts it)

What have I done...
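
Several entries in the TODO list note a default argument. A rough sketch in C of how those defaults could be applied for the cursor-moving sequences (`H`, `d`) and for `X`, assuming omitted parameters arrive as -1 and that 0 is treated like 1 for these three sequences. All names here (`struct cursor`, `param_or`, and so on) are hypothetical. The erase sequences `J` and `K` are left out because 0 is a meaningful argument for them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cursor state for a rows x cols plane, 0-based. */
struct cursor { int y, x, rows, cols; };

/* An omitted (-1) or zero parameter falls back to def. */
static int param_or(int value, int def)
{
    return value > 0 ? value : def;
}

static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* CUP: ESC [ y ; x H, both parameters are 1-based and default to 1. */
static void cursor_position(struct cursor *c, int y, int x)
{
    assert(c != NULL && c->rows > 0 && c->cols > 0);
    c->y = clamp_int(param_or(y, 1), 1, c->rows) - 1;
    c->x = clamp_int(param_or(x, 1), 1, c->cols) - 1;
}

/* VPA: ESC [ y d, moves to row y and keeps the column. */
static void line_position_absolute(struct cursor *c, int y)
{
    assert(c != NULL && c->rows > 0);
    c->y = clamp_int(param_or(y, 1), 1, c->rows) - 1;
}

/* ECH: ESC [ n X, returns how many cells to blank starting at the
 * cursor, limited to the end of the line; the cursor does not move. */
static int erase_chars_count(const struct cursor *c, int n)
{
    assert(c != NULL && c->x >= 0 && c->x < c->cols);
    return clamp_int(param_or(n, 1), 1, c->cols - c->x);
}
```

Clamping to the plane size keeps a stray `\e[999;999H` from moving the cursor outside the plane.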

o-sdn-o · 2021-07-07T20:21:22Z

// WTF SEQUENCES:
// \e= // Application Keypad (DECKPAM)
// \e[?1h // Application cursor keys (DECCKM)
// \e[?1l // Normal Cursor Keys

As far as I know, these sequences change the mode/format of the keystrokes that are sent by the terminal to the application.
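
As an illustration of that change in keystroke format: with application cursor keys enabled (`\e[?1h`), the arrow keys are sent as `ESC O A` to `ESC O D`; after `\e[?1l` they are sent as `ESC [ A` to `ESC [ D`. A minimal sketch in C of the terminal side, where `encode_arrow` is a hypothetical helper and the keypad mode set by `\e=` is not modeled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Encode an arrow key as the terminal would send it to the application.
 * dir is 'A' (up), 'B' (down), 'C' (right) or 'D' (left).
 * Writes a NUL-terminated sequence into out and returns its length,
 * or 0 if dir is not an arrow key. */
static size_t encode_arrow(char dir, bool app_cursor_keys, char out[4])
{
    assert(out != NULL);
    if (dir < 'A' || dir > 'D')
        return 0;
    out[0] = '\x1b';
    out[1] = app_cursor_keys ? 'O' : '[';   /* SS3 vs CSI introducer */
    out[2] = dir;
    out[3] = '\0';
    return 3;
}
```

An emulator would track the mode as a flag toggled by `\e[?1h` / `\e[?1l` and consult it whenever it forwards a cursor key to the child process.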

tomek-szczesny · 2021-07-08T10:04:40Z

@o-sdn-o , I invite you to share your VTspeak knowledge in my dedicated repo :)
https://github.com/tomek-szczesny/notcurses-vt-proto/issues
Some issues are marked as "question" or "discussion", but feel free to explore them all or add whatever you feel is useful.
@dankamongmen, you may want to watch this repo too, don't feel ignored. :)

o-sdn-o · 2021-07-08T11:49:47Z

@tomek-szczesny Two more useful links

Terminal developers rallying point
https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues

Terminal capabilities to applications
https://gitlab.freedesktop.org/gnachman/specifications/-/tree/feature_reporting/proposals/feature-reporting

tomek-szczesny added the "userquestion" label (not quite bugs--inquiries from users) Jul 3, 2021

dankamongmen self-assigned this Jul 3, 2021

dankamongmen added this to the 3.0.0 milestone Jul 3, 2021

dankamongmen closed this as completed Jul 3, 2021

dankamongmen added a commit that referenced this issue Jul 4, 2021

add utf8_codepoint_length() #1871

acc6637

dankamongmen modified the milestones: 3.0.0, 2.4.0 Aug 24, 2021
