Why ncplane_putegc() needs termination character? #1871

Closed
tomek-szczesny opened this issue Jul 3, 2021 · 51 comments

@tomek-szczesny
Contributor

As far as Wiki's concerned, UTF-8 can be unambiguously parsed without any sort of termination. Why must there be a termination character at the end of a wchar_t*?

@tomek-szczesny tomek-szczesny added the userquestion not quite bugs--inquiries from users label Jul 3, 2021
@o-sdn-o

o-sdn-o commented Jul 3, 2021

UTF-8 can be unambiguously parsed without any sort of termination to get a single codepoint.

In this case, the wiki is talking about a sequence of Unicode code points that form a grapheme cluster.

single UTF-8 codepoint = byte + byte + ... + byte
single wchar_t codepoint ~ = wchar_t or (wchar_t + wchar_t for surrogate pair)

grapheme cluster = EGC = codepoint + ... + codepoint + string_terminator

in expanded form:

UTF-8:
grapheme cluster = EGC = (byte + ... + byte) + ... + (byte + ... + byte) + string_terminator

wchar_t:
grapheme cluster = EGC = (wchar_t) + ... + (wchar_t) + string_terminator
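
To make the layering of bytes, code points, and clusters concrete, here is a minimal sketch in C. The helper name utf8_count_codepoints is made up for illustration and is not part of Notcurses, and it assumes the input is already valid UTF-8. It counts code points by skipping continuation bytes (those of the form 10xxxxxx).

```c
#include <stddef.h>

// Hypothetical helper, not a Notcurses function: counts code points in a
// NUL-terminated string that is assumed to be valid UTF-8.
static size_t utf8_count_codepoints(const char* s){
  size_t n = 0;
  for( ; *s ; ++s){
    // continuation bytes look like 10xxxxxx; any other byte starts a code point
    if(((unsigned char)*s & 0xc0) != 0x80){
      ++n;
    }
  }
  return n;
}
```

For the string "e\xcc\x81" (the letter e followed by U+0301 COMBINING ACUTE ACCENT) the byte length is 3 and the helper reports 2 code points, yet the pair is displayed as a single grapheme cluster.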

@tomek-szczesny
Contributor Author

Okay, my bad, I mixed up two things in here. Let me put it this way:

ncplane_putwegc(): writes a single EGC from an array of wchar_t
ncplane_putegc(): writes a single EGC from an array of UTF-8

static inline int ncplane_putegc(struct ncplane* n, const char* gclust, int* sbytes);

Functions accepting a single EGC expect (...) a series of UTF-8 char terminated by '\0'.

I have a byte stream that I'm trying to parse, including UTF-8 support. It seems I cannot just point putegc() at the beginning of a UTF-8 codepoint in a buffer. Instead I'll have to copy it first, terminate it with '\0', and then call putegc() on it. Is that correct?

@o-sdn-o

o-sdn-o commented Jul 3, 2021

You can send the entire UTF-8 string, these functions will bite off exactly as many bytes from the beginning of the string as the length of the grapheme cluster. These functions

ncplane_putegc and friends

notcurses/src/lib/notcurses.c

Lines 1632 to 1640 in 1287d8b

int ncplane_putegc_yx(ncplane* n, int y, int x, const char* gclust, int* sbytes){
  int cols;
  int bytes = utf8_egc_len(gclust, &cols);
  if(bytes < 0){
    return -1;
  }
  if(sbytes){
    *sbytes = bytes;
  }

return in the sbytes variable how many bytes they have bitten off from the beginning of the string.

// Eat an EGC from the UTF-8 string input, counting bytes and columns. We use
// libunistring's uc_is_grapheme_break() to segment EGCs. Writes the number of
// columns to '*colcount'. Returns the number of bytes consumed, not including
// any NUL terminator. Neither the number of bytes nor columns is necessarily
// equal to the number of decoded code points. Such are the ways of Unicode.
// uc_is_grapheme_break() wants UTF-32, which is fine, because we need wchar_t
// to use wcwidth() anyway FIXME except this doesn't work with 16-bit wchar_t!
static inline int
utf8_egc_len(const char* gcluster, int* colcount){
  size_t ret = 0;
  *colcount = 0;
  int r;
  mbstate_t mbt;
  memset(&mbt, 0, sizeof(mbt));
  wchar_t wc, prevw = 0;
  do{

@o-sdn-o

o-sdn-o commented Jul 3, 2021

The following check is responsible for stopping at the end of the grapheme cluster:

if(prevw && uc_is_grapheme_break(prevw, wc)){
break; // starts a new EGC, exit and do not claim
}

It turns out that the string terminator is needed only at the end of the entire UTF-8 string.

@o-sdn-o

o-sdn-o commented Jul 3, 2021

gclust = UTF-8 line = (EGC1) + ... + (EGCn) + string_terminator

where
EGC = codepoint + ... + codepoint
codepoint = byte + ... + byte

after calling ncplane_putegc( ... const char* gclust, int* sbytes), *sbytes contains the length of EGC1 in bytes

after that you can
gclust += *sbytes;

gclust = (EGC2) + ... + (EGCn) + string_terminator

@tomek-szczesny
Contributor Author

I have an SSH stream (or any other terminal stream) that is a mixture of plain text, terminal escape codes and possibly UTF-8 EGCs here and there. I'm making a parser for whatever I may stumble across. What I want to do is to recognize UTF-8 codepoints, so these are properly rendered, and properly handled if they happen to be split across buffer iterations.

So, no, I'm not dealing with a well-defined UTF-8 string or anything like that. I need an answer to the question exactly as it was asked. If my parser detects a UTF-8 EGC, I have to put that EGC into the ncplane, and I need to know the most efficient way to do so, for the sake of the putvt() functionality I'm working on. I'll try just shoving it into putegc() and see if it works, but even if it does, the documentation should get an update, I guess?

@dankamongmen
Owner

UTF-8 can be unambiguously parsed without any sort of termination to get a single codepoint.
In this case, the wiki is talking about a sequence of Unicode code points that form a grapheme cluster.

i was unaware that the talented @o-sdn-o was reading our bugs, but welcome! and he is correct. an individual encoded Unicode codepoint can be lexed without explicit termination, but this is not true for EGCs.

@o-sdn-o

o-sdn-o commented Jul 3, 2021

I'm not dealing with well defined UTF-8 string or anything like that. I need an answer to a question exactly like it has been asked. If my parser detects UTF-8 EGC, I have to put EGC into the ncplane, and I need to know the most efficient way for the sake of putvt() functionality I'm working at.

The value of the first byte in the string determines what to retrieve first. If the first byte is a control byte, treat it and the necessary subsequent bytes as an escape sequence. Otherwise, this is the beginning of a grapheme cluster.

@o-sdn-o

o-sdn-o commented Jul 3, 2021

It is important to keep in mind the following points:

  • a byte can be an ASCII control character C0
  • can be the first byte of a UTF-8-encoded control code point
  • or a C1 control character
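
The three cases above can be sketched as a first-byte classifier. This is only an illustration in C: the enum and function names are invented here and are not Notcurses API, and how a raw 0x80-0x9f byte should be treated depends on whether the stream is known to be UTF-8.

```c
#include <stdbool.h>

// Hypothetical classification of the first byte of the next token in a
// terminal byte stream; all names here are illustrative only.
typedef enum {
  BYTE_C0_CONTROL,          // 0x00-0x1f and 0x7f (includes ESC, 0x1b)
  BYTE_ASCII,               // printable ASCII, 0x20-0x7e
  BYTE_C1_OR_CONTINUATION,  // 0x80-0x9f: raw 8-bit C1, or a stray UTF-8 continuation
  BYTE_MAYBE_C1,            // 0xc2: a UTF-8 encoded C1 control if the next byte is 0x80-0x9f
  BYTE_UTF8_LEAD,           // any other valid multi-byte lead byte (0xc3-0xf4)
  BYTE_INVALID              // 0xa0-0xc1 and 0xf5-0xff never start a UTF-8 sequence
} byte_class;

static byte_class classify_first_byte(unsigned char b){
  if(b < 0x20 || b == 0x7f) return BYTE_C0_CONTROL;
  if(b < 0x80) return BYTE_ASCII;
  if(b < 0xa0) return BYTE_C1_OR_CONTINUATION;
  if(b == 0xc2) return BYTE_MAYBE_C1;
  if(b >= 0xc3 && b <= 0xf4) return BYTE_UTF8_LEAD;
  return BYTE_INVALID;
}

// True if the two bytes encode a C1 control (U+0080..U+009F) in UTF-8.
static bool is_utf8_c1(unsigned char b0, unsigned char b1){
  return b0 == 0xc2 && b1 >= 0x80 && b1 <= 0x9f;
}
```

A parser would dispatch to escape-sequence handling for the control classes and hand everything else to the EGC path.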

@dankamongmen
Owner

@tomek-szczesny we've got functions for most of this kind of thing. what you almost certainly want for this is to use nccell_load() on the input stream, breaking it up into nccells at EGC boundaries, and then use ncplane_putc() to put those nccells down onto the output plane. this would require keeping your own idea of current color around, since nccell encompasses color. if you didn't want to do that, use ncplane_putegc() on the input stream directly. you'll get a return value indicating the number of columns consumed, and the sbytes parameter will hold the number of bytes consumed.

@dankamongmen
Owner

dankamongmen commented Jul 3, 2021

the following will take an input stream and spit it out as EGCs to a plane (not tested):

// returns columns consumed, or -1 on invalid EGC / out of output space
int spray_utf8_egcs(const char* utf8text, struct ncplane* n, int* sbytes){
  int cols = 0;
  *sbytes = 0;
  while(*utf8text){
    int b, c;
    if((c = ncplane_putegc(n, utf8text, &b)) <= 0){
      return -1;
    }
    utf8text += b;
    *sbytes += b;
    cols += c;
  }
  return cols;
}

@tomek-szczesny , this is probably the easiest and fastest way to do what you want, if i understand you correctly. most of what you're doing could pretty much be this function plus an escape check prior to the ncplane_putegc(), diverting into extraction and dispatch of said control sequence. in fact...yeah, i hope that's exactly how you're writing this (ideally in a way that can be fed streaming data, which is a bit more difficult, and you can hold off on if you'd like).
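
The "escape check prior to the ncplane_putegc()" could look roughly like the sketch below. It is plain C with no Notcurses calls, the function name csi_sequence_len is hypothetical, and it only recognizes CSI sequences (ESC [ ... final byte); OSC and other escapes would need their own handling. It reports an incomplete sequence separately so that a streaming caller can wait for more data.

```c
#include <stddef.h>

// Hypothetical helper: measures a CSI control sequence (ESC [ ... final) at
// the start of buf. Returns its length in bytes, 0 if more data is needed to
// decide, or -1 if buf does not start with a well-formed CSI sequence.
static int csi_sequence_len(const char* buf, size_t len){
  if(len == 0) return 0;
  if(buf[0] != 0x1b) return -1;
  if(len < 2) return 0;                  // lone ESC: need more data
  if(buf[1] != '[') return -1;           // some other escape, not handled here
  for(size_t i = 2 ; i < len ; ++i){
    unsigned char c = (unsigned char)buf[i];
    if(c >= 0x40 && c <= 0x7e){          // final byte ends the sequence
      return (int)(i + 1);
    }
    if(c < 0x20 || c > 0x3f){            // not a parameter or intermediate byte
      return -1;
    }
  }
  return 0;                              // ran out of bytes before the final byte
}
```

A positive result is a span to extract and dispatch, 0 means "carry these bytes over to the next buffer", and -1 means the text path (or another escape handler) should take the input.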

@dankamongmen
Owner

It is important to keep in mind the following points:

  • a byte can be an ASCII control character C0
  • can be the first byte of a UTF-8-encoded control code point
  • or a C1 control character

a good point is raised here -- all the ncplane_*() output functions will reject a C0 character (except for '\n' iff the plane is set to scroll, and horizontal tab in the near future). see is_control_egc():

// is it a control character? check C0 and C1, but don't count empty strings,
// nor single-byte strings containing only a NUL character.
static inline bool
is_control_egc(const unsigned char* egc, int bytes){
  if(bytes == 1){
    if(*egc && iscntrl(*egc)){
      return true;
    }
  }else if(bytes == 2){
    // 0xc2 followed by 0x80--0x9f are controls. 0xc2 followed by <0x80 is
    // simply invalid utf8.
    if(egc[0] == 0xc2){
      if(egc[1] < 0xa0){
        return true;
      }
    }
  }
  return false;
}

@dankamongmen
Owner

utf8_egc_len() is also relevant, as it implements our segmentation algorithm:

// Eat an EGC from the UTF-8 string input, counting bytes and columns. We use
// libunistring's uc_is_grapheme_break() to segment EGCs. Writes the number of
// columns to '*colcount'. Returns the number of bytes consumed, not including
// any NUL terminator. Neither the number of bytes nor columns is necessarily
// equal to the number of decoded code points. Such are the ways of Unicode.
// uc_is_grapheme_break() wants UTF-32, which is fine, because we need wchar_t
// to use wcwidth() anyway FIXME except this doesn't work with 16-bit wchar_t!
static inline int
utf8_egc_len(const char* gcluster, int* colcount){
  size_t ret = 0;
  *colcount = 0;
  int r;
  mbstate_t mbt;
  memset(&mbt, 0, sizeof(mbt));
  wchar_t wc, prevw = 0;
  do{
    r = mbrtowc(&wc, gcluster, MB_CUR_MAX, &mbt);
    if(r < 0){
      return -1;
    }
    if(prevw && uc_is_grapheme_break(prevw, wc)){
      break; // starts a new EGC, exit and do not claim
    }
    int cols = wcwidth(wc);
    if(cols < 0){
      if(iswspace(wc)){ // newline or tab
        return ret + 1;
      }
      return -1;
    }
    *colcount += cols;
    ret += r;
    gcluster += r;
    prevw = wc;
  }while(r);
  return ret;
}

@dankamongmen
Owner

note that this algorithm is imperfect, because (a) wcwidth() is imperfect and (b) i don't think this accounts for e.g. ZWJ composed emoji (maybe it does? not for width, though). i wouldn't worry about that. see #1329. @joseluis gets mad at me about this every so often, but doing it perfectly would require knowledge of the (a) font (b) font rendering engine and (c) terminal font system, and fuck all that noise.

@dankamongmen dankamongmen self-assigned this Jul 3, 2021
@dankamongmen dankamongmen added this to the 3.0.0 milestone Jul 3, 2021
@dankamongmen
Owner

i think this is about everything needed to be said? closing this up. good discussion.

@tomek-szczesny
Contributor Author

Well, that's more of a mess than I was hoping for. No wonder why I ended up as an electronics engineer, where no fucked up heritage of a dozen of character encodings clogs up efficient development.
I'll look into that tomorrow, and clean up my code before making it public, because I see expectations lurking here and there.. ;)
Thanks for the tips, most interesting indeed. I'll try to make use of that later on. Just need a sample output that actually uses anything besides ASCII, for testing. Hm, would be far easier if I knew Russian or something ;)

@dankamongmen
Owner

Well, that's more of a mess than I was hoping for. No wonder why I ended up as an electronics engineer, where no fucked up heritage of a dozen of character encodings clogs up efficient development.
I'll look into that tomorrow, and clean up my code before making it public, because I see expectations lurking here and there.. ;)
Thanks for the tips, most interesting indeed. I'll try to make use of that later on. Just need a sample output that actually uses anything besides ASCII, for testing. Hm, would be far easier if I knew Russian or something ;)

i mean, i would hope your code can faithfully reproduce your own last name =]. i'm happy to look over ASCII-only code, but i'm not going to merge it in that condition. full unicode support is a fundamental feature of Notcurses.

@dankamongmen
Owner

i mean, i would hope your code can faithfully reproduce your own last name =]. i'm happy to look over ASCII-only code, but i'm not going to merge it in that condition. full unicode support is a fundamental feature of Notcurses.

and while having to deal with character encodings is indeed one of the less pleasant elements of computer science, we make that back through those old watchwords, modularity and encapsulation. by using the functions mentioned, your code oughtn't need know anything about unicode other than "i need to use these functions to segment EGCs". if there's anything missing, let me know, but i think i've fleshed out the whole unicode/EGC thing pretty thoroughly, and unit tested the hell out of it. so just make sure you're using the functionality available, and it ought not be much more difficult than the ASCII-only equivalent. =]

@tomek-szczesny
Contributor Author

i mean, i would hope your code can faithfully reproduce your own last name =]. i'm happy to look over ASCII-only code, but i'm not going to merge it in that condition. full unicode support is a fundamental feature of Notcurses.

Ideally my code should play notcurses-demo inside ncplane, including the unicode orgy. No worries, I'm not giving up on a functionality just because I have a hard time understanding it.
I guess I may benefit from reading that chapter from your book again.

The sole reason why I need to deal with UTF-8 stuff is to protect the continuity of the byte stream. My code must be aware of any multi-byte chunk (be it UTF-8 or VTspeak) and carry over the unresolved stub in front of the next buffer content. I did that with a few SGRs and it works pretty well even with 16-byte buffer.
The function returning a byte length of a codepoint would be most helpful, if it can draw conclusions just by looking at the first byte. I cannot guarantee the whole codepoint is in the buffer.
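
The carry-over described here can be sketched without any library support. The C below is an illustration only: both function names are hypothetical and this is not the helper that was later added to Notcurses. It looks at no more than the last three bytes of a buffer and reports how many of them form the start of a multi-byte codepoint that has not fully arrived.

```c
#include <stddef.h>

// Expected sequence length for a UTF-8 lead byte, or 0 if b cannot start a
// sequence. Hypothetical helper, not the Notcurses implementation.
static int utf8_lead_len(unsigned char b){
  if(b < 0x80) return 1;
  if(b >= 0xc2 && b <= 0xdf) return 2;
  if(b >= 0xe0 && b <= 0xef) return 3;
  if(b >= 0xf0 && b <= 0xf4) return 4;
  return 0;
}

// Returns how many bytes at the end of buf[0..len) belong to a multi-byte
// sequence that has started but is not yet complete. Those bytes would be
// carried over and placed in front of the next buffer's content.
static size_t utf8_incomplete_tail(const char* buf, size_t len){
  // an incomplete sequence has its lead byte at most 3 bytes from the end
  for(size_t back = 1 ; back <= 3 && back <= len ; ++back){
    unsigned char b = (unsigned char)buf[len - back];
    if((b & 0xc0) == 0x80){
      continue;                    // continuation byte: keep looking for the lead
    }
    int need = utf8_lead_len(b);
    return (need > (int)back) ? back : 0;
  }
  return 0; // only continuation bytes seen, or empty buffer: nothing to carry
}
```

With a buffer ending in 0xe2 0x82 (the first two bytes of the three-byte euro sign) the result is 2, so the parser would process len - 2 bytes now and keep the stub for the next read.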

dankamongmen added a commit that referenced this issue Jul 4, 2021
@dankamongmen
Owner

The function returning a byte length of a codepoint would be most helpful, if it can draw conclusions just by looking at the first byte. I cannot guarantee the whole codepoint is in the buffer.

yeah this makes total sense, let me whip something up. done.

@tomek-szczesny
Contributor Author

Awesome, thanks!

@o-sdn-o

o-sdn-o commented Jul 4, 2021

add utf8_codepoint_length() #1871

Invalid ranges for UTF-8 first bytes:

  • [0x80, 0xc1]
  • [0xf5, 0xff]

These bytes cannot appear in valid UTF-8. For invalid bytes, the length must be either 0 or 1.

@o-sdn-o

o-sdn-o commented Jul 4, 2021

Consider a table lookup

    // utf: First byte based UTF-8 codepoint lengths.
    int utf8lengths[] =
    {	//      0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
        /* 0 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 1 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 2 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 3 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 4 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 5 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 6 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 7 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        /* 8 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 9 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* A */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* B */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* C */ 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        /* D */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        /* E */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
        /* F */ 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    };
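
For comparison, a table with the same contents can be generated from the valid lead-byte ranges instead of being typed by hand. This is a sketch under my own assumptions: the names utf8_len_table and utf8_len_table_init are invented, and it uses one byte per entry (256 bytes total) rather than int. Whether any table beats a few branches would still need measuring.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical byte-sized lookup table: entry b is the sequence length
// implied by lead byte b, or 0 if b cannot start a UTF-8 sequence.
static uint8_t utf8_len_table[256];

static void utf8_len_table_init(void){
  memset(utf8_len_table, 0, sizeof(utf8_len_table)); // default: invalid
  memset(utf8_len_table + 0x00, 1, 0x80);            // 0x00-0x7f: ASCII
  memset(utf8_len_table + 0xc2, 2, 0xe0 - 0xc2);     // 0xc2-0xdf: two bytes
  memset(utf8_len_table + 0xe0, 3, 0x10);            // 0xe0-0xef: three bytes
  memset(utf8_len_table + 0xf0, 4, 0x05);            // 0xf0-0xf4: four bytes
}
```

After utf8_len_table_init() has run, utf8_len_table[(unsigned char)c] gives the same values as the rows listed above, with 0x80-0xc1 and 0xf5-0xff left at 0.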

@dankamongmen
Owner

add utf8_codepoint_length() #1871

Invalid ranges for UTF-8 first bytes:

  • [0x80, 0xc1]
  • [0xf5, 0xff]

These bytes cannot appear in valid UTF-8. For invalid bytes, the length must be either 0 or 1.

ooooh indeed, fixed up, thanks! https://unicode.org/versions/corrigendum1.html

@dankamongmen
Owner

Consider a table lookup

i would want to microbenchmark before doing any such thing. i personally doubt that even three branch mispredictions (for the four byte case) is going to come anywhere close to a cacheline fill for this 2KB (32 64B cache lines).

i'd further suspect that at least one compiler out of gcc, clang, and icc can generate this lookup table from the if statements if it's indeed the optimal way to do things =]. a partitioning conditional of the kind i have looks just like this to compiler internals anyway.

and this is far less debuggable =]

@dankamongmen
Owner

i would want to microbenchmark before doing any such thing. i personally doubt that even three branch mispredictions (for the four byte case) is going to come anywhere close to a cacheline fill for this 2KB (32 64B cache lines).

i'd further suspect that at least one compiler out of gcc, clang, and icc can generate this lookup table from the if statements if it's indeed the optimal way to do things =]. a partitioning conditional of the kind i have looks just like this to compiler internals anyway.

and this is far less debuggable =]

to be clear, i appreciate the suggestion, and your verve for lookup tables. i just personally bet that (in a benchmark that evicts cachelines regularly, so the entire 2KB doesn't sit untrammeled in L1) the few branch mispredictions are at most 3xpipeline cycles lost (call it 60 cycles), whereas pulling a cacheline in from DRAM is going to be hundreds of cycles minimum. assume that this structure gets wholly evicted between calls (quite plausible). there's a single compulsory cache miss, so unless you think this table is not going to be evicted, you're betting AMAT vs pipeline length. i'll take pipeline length every time, or at least with every processor released since 2000.

@o-sdn-o

o-sdn-o commented Jul 4, 2021

i would want to microbenchmark before doing any such thing. i personally doubt that even three branch mispredictions (for the four byte case) is going to come anywhere close to a cacheline fill for this 2KB (32 64B cache lines).

It seems to me that the function call is too expensive and, if used, it must be explicitly inlined. Perhaps the compiler will do it, but it probably depends on where the function is used.

It also seems to me that there are a lot of other calls between successive calls to check the UTF-8 length. This overrides any optimization.

Here you need to profile each specific case. It is not known how the card will fall. 🙂

@tomek-szczesny
Contributor Author

Would that help if I promised not to call it if i'm >5 bytes away from the end of the buffer? :D

@o-sdn-o

o-sdn-o commented Jul 4, 2021

Would that help if I promised not to call it if i'm >5 bytes away from the end of the buffer? :D

You have to find out the length anyway.

@tomek-szczesny
Contributor Author

Not if I'm away from the end of the buffer - in that case I may safely call putegc().

@dankamongmen
Owner

It seems to me that the function call is too expensive and, if used, it must be explicitly inlined. Perhaps the compiler will do it, but it probably depends on where the function is used.

it's static inline! i do not expect to see this function ever show up as an identifier

It also seems to me that there are a lot of other calls between successive calls to check the UTF-8 length. This overrides any optimization.
Here you need to profile each specific case. It is not known how the card will fall.

of course =]

@tomek-szczesny
Contributor Author

Yay, I implemented UTF-8 yesterday in my proto-VT.
Indeed, I successfully used putegc() pointed at specific places in my buffer and it parsed the underlying EGCs no problem, EXCEPT for one case.
As far as I can tell, putegc() always fails when pointed at the last complete codepoint in the buffer if the following codepoint is incomplete.
I didn't find the underlying cause, I just assumed that if putegc() fails, it's time to wrap up the state machine party and deal with the rest of the buffer on next iteration.

image

@o-sdn-o

o-sdn-o commented Jul 7, 2021

it's static inline! i do not expect to see this function ever show up as an identifier

I was wrong. I missed this detail.

@o-sdn-o

o-sdn-o commented Jul 7, 2021

Just need a sample output that actually uses anything besides ASCII, for testing. Hm, would be far easier if I knew Russian or something ;)

@tomek-szczesny Here you can find cool samples to test your vt-parser https://16colo.rs/

@dankamongmen
Owner

@tomek-szczesny Here you can find cool samples to test your vt-parser https://16colo.rs/

love it

@tomek-szczesny
Contributor Author

@tomek-szczesny Here you can find cool samples to test your vt-parser https://16colo.rs/

That's a nice collection, but I'm pretty sure ASCII chars and 16 colors are well tested by now. ;)

If you guys stumble across any terminal program that generates static 24-bit color output, let me know. For now this is the unchallenged feature that in theory is implemented. :)

@o-sdn-o

o-sdn-o commented Jul 7, 2021

If you guys stumble across any terminal program that generates static 24-bit color output, let me know.

You can use GIMP. It can export a truecolor image as a C struct.

Menu File->Export As... -> Select File Type C source code -> Export

Replace exported C-struct at the beginning of the next code and profit. 😀

static const struct /* GIMP RGB C-Source image dump (Untitled2.c) */
{
    int width;
    int height;
    int bytes_per_pixel; /* 2:RGB16, 3:RGB, 4:RGBA */
    unsigned char pixel_data[10 * 10 * 3 + 1];
} gimp_image = {
  10, 10, 3,
  "\377\374\374\377\371\371\377\317\317\377\305\305\377\250\250\377\247\247"
  "\377\250\250\377\311\311\377\345\345\377\376\376\377\307\307\377\300\300"
  "\377YY\377$%\377\036\036\377\036\036\377\037\037\377\064\064\377\226\226\377\372"
  "\372\377\374\374\377\374\374\374\356\361\347y\245\361Ae\377@A\377[[\371\315"
  "\322\370\362\362\377\377\377\377\377\377\376\376\376\350\350\370Z:\361\276"
  "L\236\373>@\366\251\263m_\363\232\232\245\370\370\370\376\376\376\345\345"
  "\354ff\354eS\371\344Yw\366\201\207\213u\350{{\362\341\341\350\376\376\376"
  "\374\374\374\276\276\314\\\\\372\216\216\375\236\224\242\325\304\343\242"
  "Q\310\354\354\370\376\376\376\377\377\377\374\374\377\316\316\375\261\261"
  "\376\257\257\365\253\231\262\353\035o\377\003\003\377\334\334\377\377\377\377"
  "\377\377\374\374\377\316\316\377qq\377\202\202\372YR\363\317\235\257\375"
  "ss\377\366\366\377\377\377\377\377\377\377\377\377\374\374\377\354\354\377"
  "II\377\351\351\374\373\373\373\377\366\366\377\377\377\377\377\377\377\377"
  "\377\377\377\377\377\377\377\377\377\377\363\363\377\377\377\377\377\377"
  "\377\377\377\377\377\377\377\377\377\377\377\377\377",
};

#include <iostream>
#include <string>

struct rgb
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
    bool operator ==(rgb const& c) { return c.r == r && c.g == g && c.b == b; }
    bool operator !=(rgb const& c) { return !operator==(c); }
};
std::string fgc(rgb c)
{
    return "\033[38;2;" + std::to_string(c.r) + ';'
                        + std::to_string(c.g) + ';'
                        + std::to_string(c.b) + 'm';
}
std::string bgc(rgb c)
{
    return "\033[48;2;" + std::to_string(c.r) + ';'
                        + std::to_string(c.g) + ';'
                        + std::to_string(c.b) + 'm';
}
int main()
{
    int w = gimp_image.width;
    int h = gimp_image.height;
    int s = gimp_image.bytes_per_pixel;
    int i = 0;     // upper line
    int j = w * s; // lower line
    std::string result;
    rgb old_bg = {};
    rgb old_fg = {};
    result += bgc(old_bg) + fgc(old_fg);
    for (int y = 0; y < h; y += 2)
    {
        for (int x = 0; x < w; x++)
        {
            rgb bg = rgb{ gimp_image.pixel_data[i + 0],
                          gimp_image.pixel_data[i + 1],
                          gimp_image.pixel_data[i + 2] };
            rgb fg = rgb{ gimp_image.pixel_data[j + 0],
                          gimp_image.pixel_data[j + 1],
                          gimp_image.pixel_data[j + 2] };
            if (bg == fg) 
            {
                if (bg != old_bg)
                {
                    old_bg = bg;
                    result += bgc(bg);
                }
                result += " ";
            }
            else 
            {
                if (bg != old_bg)
                {
                    old_bg = bg;
                    result += bgc(bg);
                    if (fg != old_fg)
                    {
                        old_fg = fg;
                        result += fgc(fg);
                    }
                }
                else
                {
                    if (fg != old_fg)
                    {
                        old_fg = fg;
                        result += fgc(fg);
                    }
                }
                result += "▄"; // lower half block (U+2584): fg paints the lower pixel, bg the upper
            }
            i += s;
            j += s;
        }
        result += "\033[m\n" + bgc(old_bg) + fgc(old_fg);
        i += w * s;
        j += w * s;
    }
    result += "\033[m";
    std::cout << result;
}

Ooops, a small mistake, the height of the picture must be an even number.
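
One possible way to lift the even-height restriction, sketched under assumptions: treat the missing lower row as a fallback color instead of reading past the end of the buffer. The helper name `pixel_or` and its parameter list are hypothetical, and `rgb` is redefined here only so the snippet stands alone.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct rgb
{
    std::uint8_t r, g, b;
    bool operator==(const rgb& o) const { return r == o.r && g == o.g && b == o.b; }
    bool operator!=(const rgb& o) const { return !(*this == o); }
};

// Hypothetical helper: returns the pixel at (x, y), or `fallback` when the
// coordinates fall outside the image. `data` is a flat buffer of w * h pixels,
// `s` bytes per pixel (s >= 3), in the same layout the loop above walks.
rgb pixel_or(const std::vector<std::uint8_t>& data,
             int w, int h, int s, int x, int y, rgb fallback)
{
    if (x < 0 || x >= w || y < 0 || y >= h) return fallback;
    std::size_t at = (static_cast<std::size_t>(y) * w + x) * s;
    return rgb{ data[at + 0], data[at + 1], data[at + 2] };
}
```

With such a helper, the lower pixel of each cell could be fetched as `pixel_or(data, w, h, s, x, y + 1, rgb{})`, so the last row of an odd-height picture is paired with black.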

There are some samples here https://gist.github.com/XVilka/8346728

@tomek-szczesny
Contributor Author

or just dissect a part of notcurses-info :)

scrn-2021-07-07-17-07-37

24-bit colors and Unicode seem to work fine, but this shows there's plenty more missing. ;)

@o-sdn-o

o-sdn-o commented Jul 7, 2021

Parsing of cursor positioning commands does not work

ESC [ row ; column H
image
image

DECSET/DECRST - ESC [ ? n h/ESC [ ? n l
image

and a couple of others
image

See
https://invisible-island.net/xterm/ctlseqs/ctlseqs.html
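
For illustration, here is a rough C++17 sketch of how sequences such as `ESC [ row ; column H` and `ESC [ ? n h` could be split into parameters and a final byte. The names `csi`, `parse_csi` and `arg_or` are made up for this example and are not part of notcurses or any other library; intermediate bytes and other corner cases are deliberately left out.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical result type: one parsed control sequence.
struct csi
{
    bool priv = false;       // true when '?' follows "ESC [" (DECSET/DECRST)
    std::vector<int> args;   // numeric parameters; -1 marks an omitted one
    char final_byte = 0;     // e.g. 'H', 'J', 'h', 'l'
    std::size_t length = 0;  // bytes consumed from the input
};

// Parses ESC [ (?) n ; n ... <final> at the start of `in`.
// Returns std::nullopt if `in` does not start with a complete sequence.
std::optional<csi> parse_csi(std::string_view in)
{
    if (in.size() < 3 || in[0] != '\033' || in[1] != '[') return std::nullopt;
    csi out;
    std::size_t i = 2;
    if (in[i] == '?') { out.priv = true; ++i; }
    int cur = -1; // -1 means "no digits seen yet for this parameter"
    for (; i < in.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (std::isdigit(c))
        {
            if (cur < 0) cur = 0;
            if (cur < 100000) cur = cur * 10 + (c - '0'); // stop growing to avoid overflow
        }
        else if (c == ';')
        {
            out.args.push_back(cur);
            cur = -1;
        }
        else if (c >= 0x40 && c <= 0x7E) // final byte range
        {
            out.args.push_back(cur);
            out.final_byte = static_cast<char>(c);
            out.length = i + 1;
            return out;
        }
        else return std::nullopt; // intermediates are out of scope for this sketch
    }
    return std::nullopt; // ran out of input before the final byte
}

// Returns parameter `n`, or `def` when it is missing or omitted.
int arg_or(const csi& s, std::size_t n, int def)
{
    return (n < s.args.size() && s.args[n] >= 0) ? s.args[n] : def;
}
```

For `H`, both parameters default to 1, so `arg_or(seq, 0, 1)` and `arg_or(seq, 1, 1)` give the row and column whether the input was `ESC [ H`, `ESC [ ; 5 H` or `ESC [ 5 ; 10 H`.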

@tomek-szczesny
Contributor Author

That is all true. Once moving the cursor around is supported, I think I'll finally be able to run htop LIVE in an ncplane! ^^

@o-sdn-o

o-sdn-o commented Jul 7, 2021

htop's first frame repertoire:

\e]0; ... \007
\e[?1049h                                                                                
\e[22;0;0t
\e[1;27r
\e(B
\e[m
\e[4l
\e[?7h
\e[?1h
\e=
\e[?25l 
\e[?1000h 
\e[2J
\e[J
\e[2d
\e[30X    
\e[K
\e[?12l
\e[?25h
\e[?1000l
\e[?1049l
\e[?1l

Essential:
\e[m \e[2J \e[J \e[2d \e[30X \e[K \e[H

@tomek-szczesny
Contributor Author

I guess you are very fond of that VTM toy of yours. :)
Thanks a lot, that's a pretty handy checklist alright.

@dankamongmen
Owner

VTM is awesome!

@o-sdn-o

o-sdn-o commented Jul 7, 2021

VTM is awesome!

Thank you! I hope this multiplexer will be useful to someone for pair programming. Its main functionality is live session sharing (via SSH or some other way).

@tomek-szczesny
Contributor Author

tomek-szczesny commented Jul 7, 2021

I have just issued the most fucked up vim command ever. And it worked as expected!
scrn-2021-07-07-18-37-42

// TODO:
//
// \e[m                 // SGR (TODO: Default argument)
// \e[2J                // Erase in display (args 0-3) 
// \e[J                 // Erase in display 0
// \e[2d                // Line Position Absolute (Default 1)
// \e[30X               // Erase 30 characters (Default 1)
// \e[K                 // Erase in line, args 0-2 (default 0)
// \e[y;xH              // Move cursor to y,x
// \e]0; ... \007       // ESC ] = OSC, terminated with BEL (0x07) or ST (0x1b \), or nothing
// \e[?1049h            // Alternative screen buffer
// \e[?1049l            // Disable alternative screen buffer
// \e[1;27r             // Set scrolling region (from, to) (default top, bottom)
// \e[4h                // Set Mode (12 = Send/Receive; 20 = automatic newline; 4 = insert mode; +1)
// \e[4l                // Reset Mode (2 = Keyboard Action Mode, 4 = Replace mode; +2)
// \e[?7h               // Auto wrap mode (DECAWM)
// \e[?25h              // Show cursor
// \e[?25l              // Hide cursor
// \e[?1000h            // Send Mouse X & Y on button press and release. This is the X11 xterm mouse protocol.
// \e[?1000l            // Don't send...
//
// Essential:
// \e[m \e[2J \e[J \e[2d \e[30X \e[K \e[H
//
// WTF SEQUENCES:
// \e=                  // Application Keypad (DECKPAM)
// \e[?1h               // Application cursor keys (DECCKM)
// \e[?1l               // Normal Cursor Keys
//
// WON'T IMPLEMENT:
// \e(B                 // G0 character set -> USASCII 
// \e[22;0;0t           // Window Manipulation (XTWINOPS)
// \e[?12l              // Start/Stop blinking cursor

What have I done...
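
A rough idea of how the "essential" entries could act on a character grid, as a C++17 sketch only: the `screen` type and its member names are invented for this example, the grid stores plain `char` cells with no colors or wide characters, and the defaults follow the list above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical minimal screen model: rows of characters plus a cursor.
struct screen
{
    int rows, cols;
    int y = 0, x = 0;                // zero-based cursor position
    std::vector<std::string> cells;

    screen(int r, int c) : rows(r), cols(c), cells(r, std::string(c, ' ')) {}

    // \e[y;xH : cursor position, one-based (defaults 1;1).
    void cursor_position(int row, int col)
    {
        y = std::clamp(row, 1, rows) - 1;
        x = std::clamp(col, 1, cols) - 1;
    }
    // \e[Nd : line position absolute, one-based (default 1); column is kept.
    void line_position(int row) { y = std::clamp(row, 1, rows) - 1; }

    // \e[NK : erase in line. 0 = cursor to end, 1 = start to cursor, 2 = whole line.
    void erase_in_line(int mode)
    {
        int from = (mode == 0) ? x : 0;
        int to   = (mode == 1) ? x + 1 : cols;
        std::fill(cells[y].begin() + from, cells[y].begin() + to, ' ');
    }
    // \e[NX : erase N characters starting at the cursor (default 1).
    void erase_chars(int n)
    {
        int to = std::min(cols, x + std::max(n, 1));
        std::fill(cells[y].begin() + x, cells[y].begin() + to, ' ');
    }
    // \e[NJ : erase in display. 0 = cursor to end, 1 = start to cursor, 2 = all.
    // Mode 3 (scrollback) is treated like 2, since this model has no scrollback.
    void erase_in_display(int mode)
    {
        if (mode == 0 || mode == 1) erase_in_line(mode);
        int first = (mode == 0) ? y + 1 : 0;
        int last  = (mode == 1) ? y : rows;
        for (int r = first; r < last; ++r) cells[r].assign(cols, ' ');
    }
};
```

A parser would map each final byte (`H`, `d`, `K`, `X`, `J`) to one of these members, passing the default whenever a parameter is omitted.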

@o-sdn-o

o-sdn-o commented Jul 7, 2021

// WTF SEQUENCES:
// \e= // Application Keypad (DECKPAM)
// \e[?1h // Application cursor keys (DECCKM)
// \e[?1l // Normal Cursor Keys

As far as I know, these sequences change the mode/format of the keystrokes that are sent by the terminal to the application.
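
To make that concrete, a small sketch of what DECCKM changes for the arrow keys: with normal cursor keys the terminal sends `ESC [ A` through `ESC [ D`, and with application cursor keys it sends `ESC O A` through `ESC O D`. The `arrow` enum and `encode_arrow` function are hypothetical names used only for this example.

```cpp
#include <string>

enum class arrow { up, down, right, left };

// Returns the bytes a terminal sends to the application for an arrow key.
// `application_mode` is true after \e[?1h (DECCKM set), false after \e[?1l.
std::string encode_arrow(arrow key, bool application_mode)
{
    static const char finals[] = { 'A', 'B', 'C', 'D' }; // up, down, right, left
    std::string out = application_mode ? "\033O" : "\033[";
    out += finals[static_cast<int>(key)];
    return out;
}
```

So an emulator hosted in an ncplane only has to remember the last `\e[?1h` or `\e[?1l` it saw and pick the matching prefix when forwarding key presses to the child process.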

@tomek-szczesny
Contributor Author

@o-sdn-o , I invite you to share your VTspeak knowledge in my dedicated repo :)
https://github.com/tomek-szczesny/notcurses-vt-proto/issues
Some issues are marked as "question" or "discussion", but feel free to explore them all or add whatever you feel is useful.
@dankamongmen, you may want to watch this repo too, don't feel ignored. :)

@o-sdn-o

o-sdn-o commented Jul 8, 2021

@dankamongmen dankamongmen modified the milestones: 3.0.0, 2.4.0 Aug 24, 2021