-
Notifications
You must be signed in to change notification settings - Fork 8.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fully regenerate CodepointWidthDetector from Unicode 13.0 #8035
Conversation
This commit also adds an override UCD and migrates all of the overrides from GetQuickCharWidth into it. GetQuickCharWidth ----------------- GetQuickCharWidth incorrectly marked 67 resrved codepoints as Wide when they should have been Narrow. This reduces the number of comparisons required for looking up a single character's width from 40 (32 individual ranged comparisons from GQCW + 8 from the binary search in CPWD) to 10 (2 from GQCW, 8 from CPWD). This change also offers us the change to document exactly why we're overriding a specific character range. The codepoints whose definitions have changed from Wide to Narrow are: ``` 2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6 2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4 2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE 2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67 ``` All of them are reserved. New in Unicde 13.0 ------------------ Some widths have changed due to previously-reserved characters becoming _used_ such U+32FF SQUARE ERA NAME REIWA, the Tangut components 756-768, the entire Khitan Small Script character set, and the Tangut Ideographs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, seems like a good idea to me
@@ -0,0 +1,9 @@ | |||
<?xml version="1.0" encoding="utf-8" standalone="yes"?> | |||
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"> | |||
<repertoire> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talk about a tough tag name to spell correctly...
@@ -354,179 +354,16 @@ std::deque<std::unique_ptr<KeyEvent>> SynthesizeNumpadEvents(const wchar_t wch, | |||
// May-01-2019 MiNiksa Forced lookup-via-renderer for retroactively recategorized emoji | |||
// that used to be narrow but now might be wide. (approx x2194-x2b55, not inclusive) | |||
// Also forced block characters segment (x2580-x259F) to narrow | |||
// Oct-25-2020 DuHowett Replaced the entire table with a set of overrides that get built into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this plan, overrides declarative not in here.
Hello @DHowett! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
🎉 Handy links: |
* Run all images through ImgBot (CC-8169) * Fix potential over/underflow as noted by "TODO:" comment (CC-8081) * Fix garbling when copying multibyte text via OSC 52 (CC-7870) * UIA: throw E_FAIL for out-of-bounds text (CC-8052) * Consider the GlyphWidth when calculate the postion of matched word in URL detecting (CC-8124) * Make the link underline less obtrusive; don't use it for pattern (GH-8148) * Fully regenerate CodepointWidthDetector from Unicode 13.0 (GH-8035) * Prepare for the primary branch name to change to main (GH-7985) * Hash the URI as part of the hyperlink ID (GH-7940) * Introduce til::presorted_static_map (GH-7640) * Prevent leftover cursor fragments when scrolling in PowerShell (CC-8173) * Add support for the DECREQTPARM report (CC-7939) * Refactor VT parameter handling (CC-7799) * Add support for the "blink" graphic rendition attribute (CC-7490) * Combine the parsing & dispatch blocks for OSC actions (CC-8202) * Add support for autodetecting URLs and making hyperlinks (CC-7691) * Copy _currentHyperlinkId when copying the buffer (CC-8074) * Fix the "visual representation" optimization for hyperlinks (CC-7738) * Optimize the binary size of the XOrg color table (CC-7929) * Add support for more OSC color formats (CC-7578) Related work items: MSFT-30259074
If anyone knows, please can you let me know which version of Unicode WIndows Terminal currently supports, thanks. |
@plastikfan do you have any reason to believe that it is not, in fact, 13.0? |
No not all, but if thats the answer then great. Please confirm |
In the issue you linked to this pull request, there is a discussion about the version of Unicode that is supported by the Windows Console. The version that Windows Terminal (and OpenConsole) supports doesn't mean anything for FluentTerm, because they are not using Windows Terminal (or OpenConsole)... so you're somewhat barking up the wrong tree in asking. However: the inbox Windows console is updated from this repository, and this change made it into builds numbered 21xxx. Before that, we supported some smattering of Unicode versions 9, 10, and 11. |
Sorry, I think you mis-intepret me. All I'm trying to do is assist in researching why emoji display is not working properly in Fluent terminal. I was asked to find out what version of Unicode Windows terminal is currently using (so that the correct so called unicode addon can be loaded into xterm) and with your info, this is done. Thank you for responding. |
If xterm.js supports Unicode 13, and Fluent Terminal is using the traditional in-box Windows console, it will disagree about the size of a character with xterm.js which will cause visual artifacts. |
…harWidth for numpad event synthesis When we encounter clipboard input that cannot be mapped to a keyboard key, we try our best to map it to an event. If if is an alphanumeric character or a wide glyph (per the old width algorithm), we will pass it through as the UnicodeChar of a key event. If it's *not* wide and *not* alphanumeric, we will synthesize a set of Alt+NumPad events. Those events comprise a set containing... 1. Alt Down (modifiers = LAlt, UnicodeChar = 0) 2. Numpad 0 Down/Up (modifiers = LAlt, UnicodeChar = 0) 3. Numpad 1 Down/Up (modifiers = LAlt, UnicodeChar = 0) 4. Numpad 2 Down/Up (modifiers = LAlt, UnicodeChar = 0) 5. Alt Up (modifiers = 0, UnicodeChar = [THE CHARACTER]) Because of event group 5, application developers seem to have taken a dependency on receiving Alt Up + Character and don't seek to recompose the original character from its numpad representation. In pull request GH-8035 (!5394370), we stripped the old width algorithm out and replaced it with a lookup table (finally!). Unfortunately, that broke clipboard input for Chinese text as it was no longer considered "Wide" for the purposes of detecting whether we should use numpad events. This commit introduces a version of GetQuickCharWidth that fulfills the exact contract CharToKeyEvents needs, and doesn't answer for a codepoint more. We'll use it in Windows to fix MSFT-32901370. The Terminal analogue of this bug, GH-9052, is fixed by *never emitting numpad events again.* We think this is okay because it looks like nobody was ever handling numpad events... and that we were only using them as a way to communicate within conhost (which we're *also* not using) and any public exposition of event 5 as a contract was unintended. VALIDATION ---------- I took this new implementation (with an early return) and the old implementation and compared whether they would yield a numpad event or a key event over the entire supported codepoint space [0000..FFFF]. They matched. Fixes MSFT-32901370 Fixes GH-9052 Retrieved from https://microsoft.visualstudio.com os.2020 OS official/rs_wdx_dxp_windev 224751bbb061f5f59d794c2c9bdac5a9674ebde6
This commit also adds an override UCD and migrates all of the overrides
from GetQuickCharWidth into it.
GetQuickCharWidth
The removal of overrides from GQCW reduces the number of comparisons
required for looking up a single character's width from 41 (32
individual ranged comparisons from GQCW + 8+1 from the binary search in
CPWD) to 11 (2 from GQCW, 8+1 from CPWD).
GQCW also incorrectly marked 67 reserved codepoints as
Wide
when theyshould have been
Narrow
.The codepoints whose definitions have changed from
Wide
toNarrow
are:All of them are reserved, but those reserved regions are marked as narrow
in the UCD.
This change also offers us the chance to document exactly why we're
overriding a specific character range. Comments from the override
document will be copied to the generated CPWD table.
New in Unicode 13.0
Some widths have changed due to previously-reserved characters becoming
used such as U+32FF SQUARE ERA NAME REIWA, the Tangut components
756-768, the entire Khitan Small Script character set, and the Tangut
Ideographs.
A number of the changes in this diff are due to better/worse comment
tracking and the removal of the Emoji/EPres comments. The script once
mistakenly applied comments to packed regions (and it has been updated
to not do so.)
Validation
I build a test application that compared codepoints 0-FFFF for GQCW
against their new registered widths.