Fully regenerate CodepointWidthDetector from Unicode 13.0 #8035

DHowett · 2020-10-25T07:47:50Z

This commit also adds an override UCD and migrates all of the overrides
from GetQuickCharWidth into it.

GetQuickCharWidth

The removal of overrides from GQCW reduces the number of comparisons
required for looking up a single character's width from 41 (32
individual ranged comparisons from GQCW + 8+1 from the binary search in
CPWD) to 11 (2 from GQCW, 8+1 from CPWD).

GQCW also incorrectly marked 67 reserved codepoints as Wide when they
should have been Narrow.

The codepoints whose definitions have changed from Wide to Narrow are:

2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6
2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4
2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE
2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F
A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67

All of them are reserved, but those reserved regions are marked as narrow
in the UCD.

This change also offers us the chance to document exactly why we're
overriding a specific character range. Comments from the override
document will be copied to the generated CPWD table.

New in Unicode 13.0

Some widths have changed due to previously-reserved characters becoming
used such as U+32FF SQUARE ERA NAME REIWA, the Tangut components
756-768, the entire Khitan Small Script character set, and the Tangut
Ideographs.

A number of the changes in this diff are due to better/worse comment
tracking and the removal of the Emoji/EPres comments. The script once
mistakenly applied comments to packed regions (and it has been updated
to not do so.)

Validation

I build a test application that compared codepoints 0-FFFF for GQCW
against their new registered widths.

This commit also adds an override UCD and migrates all of the overrides from GetQuickCharWidth into it. GetQuickCharWidth ----------------- GetQuickCharWidth incorrectly marked 67 resrved codepoints as Wide when they should have been Narrow. This reduces the number of comparisons required for looking up a single character's width from 40 (32 individual ranged comparisons from GQCW + 8 from the binary search in CPWD) to 10 (2 from GQCW, 8 from CPWD). This change also offers us the change to document exactly why we're overriding a specific character range. The codepoints whose definitions have changed from Wide to Narrow are: ``` 2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6 2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4 2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE 2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67 ``` All of them are reserved. New in Unicde 13.0 ------------------ Some widths have changed due to previously-reserved characters becoming _used_ such U+32FF SQUARE ERA NAME REIWA, the Tangut components 756-768, the entire Khitan Small Script character set, and the Tangut Ideographs.

zadjii-msft

Sure, seems like a good idea to me

miniksa · 2020-10-27T17:34:11Z

src/types/unicode_width_overrides.xml

@@ -0,0 +1,9 @@
+<?xml version="1.0" encoding="utf-8" standalone="yes"?>
+<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
+   <repertoire>


Talk about a tough tag name to spell correctly...

miniksa · 2020-10-27T17:34:40Z

src/types/convert.cpp

@@ -354,179 +354,16 @@ std::deque<std::unique_ptr<KeyEvent>> SynthesizeNumpadEvents(const wchar_t wch,
 // May-01-2019 MiNiksa  Forced lookup-via-renderer for retroactively recategorized emoji
 //                      that used to be narrow but now might be wide. (approx x2194-x2b55, not inclusive)
 //                      Also forced block characters segment (x2580-x259F) to narrow
+// Oct-25-2020 DuHowett Replaced the entire table with a set of overrides that get built into


I like this plan, overrides declarative not in here.

ghost · 2020-10-27T17:36:25Z

Hello @DHowett!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (`@msftbot`) and give me an instruction to get started! Learn more here.

ghost · 2020-11-11T18:12:44Z

🎉Windows Terminal Preview v1.5.3142.0 has been released which incorporates this pull request.:tada:

Handy links:

* Run all images through ImgBot (CC-8169) * Fix potential over/underflow as noted by "TODO:" comment (CC-8081) * Fix garbling when copying multibyte text via OSC 52 (CC-7870) * UIA: throw E_FAIL for out-of-bounds text (CC-8052) * Consider the GlyphWidth when calculate the postion of matched word in URL detecting (CC-8124) * Make the link underline less obtrusive; don't use it for pattern (GH-8148) * Fully regenerate CodepointWidthDetector from Unicode 13.0 (GH-8035) * Prepare for the primary branch name to change to main (GH-7985) * Hash the URI as part of the hyperlink ID (GH-7940) * Introduce til::presorted_static_map (GH-7640) * Prevent leftover cursor fragments when scrolling in PowerShell (CC-8173) * Add support for the DECREQTPARM report (CC-7939) * Refactor VT parameter handling (CC-7799) * Add support for the "blink" graphic rendition attribute (CC-7490) * Combine the parsing & dispatch blocks for OSC actions (CC-8202) * Add support for autodetecting URLs and making hyperlinks (CC-7691) * Copy _currentHyperlinkId when copying the buffer (CC-8074) * Fix the "visual representation" optimization for hyperlinks (CC-7738) * Optimize the binary size of the XOrg color table (CC-7929) * Add support for more OSC color formats (CC-7578) Related work items: MSFT-30259074

plastikfan · 2021-04-09T16:34:23Z

If anyone knows, please can you let me know which version of Unicode WIndows Terminal currently supports, thanks.

DHowett · 2021-04-09T16:35:21Z

@plastikfan do you have any reason to believe that it is not, in fact, 13.0?

plastikfan · 2021-04-09T16:36:22Z

No not all, but if thats the answer then great. Please confirm

DHowett · 2021-04-09T16:38:28Z

In the issue you linked to this pull request, there is a discussion about the version of Unicode that is supported by the Windows Console. The version that Windows Terminal (and OpenConsole) supports doesn't mean anything for FluentTerm, because they are not using Windows Terminal (or OpenConsole)... so you're somewhat barking up the wrong tree in asking.

However: the inbox Windows console is updated from this repository, and this change made it into builds numbered 21xxx. Before that, we supported some smattering of Unicode versions 9, 10, and 11.

plastikfan · 2021-04-09T16:47:50Z

Sorry, I think you mis-intepret me. All I'm trying to do is assist in researching why emoji display is not working properly in Fluent terminal. I was asked to find out what version of Unicode Windows terminal is currently using (so that the correct so called unicode addon can be loaded into xterm) and with your info, this is done. Thank you for responding.

DHowett · 2021-04-09T16:55:13Z

If xterm.js supports Unicode 13, and Fluent Terminal is using the traditional in-box Windows console, it will disagree about the size of a character with xterm.js which will cause visual artifacts.

…harWidth for numpad event synthesis When we encounter clipboard input that cannot be mapped to a keyboard key, we try our best to map it to an event. If if is an alphanumeric character or a wide glyph (per the old width algorithm), we will pass it through as the UnicodeChar of a key event. If it's *not* wide and *not* alphanumeric, we will synthesize a set of Alt+NumPad events. Those events comprise a set containing... 1. Alt Down (modifiers = LAlt, UnicodeChar = 0) 2. Numpad 0 Down/Up (modifiers = LAlt, UnicodeChar = 0) 3. Numpad 1 Down/Up (modifiers = LAlt, UnicodeChar = 0) 4. Numpad 2 Down/Up (modifiers = LAlt, UnicodeChar = 0) 5. Alt Up (modifiers = 0, UnicodeChar = [THE CHARACTER]) Because of event group 5, application developers seem to have taken a dependency on receiving Alt Up + Character and don't seek to recompose the original character from its numpad representation. In pull request GH-8035 (!5394370), we stripped the old width algorithm out and replaced it with a lookup table (finally!). Unfortunately, that broke clipboard input for Chinese text as it was no longer considered "Wide" for the purposes of detecting whether we should use numpad events. This commit introduces a version of GetQuickCharWidth that fulfills the exact contract CharToKeyEvents needs, and doesn't answer for a codepoint more. We'll use it in Windows to fix MSFT-32901370. The Terminal analogue of this bug, GH-9052, is fixed by *never emitting numpad events again.* We think this is okay because it looks like nobody was ever handling numpad events... and that we were only using them as a way to communicate within conhost (which we're *also* not using) and any public exposition of event 5 as a contract was unintended. VALIDATION ---------- I took this new implementation (with an early return) and the old implementation and compared whether they would yield a numpad event or a key event over the entire supported codepoint space [0000..FFFF]. They matched. Fixes MSFT-32901370 Fixes GH-9052 Retrieved from https://microsoft.visualstudio.com os.2020 OS official/rs_wdx_dxp_windev 224751bbb061f5f59d794c2c9bdac5a9674ebde6

DHowett requested a review from miniksa October 26, 2020 23:42

yes this is silly, no i do not care

ac609ea

zadjii-msft approved these changes Oct 27, 2020

View reviewed changes

miniksa reviewed Oct 27, 2020

View reviewed changes

miniksa approved these changes Oct 27, 2020

View reviewed changes

DHowett added the AutoMerge Marked for automatic merge by the bot when requirements are met label Oct 27, 2020

ghost merged commit 1df3182 into main Oct 27, 2020

ghost deleted the dev/duhowett/more_unicode branch October 27, 2020 17:36

chitoku-k mentioned this pull request Feb 5, 2021

Some Unicode characters are improperly accepted or dropped #9052

Closed

skyline75489 mentioned this pull request Mar 25, 2021

[DRAFT] Fix Chinese punctuations for vim (win32) #9613

Closed

6 tasks

plastikfan mentioned this pull request Apr 9, 2021

wcwidth rules for unicode 11 xtermjs/xterm.js#2568

Merged

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fully regenerate CodepointWidthDetector from Unicode 13.0 #8035

Fully regenerate CodepointWidthDetector from Unicode 13.0 #8035

DHowett commented Oct 25, 2020 •

edited

Loading

zadjii-msft left a comment

miniksa Oct 27, 2020

miniksa Oct 27, 2020

ghost commented Oct 27, 2020

ghost commented Nov 11, 2020

plastikfan commented Apr 9, 2021

DHowett commented Apr 9, 2021

plastikfan commented Apr 9, 2021

DHowett commented Apr 9, 2021

plastikfan commented Apr 9, 2021

DHowett commented Apr 9, 2021

Fully regenerate CodepointWidthDetector from Unicode 13.0 #8035

Fully regenerate CodepointWidthDetector from Unicode 13.0 #8035

Conversation

DHowett commented Oct 25, 2020 • edited Loading

GetQuickCharWidth

New in Unicode 13.0

Validation

zadjii-msft left a comment

Choose a reason for hiding this comment

miniksa Oct 27, 2020

Choose a reason for hiding this comment

miniksa Oct 27, 2020

Choose a reason for hiding this comment

ghost commented Oct 27, 2020

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

ghost commented Nov 11, 2020

plastikfan commented Apr 9, 2021

DHowett commented Apr 9, 2021

plastikfan commented Apr 9, 2021

DHowett commented Apr 9, 2021

plastikfan commented Apr 9, 2021

DHowett commented Apr 9, 2021

DHowett commented Oct 25, 2020 •

edited

Loading

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (`@msftbot`) and give me an instruction to get started! Learn more here.