Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully regenerate CodepointWidthDetector from Unicode 13.0 #8035

Merged
2 commits merged into from
Oct 27, 2020

Conversation

DHowett
Copy link
Member

@DHowett DHowett commented Oct 25, 2020

This commit also adds an override UCD and migrates all of the overrides
from GetQuickCharWidth into it.

GetQuickCharWidth

The removal of overrides from GQCW reduces the number of comparisons
required for looking up a single character's width from 41 (32
individual ranged comparisons from GQCW + 8+1 from the binary search in
CPWD) to 11 (2 from GQCW, 8+1 from CPWD).

GQCW also incorrectly marked 67 reserved codepoints as Wide when they
should have been Narrow.

The codepoints whose definitions have changed from Wide to Narrow are:

2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6
2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4
2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE
2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F
A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67

All of them are reserved, but those reserved regions are marked as narrow
in the UCD.

This change also offers us the chance to document exactly why we're
overriding a specific character range. Comments from the override
document will be copied to the generated CPWD table.

New in Unicode 13.0

Some widths have changed due to previously-reserved characters becoming
used such as U+32FF SQUARE ERA NAME REIWA, the Tangut components
756-768, the entire Khitan Small Script character set, and the Tangut
Ideographs.

A number of the changes in this diff are due to better/worse comment
tracking and the removal of the Emoji/EPres comments. The script once
mistakenly applied comments to packed regions (and it has been updated
to not do so.)

Validation

I build a test application that compared codepoints 0-FFFF for GQCW
against their new registered widths.

This commit also adds an override UCD and migrates all of the overrides
from GetQuickCharWidth into it.

GetQuickCharWidth
-----------------

GetQuickCharWidth incorrectly marked 67 resrved codepoints as Wide when
they should have been Narrow.

This reduces the number of comparisons required for looking up a single
character's width from 40 (32 individual ranged comparisons from GQCW +
8 from the binary search in CPWD) to 10 (2 from GQCW, 8 from CPWD).

This change also offers us the change to document exactly why we're
overriding a specific character range.

The codepoints whose definitions have changed from Wide to Narrow are:

```
2E9A 2EF4 2EF5 2EF6 2EF7 2EF8 2EF9 2EFA 2EFB 2EFC 2EFD 2EFE 2EFF 2FD6
2FD7 2FD8 2FD9 2FDA 2FDB 2FDC 2FDD 2FDE 2FDF 2FE0 2FE1 2FE2 2FE3 2FE4
2FE5 2FE6 2FE7 2FE8 2FE9 2FEA 2FEB 2FEC 2FED 2FEE 2FEF 2FFC 2FFD 2FFE
2FFF 31E4 31E5 31E6 31E7 31E8 31E9 31EA 31EB 31EC 31ED 31EE 31EF 321F
A48D A48E A48F FE1A FE1B FE1C FE1D FE1E FE1F FE53 FE67
```

All of them are reserved.

New in Unicde 13.0
------------------

Some widths have changed due to previously-reserved characters becoming
_used_ such U+32FF SQUARE ERA NAME REIWA, the Tangut components 756-768,
the entire Khitan Small Script character set, and the Tangut Ideographs.
@DHowett DHowett requested a review from miniksa October 26, 2020 23:42
Copy link
Member

@zadjii-msft zadjii-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, seems like a good idea to me

@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
<repertoire>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talk about a tough tag name to spell correctly...

@@ -354,179 +354,16 @@ std::deque<std::unique_ptr<KeyEvent>> SynthesizeNumpadEvents(const wchar_t wch,
// May-01-2019 MiNiksa Forced lookup-via-renderer for retroactively recategorized emoji
// that used to be narrow but now might be wide. (approx x2194-x2b55, not inclusive)
// Also forced block characters segment (x2580-x259F) to narrow
// Oct-25-2020 DuHowett Replaced the entire table with a set of overrides that get built into
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this plan, overrides declarative not in here.

@DHowett DHowett added the AutoMerge Marked for automatic merge by the bot when requirements are met label Oct 27, 2020
@ghost
Copy link

ghost commented Oct 27, 2020

Hello @DHowett!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@ghost ghost merged commit 1df3182 into main Oct 27, 2020
@ghost ghost deleted the dev/duhowett/more_unicode branch October 27, 2020 17:36
@ghost
Copy link

ghost commented Nov 11, 2020

🎉Windows Terminal Preview v1.5.3142.0 has been released which incorporates this pull request.:tada:

Handy links:

DHowett added a commit that referenced this pull request Nov 16, 2020
* Run all images through ImgBot (CC-8169)
* Fix potential over/underflow as noted by "TODO:" comment (CC-8081)
* Fix garbling when copying multibyte text via OSC 52 (CC-7870)
* UIA: throw E_FAIL for out-of-bounds text (CC-8052)
* Consider the GlyphWidth when calculate the postion of matched word in URL detecting (CC-8124)
* Make the link underline less obtrusive; don't use it for pattern (GH-8148)
* Fully regenerate CodepointWidthDetector from Unicode 13.0 (GH-8035)
* Prepare for the primary branch name to change to main (GH-7985)
* Hash the URI as part of the hyperlink ID (GH-7940)
* Introduce til::presorted_static_map (GH-7640)
* Prevent leftover cursor fragments when scrolling in PowerShell (CC-8173)
* Add support for the DECREQTPARM report (CC-7939)
* Refactor VT parameter handling (CC-7799)
* Add support for the "blink" graphic rendition attribute (CC-7490)
* Combine the parsing & dispatch blocks for OSC actions (CC-8202)
* Add support for autodetecting URLs and making hyperlinks (CC-7691)
* Copy _currentHyperlinkId when copying the buffer (CC-8074)
* Fix the "visual representation" optimization for hyperlinks (CC-7738)
* Optimize the binary size of the XOrg color table (CC-7929)
* Add support for more OSC color formats (CC-7578)

Related work items: MSFT-30259074
@plastikfan
Copy link

If anyone knows, please can you let me know which version of Unicode WIndows Terminal currently supports, thanks.

@DHowett
Copy link
Member Author

DHowett commented Apr 9, 2021

@plastikfan do you have any reason to believe that it is not, in fact, 13.0?

@plastikfan
Copy link

No not all, but if thats the answer then great. Please confirm

@DHowett
Copy link
Member Author

DHowett commented Apr 9, 2021

In the issue you linked to this pull request, there is a discussion about the version of Unicode that is supported by the Windows Console. The version that Windows Terminal (and OpenConsole) supports doesn't mean anything for FluentTerm, because they are not using Windows Terminal (or OpenConsole)... so you're somewhat barking up the wrong tree in asking.

However: the inbox Windows console is updated from this repository, and this change made it into builds numbered 21xxx. Before that, we supported some smattering of Unicode versions 9, 10, and 11.

@plastikfan
Copy link

Sorry, I think you mis-intepret me. All I'm trying to do is assist in researching why emoji display is not working properly in Fluent terminal. I was asked to find out what version of Unicode Windows terminal is currently using (so that the correct so called unicode addon can be loaded into xterm) and with your info, this is done. Thank you for responding.

@DHowett
Copy link
Member Author

DHowett commented Apr 9, 2021

If xterm.js supports Unicode 13, and Fluent Terminal is using the traditional in-box Windows console, it will disagree about the size of a character with xterm.js which will cause visual artifacts.

DHowett added a commit that referenced this pull request Apr 27, 2021
…harWidth for numpad event synthesis

When we encounter clipboard input that cannot be mapped to a keyboard
key, we try our best to map it to an event. If if is an alphanumeric
character or a wide glyph (per the old width algorithm), we will pass it
through as the UnicodeChar of a key event. If it's *not* wide and *not*
alphanumeric, we will synthesize a set of Alt+NumPad events.

Those events comprise a set containing...

1. Alt Down (modifiers = LAlt, UnicodeChar = 0)
2. Numpad 0 Down/Up (modifiers = LAlt, UnicodeChar = 0)
3. Numpad 1 Down/Up (modifiers = LAlt, UnicodeChar = 0)
4. Numpad 2 Down/Up (modifiers = LAlt, UnicodeChar = 0)
5. Alt Up (modifiers = 0, UnicodeChar = [THE CHARACTER])

Because of event group 5, application developers seem to have taken a
dependency on receiving Alt Up + Character and don't seek to recompose
the original character from its numpad representation.

In pull request GH-8035 (!5394370), we stripped the old width algorithm
out and replaced it with a lookup table (finally!). Unfortunately, that
broke clipboard input for Chinese text as it was no longer considered
"Wide" for the purposes of detecting whether we should use numpad
events.

This commit introduces a version of GetQuickCharWidth that fulfills the
exact contract CharToKeyEvents needs, and doesn't answer for a codepoint
more. We'll use it in Windows to fix MSFT-32901370.

The Terminal analogue of this bug, GH-9052, is fixed by *never emitting
numpad events again.* We think this is okay because it looks like nobody
was ever handling numpad events... and that we were only using them as a
way to communicate within conhost (which we're *also* not using) and any
public exposition of event 5 as a contract was unintended.

VALIDATION
----------
I took this new implementation (with an early return) and the old
implementation and compared whether they would yield a numpad event or a
key event over the entire supported codepoint space [0000..FFFF].

They matched.

Fixes MSFT-32901370
Fixes GH-9052

Retrieved from https://microsoft.visualstudio.com os.2020 OS official/rs_wdx_dxp_windev 224751bbb061f5f59d794c2c9bdac5a9674ebde6
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AutoMerge Marked for automatic merge by the bot when requirements are met
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants