Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collation: add new pinyin collation utf8mb4_zh_pinyin_tidb_as_cs #1058

Merged
merged 6 commits into from
Oct 22, 2020

Conversation

xiongjiwei
Copy link
Contributor

@xiongjiwei xiongjiwei commented Oct 16, 2020

What problem does this PR solve?

add a new collation named utf8mb4_zh_pinyin_tidb_as_cs which has collation id 2048
proposal pingcap/tidb#19984
tidb PR pingcap/tidb#20504

What is changed and how it works?

a new method to get collation by given id.

Check List

Tests

  • Unit test
  • Integration test

Code changes

  • Has exported function/method change

Copy link
Member

@wjhuang2016 wjhuang2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about rename utf8mb4_general_zh_ci to utf8mb4_zh_ci

@kennytm
Copy link
Contributor

kennytm commented Oct 20, 2020

if we want to follow MySQL's naming convention it should be utf8mb4_chinese_ci (similar to big5_chinese_ci, gb18030_chinese_ci, gb2312_chinese_ci, gb2312_chinese_ci).

(note that CLDR also defined the "stroke" and "zhuyin" order, and we could have utf8mb4_chinese_stroke_ci etc.)

@wjhuang2016
Copy link
Member

I don't want to follow MySQL's old naming convention, zh is more standard than chinese.
As you can see in the new collation introduced in 8.0, utf8mb4_zh_0900_as_cs uses zh.
Consider that CLDR has defined "pinyin", "stroke" and "zhuyin", we can use "utf8mb4_zh_pinyin", "utf8mb4_zh_stroke", "utf8mb4_zh_zhuyin". If we want to distinguish case-sensitive, we can add extra '_cs' or '_ci'.

@kennytm
Copy link
Contributor

kennytm commented Oct 20, 2020

🤷

however this collation is explicitly not following (MySQL's implementation of) UCA, otherwise we will be implementing pingcap/tidb#19747.

given that the naming convention of all 8.0 collations are <charset>[_<language>[_<variant>]]_<unicodeversion>(_<attribute>)+, perhaps

  • utf8mb4_zh_pinyin_tidb0500_ci
  • utf8mb4_zh_tidb0500_ci ?
  • (utf8mb4_zh_stroke_tidb0500_ci)

@xiongjiwei xiongjiwei changed the title collation: add new pinyin collation utf8mb4_general_zh_ci collation: add new pinyin collation utf8mb4_tidb_zh_pinyin_cs Oct 22, 2020
Copy link
Member

@wjhuang2016 wjhuang2016 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-srebot ti-srebot added the status/LGT1 LGT1 label Oct 22, 2020
Copy link
Contributor

@kennytm kennytm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming it is a typo, rest LGTM.

(btw, are we sure we want case-sensitive ('a' ≠ 'A') not case-insensitive ('a' = 'A') here?)

@xiongjiwei xiongjiwei changed the title collation: add new pinyin collation utf8mb4_tidb_zh_pinyin_cs collation: add new pinyin collation utf8mb4_zh_tidb_pinyin_cs Oct 22, 2020
kennytm
kennytm previously approved these changes Oct 22, 2020
@ti-srebot ti-srebot removed the status/LGT1 LGT1 label Oct 22, 2020
ti-srebot
ti-srebot previously approved these changes Oct 22, 2020
@ti-srebot ti-srebot added the status/LGT2 LGT2 label Oct 22, 2020
@xiongjiwei xiongjiwei dismissed stale reviews from ti-srebot and kennytm via 959ae44 October 22, 2020 10:39
@xiongjiwei xiongjiwei changed the title collation: add new pinyin collation utf8mb4_zh_tidb_pinyin_cs collation: add new pinyin collation utf8mb4_zh_pinyin_tidb_as_cs Oct 22, 2020
@kennytm kennytm requested a review from bb7133 October 22, 2020 10:42
@bb7133
Copy link
Member

bb7133 commented Oct 22, 2020

LGTM

Thanks for your suggesion @kennytm

@ti-srebot ti-srebot removed the status/LGT2 LGT2 label Oct 22, 2020
@ti-srebot ti-srebot added the status/LGT3 LGT3. This PR looks very good to our bot. label Oct 22, 2020
@bb7133 bb7133 merged commit db4c0c5 into pingcap:master Oct 22, 2020
@xiongjiwei xiongjiwei deleted the zh_ci branch October 23, 2020 02:45
tiancaiamao pushed a commit to tiancaiamao/parser that referenced this pull request Apr 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution status/LGT3 LGT3. This PR looks very good to our bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants