Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diacritical marks are ignored when differientiating #169

Closed
hadithmv opened this issue Oct 2, 2022 · 8 comments
Closed

Diacritical marks are ignored when differientiating #169

hadithmv opened this issue Oct 2, 2022 · 8 comments

Comments

@hadithmv
Copy link

hadithmv commented Oct 2, 2022

While mergely does not by default ignore things like whitespace, case and accents, it does however ignore diacritical marks.

I've included a picture as well as a diff file that shows this for a couple of languages:

image
RL2vASt1.zip

@hadithmv
Copy link
Author

hadithmv commented Oct 2, 2022

It seems I can only download a single diff file from the site, yet import requires two separate files for two different sides. So instead here's a link: https://editor.mergely.com/UsHJyREC/

@hadithmv
Copy link
Author

hadithmv commented Oct 2, 2022

Or because merge left/right arrows are there, it does know that there are differences, just doesn't visually show them with colors as expected.

While the diacritic mark is small by itself, in actual use it always goes on top of or under another letter. So ideally, both the diacritic and the letter it is on should be highlighted during differentiation. Otherwise, the word.

@wickedest
Copy link
Owner

@hadithmv, thanks for reporting the issue. there are actually 2 separate diffs being run. The 1st is by line, and only detects lines that are different. The 2nd is by character, within the line. It appears the 2nd diff may be failing diacritical marks for some reason.

@wickedest wickedest added the bug label Oct 3, 2022
@hadithmv
Copy link
Author

hadithmv commented Jun 11, 2024

Its been a while, and I don't know if this would be of help, but,

image

This is how vscode's shows it on their diff compare, it highlights the exact place where the change occurs. I thought that since it was open source, it might be possible to check out how they are getting this to work.

@hadithmv
Copy link
Author

here's how another diff compare tool fixed this particular issue:

https://gitlab.gnome.org/GNOME/meld/-/commit/d272016a481a9f9563d318a79ce9767cb8ad69ef

might be of help.

@wickedest
Copy link
Owner

Sorry - this has been on the bottom of the pile for a while.

This is a tough one. Normally, diacriticals can be combined into a single character, e.g. a + ´ = á. This can be done with String.normalize("NFC"), but normalize does not work in this case.

For example, if we look at كلمة and كَلمة:

image

I am not a unicode expert, so I'm a bit lost because while it behaves like a diacritical, as far as I can tell, \u064E is something else. I understand diacriticals to be in these ranges (source):

  • 0300-036F
  • 1AB0-1AFF
  • 1DC0-1DFF
  • 20D0-20FF
  • 2DE2-2DFF
  • FE20-FE2F

\u064E is not in those ranges and is identified as an "Arabic Fatha". Unlike \u0300 is a "Diacritical".

My guess is that the Fatha cannot be combined into a single character using normalize and whenever Mergely encounters these Fatha's, then it should include the adjacent character in the diff (i.e. ['0643', '064E']). But that isn't going to be easy. For starters, outside of the ranges above, I have no idea how many unicode characters behave this way.

@wickedest
Copy link
Owner

wickedest commented Jun 15, 2024

The problem seems to be specifically with nonspacing marks. The diacritical might only be a subset. Only a small number of "common" marks can be normalized into a single character. So the issue here is handling the situation when it cannot be normalized (which I think would happen a lot).

github-actions bot pushed a commit that referenced this issue Jun 16, 2024
# [5.3.0](v5.2.0...v5.3.0) (2024-06-16)

### Features

* Supports unicode diacritical marks when rendering line diff (fixes [#169](#169)) ([#197](#197)) ([a469a65](a469a65))
Copy link

🎉 This issue has been resolved in version 5.3.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants