Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TNs with an OrigL quote that ends in a single quote mark fail. #7156

Closed
birchamp opened this issue Jun 28, 2021 · 43 comments
Closed

TNs with an OrigL quote that ends in a single quote mark fail. #7156

birchamp opened this issue Jun 28, 2021 · 43 comments

Comments

@birchamp
Copy link
Contributor

If the OrigL quote ends with a u2019 , tC reports that there is a problem with the check. (for instance: ἀλλ’ in 2 Timothy 3:9 (ID:x9kx) ). tC should be able to handle checks where the quote ends with u2019. For more info see: https://unfoldingword.zulipchat.com/#narrow/stream/209457-SOFTWARE--.20UR/topic/Helpdesk/near/243565499

DoD: The checks in the picture below do not fail due to the single quote character at the end of the field.
image.png

@birchamp birchamp changed the title TNs with a OrigL quote that ends in a single quote fail. TNs with an OrigL quote that ends in a single quote mark fail. Jun 28, 2021
@da1nerd
Copy link
Contributor

da1nerd commented Jun 28, 2021

The is related to #7129

@da1nerd
Copy link
Contributor

da1nerd commented Jun 28, 2021

I gave this a 5 estimate, but if the tokenizer has already been updated, this could be a 1.

@da1nerd
Copy link
Contributor

da1nerd commented Jun 28, 2021

updating the string punctuation tokenizer to the latest version did not solve the problem. However, after testing the tokenizer on it's own, it does correctly tokenize the quote mark.

@da1nerd
Copy link
Contributor

da1nerd commented Jun 28, 2021

Here's where the check validation occurs. https://github.com/unfoldingWord/checking-tool-wrapper/blob/5f0bf79d2f3bb397a0c057db2ebdd7121c7a2ef8/src/components/VerseCheckWrapper.js#L120
still trying to figure out how the quote fits into this.

@da1nerd
Copy link
Contributor

da1nerd commented Jul 27, 2021

@birchamp I've tracked this bug down to the usfm. See this line https://git.door43.org/Door43-Catalog/en_ult/src/branch/master/56-2TI.usfm#L1040

Notice x-content="ἀλλ’" is including the u2019 character. This is not a parsing problem with tC, but a bug in the USFM.

@RobH123
Copy link

RobH123 commented Jul 27, 2021

Not sure why you think that's a content error @da1nerd ? That's Greek "alla" abbreviated to "all'". A bit like the English possessive in "Jesus'" "hand" (but in our files we use the correct single closing quote character for apostrophe).

Ah, maybe you thought that that apostrophe is sentence level punctuation??? It's part of the word indicating the elided/omitted final alpha.

@da1nerd
Copy link
Contributor

da1nerd commented Jul 27, 2021

@RobH123 is it correct to say that u2019 should always be considered a word character instead of punctuation?

@RobH123
Copy link

RobH123 commented Jul 27, 2021

Ha, no, sorry. Unicode is far from perfect and this is a good example. It's also used as a single closing quote. So ambiguously used at both levels (just to make it harder for programmers) @da1nerd. (Well, not really ambiguous, but you usually have to parse the entire sentence to distinguish them.)

@da1nerd
Copy link
Contributor

da1nerd commented Jul 27, 2021

you just made this issue a whole lot more complicated 😆

Given a full sentence, how can I know how to parse it correctly? Does this situation apply to other languages as well or just to the Greek?

@RobH123
Copy link

RobH123 commented Jul 28, 2021

Yeah, parsing natural language is an ugly sport. If would certainly be easier if Unicode recommended a different apostrophe character, but apparently one of the initial Unicode policies was "unifying" (rather than "distinguishing"). There is u+02BC which was a (not ideal) possibility to change to, but inconvenient in other ways. Anyway, for English (and Koine Greek too I guess) the long and short of it is: if there's no error in the text, a single closing quote will normally follow sometime after a single opening quote. If it doesn't, and if it appears to be inside or at the beginning or end of a word (i.e., not immediately after punctuation like comma or period, etc.), then it's probably an apostrophe and hence effectively part of the word. (Hope you've still got plenty of hair to pull out.) :-)

@birchamp
Copy link
Contributor Author

Since the initial punctuation usually isn't included in a quote like that, @RobH123 can we make a rule that aligning sentence-level punctuation at the end of a OrigL quote is unacceptable?

@RobH123
Copy link

RobH123 commented Jul 28, 2021

@birchamp Yes, I think that's already the rule of thumb, but I can write that specifically into the spec. Hmmh, but then maybe not. What if we wanted a note on the following snippet (pretend it's Greek): Jesus said, “Come!” -- could hardly leave off those final two characters and force the snippet to be: Jesus said, “Come

@birchamp
Copy link
Contributor Author

What if the rule was that the punctuation should not be included unless the initial punctuation is within the quote.

@RobH123
Copy link

RobH123 commented Jul 29, 2021

Ha, I'd guessed that you were going to suggest that (should have suggested it myself). Yes, seems sensible enough to me @birchamp, although you should at least consult Larry S on Zulip. It doesn't making the parsing trivial, but should at least make it mostly sensible (hopefully). :-)

I guess the rule might be something like: "Only include paired sentence punctuation in a origL snippet if both sides of the pair are included." (Sadly I can think of examples where it might not be possible, but Larry and others would likely know if they actually occur in practice.)

@birchamp
Copy link
Contributor Author

I was thinking it should read something like, "An origL quote should not end in sentence level punctuation unless both pieces of paired punctuation are included in the quote."

@da1nerd
Copy link
Contributor

da1nerd commented Jul 29, 2021

@birchamp how should I proceed with this?

Without improving the text parsing, the only other solution I see right now is to update https://github.com/unfoldingWord/tc-ui-toolkit/blob/develop/src/VerseCheck/helpers/checkAreaHelpers.js#L28 so that it flattens the verse objects and compares them as a string instead of piece by piece so that in this case it doesn't matter it the quote mark is split off the word. However, I'm not certain if that would have negative side effects. @mannycolon or @PhotoNomad0 might know.

@birchamp
Copy link
Contributor Author

@da1nerd @RobH123 - Ok Per Larry the Greek and Hebrew do not contain quotation marks therefore if the text is an original quote then we can assume that the " ' " u2019 is a needed apostrophe even at the end of an Original language quote.

However, @da1nerd do you know if you're using the same code to handle/normalize GL quotes and Original quotes?

@da1nerd
Copy link
Contributor

da1nerd commented Jul 30, 2021

that appears to be the case.

@birchamp
Copy link
Contributor Author

Hmmm...that's going to be a problem because we don't want to have this affect how we handle single quotation marks, but in Koine Greek they are apostrophes

@birchamp
Copy link
Contributor Author

@da1nerd did we ever come to a conclusion on this?

@da1nerd
Copy link
Contributor

da1nerd commented Aug 17, 2021

No. Perhaps I misunderstood, but I thought there was going to be some more content discussion on this. Either way, I'll need some direction on how we want to proceed.

@birchamp
Copy link
Contributor Author

Yes, @da1nerd ok per https://unfoldingword.zulipchat.com/#narrow/stream/209457-SOFTWARE--.20UR/topic/Helpdesk/near/247737526 we've come to the conclusion that the apostrophe should be shown in original languages even if it's at the end of a field.

@da1nerd
Copy link
Contributor

da1nerd commented Aug 24, 2021

ok I've added a tokenizeOrigLang method that is specialized for original languages. We'll need to use this anywhere we are tokenizing an original language.

@da1nerd
Copy link
Contributor

da1nerd commented Aug 27, 2021

I have a few more modules to update and PRs to make, but it appears that this is working now.
image

@RobH123
Copy link

RobH123 commented Aug 28, 2021

@da1nerd I was just thinking about this today. There's no reason (other than wait times) why the Content Validation can't check every OrigL quote by running your actual JS library functions (alongside my specific checks) if there's a way for me to determine failure. In other words, the CV package could verify our data by running the actual code that our tools run. I would need help knowing what libraries (and what parameters) our tools use. It probably won't give very helpful error messages about what is wrong (data or software), but on the positive side, should be able to verify that our software can process all of our actual data. What do you think? @birchamp also.

@birchamp
Copy link
Contributor Author

@RobH123 I think that's a great idea. It would create a very valuable feedback loop.

@da1nerd
Copy link
Contributor

da1nerd commented Aug 30, 2021

@RobH123 that would be handy.

@RobH123
Copy link

RobH123 commented Aug 31, 2021

Can you guide me @da1nerd on where to find the function to call (and if not obvious, how to determine if it succeeds/fails)? Is it on npmjs?

@da1nerd
Copy link
Contributor

da1nerd commented Sep 1, 2021

@birchamp I discovered some html line breaks in some of the translation notes.
This caused the windows build of tC to fail. We could probably update the content downloader to sanitize the data, but I feel like that would be a poor choice because the bad data would still be out there for other people to run into. Might be better to do what @RobH123 suggested and set up a feedback loop.

Note: linux and mac builds didn't fail because the <br> characters are valid for use in file names. So it actually created that file, but that's an error as well.

image

Here's the offending file https://git.door43.org/unfoldingWord/en_tn/src/tag/v52/en_tn_09-1SA.tsv
If you simply try looking at the rendered view in the repo you won't see the <br> tags, but if you view the raw source, or open it in excel you'll see them there.

You can see the result of this bug in the linux app
image

@RobH123
Copy link

RobH123 commented Sep 2, 2021

Never seen this before, but since it crashes your build, I added a specific high-priority check to https://unfoldingword.github.io/uw-content-validation/#/Book%20Package%20Checker%20Demo?id=bookpackagecheck (set BP to 1SA and disableLinkFetching to view it -- takes about 4 mins).

But I suspect that it was probably a manual error by a human editor that won't happen again. I can fix the file and republish my tomorrow if you like?

@da1nerd
Copy link
Contributor

da1nerd commented Sep 2, 2021

Cool, just let me know when it's done so I can restart the build.

In answer to your earlier question @RobH123 . I think you'd probably need to use https://github.com/unfoldingWord/tsv-groupdata-parser and pass in your tsv data to it.

If you did so then you could perform a number of assertions on the output to make sure it looks the way you expect it to.

@birchamp
Copy link
Contributor Author

birchamp commented Sep 9, 2021

@richmahn @da1nerd what's the status on this?

@da1nerd
Copy link
Contributor

da1nerd commented Sep 11, 2021

@birchamp I hadn't heard yet if the tN had been updated. The builds were failing due to the content bug described here. #7156 (comment)

@RobH123 do you know if tN was updated?

@RobH123
Copy link

RobH123 commented Sep 16, 2021

Ok, got distracted sorry, but published en_tn v53 just now @da1nerd. Hopefully it will build ok for you now.

@da1nerd
Copy link
Contributor

da1nerd commented Sep 17, 2021

I've restarted the builds. We'll see!

@mannycolon
Copy link
Contributor

@birchamp I discovered some html line breaks in some of the translation notes.
This caused the windows build of tC to fail. We could probably update the content downloader to sanitize the data, but I feel like that would be a poor choice because the bad data would still be out there for other people to run into. Might be better to do what @RobH123 suggested and set up a feedback loop.

Note: linux and mac builds didn't fail because the <br> characters are valid for use in file names. So it actually created that file, but that's an error as well.

image

Here's the offending file https://git.door43.org/unfoldingWord/en_tn/src/tag/v52/en_tn_09-1SA.tsv
If you simply try looking at the rendered view in the repo you won't see the <br> tags, but if you view the raw source, or open it in excel you'll see them there.

You can see the result of this bug in the linux app
image

Great catch @da1nerd

@mannycolon
Copy link
Contributor

Great job guys! I just merged the PR. Thank you.

@elsylambert
Copy link

Works fine in Eph: Tns with OrigL quote ending with a single quote mark highlighted correctly. Some of the Notes given in the example are missing in the project.

@gayathrihayden
Copy link

Tested in tC 3.0.2 (2ede16e):

No longer see an alert for the reference 2 Timothy 3:9 that has a single quote mark "ἀλλ’" (specifically ’ right single quotation Mark U+2019)

2Ti_ch3_vs9

This fix also resolves this issue https://git.door43.org/unfoldingWord/en_tn/issues/1540 - tC Ephesians 6:6 Alert Issue for the sub-category "Connect - Contrast Relationship" in tN #1540 where alerts are no longer seen for Eph 6:6, 5:15, Eph 5:27 and Eph 4:29.

Eph_4_29

@elsylambert
Copy link

Looks good in translationCore 3.0.2 (38258ea).

@gayathrihayden
Copy link

Fix tested in tC 3.0.2 build 18ad847 and works fine.
Used 2 Tim 3:9, Eph 4:29; 5:15, 27; 6:6 as examples.

@PhotoNomad0
Copy link
Contributor

PhotoNomad0 commented Dec 2, 2021

Seeing problem in tC 3.0.2 with migrating old selections due to the above quote fix. Due to the change made in how we handle the u2019 ’ character, the quote arrays are generated differently. So any tn selections that have multiple checks with the same reference, groupID, occurrence, and that have that mark in them run the risk of being removed or have the wrong quote applied to them.

The old parsing of quotes looks like this (example from https://git.door43.org/India_BCS/kn_kglt_2jn_book):

Screen Shot 2021-12-02 at 2 36 51 PM

It mistakenly made the accent a new word. The new parsing looks like this:

Screen Shot 2021-12-02 at 2 37 22 PM

Unfortunately this makes our existing migration bug (#7198) even worse.

It does seem to be a small migration fix at first look.

@PhotoNomad0
Copy link
Contributor

This is being fixed as part of #7198

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants