TNs with an OrigL quote that ends in a single quote mark fail. #7156

birchamp · 2021-06-28T18:36:45Z

If the OrigL quote ends with a u2019 ’, tC reports that there is a problem with the check. (for instance: ἀλλ’ in 2 Timothy 3:9 (ID:x9kx) ). tC should be able to handle checks where the quote ends with u2019. For more info see: https://unfoldingword.zulipchat.com/#narrow/stream/209457-SOFTWARE--.20UR/topic/Helpdesk/near/243565499

DoD: The checks in the picture below do not fail due to the single quote character at the end of the field.

The text was updated successfully, but these errors were encountered:

da1nerd · 2021-06-28T22:23:13Z

The is related to #7129

da1nerd · 2021-06-28T22:25:08Z

I gave this a 5 estimate, but if the tokenizer has already been updated, this could be a 1.

da1nerd · 2021-06-28T23:10:07Z

updating the string punctuation tokenizer to the latest version did not solve the problem. However, after testing the tokenizer on it's own, it does correctly tokenize the quote mark.

da1nerd · 2021-06-28T23:50:48Z

Here's where the check validation occurs. https://github.com/unfoldingWord/checking-tool-wrapper/blob/5f0bf79d2f3bb397a0c057db2ebdd7121c7a2ef8/src/components/VerseCheckWrapper.js#L120
still trying to figure out how the quote fits into this.

da1nerd · 2021-07-27T21:23:16Z

@birchamp I've tracked this bug down to the usfm. See this line https://git.door43.org/Door43-Catalog/en_ult/src/branch/master/56-2TI.usfm#L1040

Notice x-content="ἀλλ’" is including the u2019 character. This is not a parsing problem with tC, but a bug in the USFM.

RobH123 · 2021-07-27T21:58:45Z

Not sure why you think that's a content error @da1nerd ? That's Greek "alla" abbreviated to "all'". A bit like the English possessive in "Jesus'" "hand" (but in our files we use the correct single closing quote character for apostrophe).

Ah, maybe you thought that that apostrophe is sentence level punctuation??? It's part of the word indicating the elided/omitted final alpha.

da1nerd · 2021-07-27T22:21:55Z

@RobH123 is it correct to say that u2019 should always be considered a word character instead of punctuation?

RobH123 · 2021-07-27T22:32:16Z

Ha, no, sorry. Unicode is far from perfect and this is a good example. It's also used as a single closing quote. So ambiguously used at both levels (just to make it harder for programmers) @da1nerd. (Well, not really ambiguous, but you usually have to parse the entire sentence to distinguish them.)

da1nerd · 2021-07-27T23:56:41Z

you just made this issue a whole lot more complicated 😆

Given a full sentence, how can I know how to parse it correctly? Does this situation apply to other languages as well or just to the Greek?

RobH123 · 2021-07-28T00:12:16Z

Yeah, parsing natural language is an ugly sport. If would certainly be easier if Unicode recommended a different apostrophe character, but apparently one of the initial Unicode policies was "unifying" (rather than "distinguishing"). There is u+02BC which was a (not ideal) possibility to change to, but inconvenient in other ways. Anyway, for English (and Koine Greek too I guess) the long and short of it is: if there's no error in the text, a single closing quote will normally follow sometime after a single opening quote. If it doesn't, and if it appears to be inside or at the beginning or end of a word (i.e., not immediately after punctuation like comma or period, etc.), then it's probably an apostrophe and hence effectively part of the word. (Hope you've still got plenty of hair to pull out.) :-)

birchamp · 2021-07-28T17:53:20Z

Since the initial punctuation usually isn't included in a quote like that, @RobH123 can we make a rule that aligning sentence-level punctuation at the end of a OrigL quote is unacceptable?

RobH123 · 2021-07-28T23:13:19Z

@birchamp Yes, I think that's already the rule of thumb, but I can write that specifically into the spec. Hmmh, but then maybe not. What if we wanted a note on the following snippet (pretend it's Greek): Jesus said, “Come!” -- could hardly leave off those final two characters and force the snippet to be: Jesus said, “Come

birchamp · 2021-07-29T00:35:17Z

What if the rule was that the punctuation should not be included unless the initial punctuation is within the quote.

RobH123 · 2021-07-29T00:44:26Z

Ha, I'd guessed that you were going to suggest that (should have suggested it myself). Yes, seems sensible enough to me @birchamp, although you should at least consult Larry S on Zulip. It doesn't making the parsing trivial, but should at least make it mostly sensible (hopefully). :-)

I guess the rule might be something like: "Only include paired sentence punctuation in a origL snippet if both sides of the pair are included." (Sadly I can think of examples where it might not be possible, but Larry and others would likely know if they actually occur in practice.)

birchamp · 2021-07-29T01:11:21Z

I was thinking it should read something like, "An origL quote should not end in sentence level punctuation unless both pieces of paired punctuation are included in the quote."

da1nerd · 2021-07-29T20:16:58Z

@birchamp how should I proceed with this?

Without improving the text parsing, the only other solution I see right now is to update https://github.com/unfoldingWord/tc-ui-toolkit/blob/develop/src/VerseCheck/helpers/checkAreaHelpers.js#L28 so that it flattens the verse objects and compares them as a string instead of piece by piece so that in this case it doesn't matter it the quote mark is split off the word. However, I'm not certain if that would have negative side effects. @mannycolon or @PhotoNomad0 might know.

birchamp · 2021-07-30T18:21:28Z

@da1nerd @RobH123 - Ok Per Larry the Greek and Hebrew do not contain quotation marks therefore if the text is an original quote then we can assume that the " ' " u2019 is a needed apostrophe even at the end of an Original language quote.

However, @da1nerd do you know if you're using the same code to handle/normalize GL quotes and Original quotes?

da1nerd · 2021-07-30T21:04:20Z

that appears to be the case.

birchamp · 2021-07-30T21:23:06Z

Hmmm...that's going to be a problem because we don't want to have this affect how we handle single quotation marks, but in Koine Greek they are apostrophes

birchamp · 2021-08-17T21:44:38Z

@da1nerd did we ever come to a conclusion on this?

da1nerd · 2021-08-17T22:41:02Z

No. Perhaps I misunderstood, but I thought there was going to be some more content discussion on this. Either way, I'll need some direction on how we want to proceed.

birchamp · 2021-08-22T20:16:14Z

Yes, @da1nerd ok per https://unfoldingword.zulipchat.com/#narrow/stream/209457-SOFTWARE--.20UR/topic/Helpdesk/near/247737526 we've come to the conclusion that the apostrophe should be shown in original languages even if it's at the end of a field.

da1nerd · 2021-08-24T00:15:27Z

ok I've added a tokenizeOrigLang method that is specialized for original languages. We'll need to use this anywhere we are tokenizing an original language.

da1nerd · 2021-08-27T21:36:48Z

I have a few more modules to update and PRs to make, but it appears that this is working now.

RobH123 · 2021-08-28T21:32:31Z

@da1nerd I was just thinking about this today. There's no reason (other than wait times) why the Content Validation can't check every OrigL quote by running your actual JS library functions (alongside my specific checks) if there's a way for me to determine failure. In other words, the CV package could verify our data by running the actual code that our tools run. I would need help knowing what libraries (and what parameters) our tools use. It probably won't give very helpful error messages about what is wrong (data or software), but on the positive side, should be able to verify that our software can process all of our actual data. What do you think? @birchamp also.

birchamp · 2021-08-30T17:55:42Z

@RobH123 I think that's a great idea. It would create a very valuable feedback loop.

da1nerd · 2021-08-30T23:55:22Z

@RobH123 that would be handy.

RobH123 · 2021-08-31T02:45:00Z

Can you guide me @da1nerd on where to find the function to call (and if not obvious, how to determine if it succeeds/fails)? Is it on npmjs?

da1nerd · 2021-09-01T23:22:58Z

@birchamp I discovered some html line breaks in some of the translation notes.
This caused the windows build of tC to fail. We could probably update the content downloader to sanitize the data, but I feel like that would be a poor choice because the bad data would still be out there for other people to run into. Might be better to do what @RobH123 suggested and set up a feedback loop.

Note: linux and mac builds didn't fail because the <br> characters are valid for use in file names. So it actually created that file, but that's an error as well.

Here's the offending file https://git.door43.org/unfoldingWord/en_tn/src/tag/v52/en_tn_09-1SA.tsv
If you simply try looking at the rendered view in the repo you won't see the <br> tags, but if you view the raw source, or open it in excel you'll see them there.

You can see the result of this bug in the linux app

RobH123 · 2021-09-02T04:30:36Z

Never seen this before, but since it crashes your build, I added a specific high-priority check to https://unfoldingword.github.io/uw-content-validation/#/Book%20Package%20Checker%20Demo?id=bookpackagecheck (set BP to 1SA and disableLinkFetching to view it -- takes about 4 mins).

But I suspect that it was probably a manual error by a human editor that won't happen again. I can fix the file and republish my tomorrow if you like?

da1nerd · 2021-09-02T23:51:12Z

Cool, just let me know when it's done so I can restart the build.

In answer to your earlier question @RobH123 . I think you'd probably need to use https://github.com/unfoldingWord/tsv-groupdata-parser and pass in your tsv data to it.

If you did so then you could perform a number of assertions on the output to make sure it looks the way you expect it to.

birchamp · 2021-09-09T13:23:31Z

@richmahn @da1nerd what's the status on this?

da1nerd · 2021-09-11T06:46:00Z

@birchamp I hadn't heard yet if the tN had been updated. The builds were failing due to the content bug described here. #7156 (comment)

@RobH123 do you know if tN was updated?

RobH123 · 2021-09-16T23:18:01Z

Ok, got distracted sorry, but published en_tn v53 just now @da1nerd. Hopefully it will build ok for you now.

da1nerd · 2021-09-17T17:53:19Z

I've restarted the builds. We'll see!

mannycolon · 2021-09-24T14:20:11Z

@birchamp I discovered some html line breaks in some of the translation notes.
This caused the windows build of tC to fail. We could probably update the content downloader to sanitize the data, but I feel like that would be a poor choice because the bad data would still be out there for other people to run into. Might be better to do what @RobH123 suggested and set up a feedback loop.

Note: linux and mac builds didn't fail because the <br> characters are valid for use in file names. So it actually created that file, but that's an error as well.

Here's the offending file https://git.door43.org/unfoldingWord/en_tn/src/tag/v52/en_tn_09-1SA.tsv
If you simply try looking at the rendered view in the repo you won't see the <br> tags, but if you view the raw source, or open it in excel you'll see them there.

You can see the result of this bug in the linux app

Great catch @da1nerd

mannycolon · 2021-09-24T14:20:34Z

Great job guys! I just merged the PR. Thank you.

elsylambert · 2021-11-05T18:00:52Z

Works fine in Eph: Tns with OrigL quote ending with a single quote mark highlighted correctly. Some of the Notes given in the example are missing in the project.

gayathrihayden · 2021-11-05T18:12:33Z

Tested in tC 3.0.2 (2ede16e):

No longer see an alert for the reference 2 Timothy 3:9 that has a single quote mark "ἀλλ’" (specifically ’ right single quotation Mark U+2019)

This fix also resolves this issue https://git.door43.org/unfoldingWord/en_tn/issues/1540 - tC Ephesians 6:6 Alert Issue for the sub-category "Connect - Contrast Relationship" in tN #1540 where alerts are no longer seen for Eph 6:6, 5:15, Eph 5:27 and Eph 4:29.

elsylambert · 2021-11-09T20:48:10Z

Looks good in translationCore 3.0.2 (38258ea).

gayathrihayden · 2021-11-15T22:15:54Z

Fix tested in tC 3.0.2 build 18ad847 and works fine.
Used 2 Tim 3:9, Eph 4:29; 5:15, 27; 6:6 as examples.

PhotoNomad0 · 2021-12-02T19:49:08Z

Seeing problem in tC 3.0.2 with migrating old selections due to the above quote fix. Due to the change made in how we handle the u2019 ’ character, the quote arrays are generated differently. So any tn selections that have multiple checks with the same reference, groupID, occurrence, and that have that mark in them run the risk of being removed or have the wrong quote applied to them.

The old parsing of quotes looks like this (example from https://git.door43.org/India_BCS/kn_kglt_2jn_book):

It mistakenly made the accent a new word. The new parsing looks like this:

Unfortunately this makes our existing migration bug (#7198) even worse.

It does seem to be a small migration fix at first look.

PhotoNomad0 · 2021-12-04T11:07:46Z

This is being fixed as part of #7198

birchamp added Priority/High Status/NeedsDefinition labels Jun 28, 2021

birchamp assigned da1nerd Jun 28, 2021

birchamp changed the title ~~TNs with a OrigL quote that ends in a single quote fail.~~ TNs with an OrigL quote that ends in a single quote mark fail. Jun 28, 2021

da1nerd mentioned this issue Aug 24, 2021

added specialized tokenizer for original languages unfoldingWord/string-punctuation-tokenizer#73

Merged

da1nerd mentioned this issue Aug 25, 2021

use specialized tokenizer for original languages unfoldingWord/tsv-groupdata-parser#48

Merged

da1nerd removed the Status/NeedsDefinition label Aug 25, 2021

da1nerd mentioned this issue Aug 27, 2021

Updated tokenization unfoldingWord/tc-source-content-updater#131

Merged

da1nerd mentioned this issue Aug 27, 2021

update parsing support for original languages #7163

Merged

16 tasks

da1nerd mentioned this issue Sep 1, 2021

Build tC using the latest String-Punctuation-tokenizer #7129

Closed

elsylambert added the QA/Pass label Nov 9, 2021

elsylambert closed this as completed Aug 18, 2022

TNs with an OrigL quote that ends in a single quote mark fail. #7156

TNs with an OrigL quote that ends in a single quote mark fail. #7156

Comments

birchamp commented Jun 28, 2021

da1nerd commented Jun 28, 2021

da1nerd commented Jun 28, 2021

da1nerd commented Jun 28, 2021 • edited Loading

da1nerd commented Jun 28, 2021

da1nerd commented Jul 27, 2021

RobH123 commented Jul 27, 2021 • edited Loading

da1nerd commented Jul 27, 2021

RobH123 commented Jul 27, 2021 • edited Loading

da1nerd commented Jul 27, 2021

RobH123 commented Jul 28, 2021 • edited Loading

birchamp commented Jul 28, 2021

RobH123 commented Jul 28, 2021

birchamp commented Jul 29, 2021

RobH123 commented Jul 29, 2021

birchamp commented Jul 29, 2021

da1nerd commented Jul 29, 2021 • edited Loading

birchamp commented Jul 30, 2021

da1nerd commented Jul 30, 2021

birchamp commented Jul 30, 2021

birchamp commented Aug 17, 2021

da1nerd commented Aug 17, 2021

birchamp commented Aug 22, 2021

da1nerd commented Aug 24, 2021

da1nerd commented Aug 27, 2021

RobH123 commented Aug 28, 2021

birchamp commented Aug 30, 2021

da1nerd commented Aug 30, 2021

RobH123 commented Aug 31, 2021

da1nerd commented Sep 1, 2021 • edited Loading

RobH123 commented Sep 2, 2021

da1nerd commented Sep 2, 2021

birchamp commented Sep 9, 2021

da1nerd commented Sep 11, 2021

RobH123 commented Sep 16, 2021

da1nerd commented Sep 17, 2021

mannycolon commented Sep 24, 2021

mannycolon commented Sep 24, 2021

elsylambert commented Nov 5, 2021

gayathrihayden commented Nov 5, 2021

elsylambert commented Nov 9, 2021

gayathrihayden commented Nov 15, 2021

PhotoNomad0 commented Dec 2, 2021 • edited Loading

PhotoNomad0 commented Dec 4, 2021

da1nerd commented Jun 28, 2021 •

edited

Loading

RobH123 commented Jul 27, 2021 •

edited

Loading

RobH123 commented Jul 27, 2021 •

edited

Loading

RobH123 commented Jul 28, 2021 •

edited

Loading

da1nerd commented Jul 29, 2021 •

edited

Loading

da1nerd commented Sep 1, 2021 •

edited

Loading

PhotoNomad0 commented Dec 2, 2021 •

edited

Loading