Incompatible with spacy v2.2.3? #18

mcswell · 2020-05-21T22:22:02Z

I have spacy v2.1.9 installed on one machine, and 2.2.3 (the current latest version) on another. I installed spacy-ru on both, but it only runs well on the 2.1.9 machine. On the 2.2.3 machine, when I do the
doc=nlp(s)
step (with s=Russian text), I get the error

doc=nlp(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 435, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "pipes.pyx", line 397, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 442, in spacy.pipeline.pipes.Tagger.set_annotations
  File "morphology.pyx", line 312, in spacy.morphology.Morphology.assign_tag_id
  File "morphology.pyx", line 200, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Person' (2313063860588076218). 
This can happen if the tagger was trained with a different set of morphological features. 
If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate

I guess I could build spacy-ru from source and maybe this would solve the problem, but I'm not sure I'm up to that. What I did instead was to uninstall version 2.2.3 of spacy, and install version 2.1.9 in its place, so now spacy-ru works on both machines.

But I'd rather be using the current version of spacy, which I use for a couple other languages as well. (Even better, I'd like spacy-ru to be immune to version changes in spacy, but I suppose that's asking a bit much :-).)

Is there a (simple) way to make spacy-ru compatible with v2.2 of spacy?


buriy · 2020-05-22T05:22:00Z

This is a bug in spaCy: it doesn't accept the numerical feature values used in the SynTagRus dataset that the model was trained on ("Person=1", "Person=2", "Person=3").
I have a version with these tags changed (to "Person=first", etc.) that works correctly with the 2.2 branch.
I'll prepare and upload it early next week.
You can also make this change in the dataset and train the model yourself in several hours (see the Makefile).
I'm just preparing a version with vectors properly integrated, which should improve the resulting POS and DEP quality a little.
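
For anyone who wants to try the dataset change by hand, here is a minimal sketch of the renaming step. It assumes the training data is in CoNLL-U format, with the morphological features in the sixth tab-separated column. Only "Person=first" is confirmed above; the names "second" and "third", and the helper names `rename_person` and `rewrite_conllu_line`, are assumptions made for illustration and are not taken from the spacy-ru Makefile.

```python
# Hypothetical mapping: the thread only confirms "first"; the other two names are guesses.
PERSON_NAMES = {"1": "first", "2": "second", "3": "third"}


def rename_person(feats: str) -> str:
    """Rewrite Person=1/2/3 inside a FEATS string such as 'Number=Sing|Person=3'."""
    if feats == "_":  # CoNLL-U marks an empty FEATS column with an underscore
        return feats
    renamed = []
    for pair in feats.split("|"):
        name, sep, value = pair.partition("=")
        if name == "Person" and value in PERSON_NAMES:
            value = PERSON_NAMES[value]
        renamed.append(name + sep + value)
    return "|".join(renamed)


def rewrite_conllu_line(line: str) -> str:
    """Apply rename_person to the FEATS column (6th field) of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # blank lines and comment lines pass through unchanged
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:  # not a regular 10-column token line; leave it alone
        return line
    cols[5] = rename_person(cols[5])
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")
```

Running `rewrite_conllu_line` over every line of the training and dev files would be the whole preprocessing step; the model still has to be retrained afterwards.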

buriy · 2020-05-22T05:23:24Z

And btw the latest version is 2.2.4 https://pypi.org/project/spacy/#history :)

mcswell · 2020-05-22T14:08:39Z

Thank you for the quick reply! I don't have a GPU (at least not one that works for ML), so I guess I'll wait until next week.

And I wish I could speak Russian like you do English :-)

mcswell · 2020-05-23T05:33:55Z

On a related topic: I notice that spacy-ru (at least the version I have) converts things that I think are acronyms into their lower case equivalents. For example, СССР becomes ссср. I *think* acronyms should remain upper case for downstream processing--at least I wouldn't expect the English 'NASA' to be returned by an English lemmatizer as 'nasa'. Of course I don't know Russian...

When I've worked with other languages and I want to avoid lower casing acronyms, I do a regex search that looks for upper case letters after the first letter (since an upper case first letter could just be due to sentence capitalization). A similar issue happens with tokens that contain token-internal numbers; often these are chemical names, like H2O or O2 (the '2' might be subscripted using something like LaTeX, or it could just be the Unicode subscript '2', U+2082).

The regex (using the regex library, not the re library) to match an upper case letter or digit is:

rxAcronym = regex.compile(r"[\p{Uppercase_letter}\p{Digit}]")

so I do a search on all but the first character of a token; if the search does not find an upper case letter or a digit, then I go ahead and lower-case the token:

if not rxAcronym.search(sToken[1:]):
    sToken = sToken.lower()

Would it make sense to do this for Russian? I suspect it would be done in the spacy-ru code in the file lemmatizer.py, probably in one or more of the places where that code now contains string.lower()

Mike Maxwell
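
The same check can be sketched with the standard library only, which avoids the dependency on the third-party regex package. This version swaps the regex for `str.isupper()` and `str.isdigit()`; the function name `lower_unless_acronym` is hypothetical and is not part of spacy-ru.

```python
def lower_unless_acronym(token: str) -> str:
    """Lower-case a token unless a character after the first is upper case or a digit."""
    # The first character is skipped because a capital there may only reflect
    # sentence-initial capitalization. str.isdigit() is also true for Unicode
    # subscript digits such as U+2082.
    if any(ch.isupper() or ch.isdigit() for ch in token[1:]):
        return token
    return token.lower()


for word in ["СССР", "Москва", "NASA", "H2O", "The"]:
    print(word, "->", lower_unless_acronym(word))
```

With this rule СССР, NASA and H2O are left alone, while Москва and The are lower-cased, so proper nouns would still need separate handling.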


buriy · 2020-05-23T05:48:20Z

Oh, you're right. spaCy does some re-capitalization of the lemmas, so I will need to do the same in the Russian version. Thanks for pointing it out; somehow I missed it completely.
Please note that in spaCy this behavior is inconsistent and depends on whether the POS tagger was used, etc.
How it works: each token has a shape attribute (token.shape_), which can be Xxx, XXX, xxx and so on, and which is then used to restore the capitalization. Only very rare words have mixed capitalization like "spaCy"; their lemmas will be re-capitalized according to what the shape shows for them.
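
As a rough illustration of the idea (not spaCy's actual code), shape-based case restoration could look like the sketch below. Both `word_shape` and `restore_case` are hypothetical helpers: `word_shape` is a simplified stand-in for the shape string, and `restore_case` only handles the all-caps and initial-cap shapes, leaving mixed-case words such as "spaCy" untouched.

```python
def word_shape(text: str) -> str:
    """Simplified shape: upper-case letters become X, lower-case x, digits d."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation and other characters are kept as-is
    return "".join(out)


def restore_case(lemma: str, shape: str) -> str:
    """Re-capitalize a lower-cased lemma according to the original token's shape."""
    letters = [c for c in shape if c in "Xx"]
    if letters and all(c == "X" for c in letters):
        return lemma.upper()  # XXX: acronym-like token
    if shape[:1] == "X" and all(c == "x" for c in letters[1:]):
        return lemma[:1].upper() + lemma[1:]  # Xxx: capitalized token
    return lemma  # xxx or mixed case: leave the lemma unchanged


print(restore_case("ссср", word_shape("СССР")))
print(restore_case("москва", word_shape("Москвы")))
```

Note that the Xxx rule also capitalizes ordinary words at the start of a sentence, which may be one reason the behavior ends up looking inconsistent.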

lexmosolov · 2020-06-24T21:32:24Z

Will "ru2" work well with version 2.3.0?

mcswell · 2020-06-24T22:53:01Z

I've installed spacy v2.3.0:
>>> spacy.__version__
'2.3.0'
When I load the existing version of ru2 using
nlp = spacy.load(<localFile>)
I get a warning that

Model 'ru_model' (0.2) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.0).

And when I try to use nlp(<RussianSentence>), I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/spacy/language.py", line 446, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "pipes.pyx", line 398, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 443, in spacy.pipeline.pipes.Tagger.set_annotations
  File "morphology.pyx", line 315, in spacy.morphology.Morphology.assign_tag_id
  File "morphology.pyx", line 203, in spacy.morphology.Morphology.add
ValueError: [E167] Unknown morphological feature: 'Person' (2313063860588076218).
This can happen if the tagger was trained with a different set of morphological features.
If you're using a pretrained model, make sure that your models are up to date:
python -m spacy validate

So it looks like the answer is no.
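
A mismatch like this can be caught before calling nlp() by comparing the installed version with the requirement recorded in the model directory. The sketch below assumes the model ships a meta.json with a "spacy_version" entry such as ">=2.1.0,<2.2.0"; the helper names are hypothetical, and only plain dotted version numbers are handled.

```python
import json
from pathlib import Path


def parse_version(text: str) -> tuple:
    """Turn '2.3.0' into (2, 3, 0); pre-release suffixes are not supported."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(installed: str, spec: str) -> bool:
    """Check a version against a comma-separated spec such as '>=2.1.0,<2.2.0'."""
    have = parse_version(installed)
    for clause in spec.split(","):
        clause = clause.strip()
        # Two-character operators are tried first so '>=' is not read as '>'
        for op in (">=", "<=", "==", ">", "<"):
            if clause.startswith(op):
                want = parse_version(clause[len(op):])
                ok = {
                    ">=": have >= want,
                    "<=": have <= want,
                    "==": have == want,
                    ">": have > want,
                    "<": have < want,
                }[op]
                if not ok:
                    return False
                break
        else:
            raise ValueError("unsupported version clause: " + clause)
    return True


def model_is_compatible(model_dir: str, installed: str) -> bool:
    """Read <model_dir>/meta.json and test its 'spacy_version' requirement."""
    meta = json.loads(Path(model_dir, "meta.json").read_text(encoding="utf8"))
    return satisfies(installed, meta["spacy_version"])
```

Calling `model_is_compatible(<localFile>, spacy.__version__)` before loading would then flag the old ru2 model under 2.3.0 without running into E167.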

buriy · 2020-06-25T07:00:18Z

We'll have a version for Spacy 2.2 and Spacy 2.3 on Monday.

gonzagazzz · 2020-06-30T09:48:06Z

Looking forward to the 2.3 support!

buriy · 2020-07-10T16:49:30Z

I've just published a SynTagRus-based POS & DEP model for 2.3; a NER model and an MIT-licensed POS & DEP model will be published in several days.
https://github.com/buriy/spacy-ru/releases/tag/v2.3_pre1

How to use it: unpack into your project root folder, then

import ru2_syntagrus
ru2_syntagrus.load_ru2('path_to/ru2_syntagrus')

Or you could just use spacy.load('path_to/ru2_syntagrus/') but then lemmas will be a bit worse.

buriy mentioned this issue May 23, 2020

Add lemma case restoring in spaCy #19

Open

buriy closed this as completed Jul 12, 2020
