Incompatible with spacy v2.2.3? #18
This is a bug in spacy, that it doesn't allow numerical features in the Syntagrus dataset used for training ("Person=1", "Person=2", "Person=3").
And btw the latest version is 2.2.4 https://pypi.org/project/spacy/#history :)
Thank you for the quick reply! I don't have a GPU (at least not one that works for ML), so I guess I'll wait until next week. And I wish I could speak Russian like you do English :-)
On a related topic: I notice that spacy-ru (at least the version I have)
converts things that I think are acronyms into their lower case
equivalents. For example, СССР becomes ссср. I *think* acronyms should
remain upper case for downstream processing--at least I wouldn't expect
the English 'NASA' to be returned by an English lemmatizer as 'nasa'.
Of course I don't know Russian...
When I've worked with other languages and I want to avoid lower casing
acronyms, I do a regex search that looks for upper case letters after
the first letter (since an upper case first letter could just be due to
sentence capitalization).
A similar issue happens with tokens that contain token-internal numbers;
often these are chemical names, like H2O or O2 (the '2' might be
subscripted using something like LaTeX, or it could just be the Unicode
subscript '2', U+2082).
The regex (using the regex library, not the re library) to match an
upper case letter or digit is:
rxAcronym = regex.compile(r"[\p{Uppercase_letter}\p{Digit}]")
so I do a search on all but the first character of a token; if the
search does not find an upper case letter or a digit, then I go ahead
and lower-case the token:
if not rxAcronym.search(sToken[1:]):
    sToken = sToken.lower()
Would it make sense to do this for Russian? I suspect it would be done
in the spacy-ru code in the file lemmatizer.py, probably in one or more
of the places where that code now contains string.lower()
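For anyone who wants to try the idea without installing the regex library, here is a minimal standard-library sketch of the same check. The function name `lower_unless_acronym` is made up for illustration and is not part of spacy or spacy-ru; it relies on `str.isupper()` and `str.isdigit()` rather than Unicode property classes, so its coverage may differ slightly from the regex version.

```python
def lower_unless_acronym(token: str) -> str:
    """Lower-case a token unless it looks like an acronym or formula.

    Hypothetical helper, not spacy-ru code. A token is left untouched when
    any character after the first is upper case or a digit; the first
    character is ignored because it may be capitalized only by sentence
    position.
    """
    if any(ch.isupper() or ch.isdigit() for ch in token[1:]):
        return token
    return token.lower()


if __name__ == "__main__":
    for word in ["СССР", "Москва", "NASA", "H2O", "The"]:
        print(word, "->", lower_unless_acronym(word))
```

One known limitation of this heuristic: a single-letter token has nothing after its first character, so it is always lower-cased.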
Mike Maxwell
On 5/22/2020 1:22 AM, Yuri Baburov wrote:
This is a bug in spacy, that it doesn't allow numerical features in the
Syntagrus dataset used for training ("Person=1", "Person=2", "Person=3").
I have a version with this tag changed (to "Person=first" etc), that
will work with 2.2 branch correctly.
I'll prepare and upload it early next week.
You can also do this change and train it yourself in several hours.
I'm just preparing a version with vectors properly integrated, which
should improve the resulting POS and DEP quality a little bit.
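For readers who want to attempt the tag change themselves, the sketch below shows one way it might be done, assuming the training data is in CoNLL-U format (ten tab-separated columns, with morphological features in the sixth column, separated by "|"). The helper name `rename_person_feature` and the values "second" and "third" are assumptions for illustration; the thread only confirms "Person=first", and this is not the script actually used for spacy-ru.

```python
# Assumed mapping: only "first" is confirmed in the thread; the rest are guesses.
PERSON_NAMES = {"1": "first", "2": "second", "3": "third"}


def rename_person_feature(conllu_line: str) -> str:
    """Rewrite numeric Person values in the FEATS column of one CoNLL-U line."""
    # Blank lines (sentence breaks) and comment lines pass through unchanged.
    if not conllu_line.strip() or conllu_line.startswith("#"):
        return conllu_line
    cols = conllu_line.rstrip("\n").split("\t")
    # Leave malformed lines and tokens without features ("_") untouched.
    if len(cols) != 10 or cols[5] == "_":
        return conllu_line
    feats = []
    for feat in cols[5].split("|"):
        name, _, value = feat.partition("=")
        if name == "Person" and value in PERSON_NAMES:
            feat = f"Person={PERSON_NAMES[value]}"
        feats.append(feat)
    cols[5] = "|".join(feats)
    newline = "\n" if conllu_line.endswith("\n") else ""
    return "\t".join(cols) + newline
```

Applying the function to every line of the train, dev, and test files and then re-running the usual conversion and training steps would be the rest of the job; those steps are not shown here because the thread does not describe them.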
Oh, you're right. SpaCy has some re-capitalization for the lemmas, so I will need to do the same in the Russian version. Thanks for pointing it out, somehow I missed it completely.
Will "ru2" work well with version 2.3.0?
I've installed spacy v2.3.0:
And when I try to use
So it looks like the answer is no.
We'll have a version for Spacy 2.2 and Spacy 2.3 on Monday.
Looking forward to the 2.3 support!
I've just published a SynTagRus-based POS & DEP model for 2.3; a NER model and an MIT-licensed POS & DEP model will be published in several days. How to use it: unpack into your project root folder, then
Or you could just use
I have spacy v2.1.9 installed on one machine, and 2.2.3 (the current latest version) on another. I installed spacy-ru on both, but it only runs well on the 2.1.9 machine. On the 2.2.3 machine, when I do the doc=nlp(s) step (with s=Russian text), I get the error
I guess I could build spacy-ru from source and maybe this would solve the problem, but I'm not sure I'm up to that. What I did instead was to uninstall version 2.2.3 of spacy, and install version 2.1.9 in its place, so now spacy-ru works on both machines.
But I'd rather be using the current version of spacy, which I use for a couple other languages as well. (Even better, I'd like spacy-ru to be immune to version changes in spacy, but I suppose that's asking a bit much :-).)
Is there a (simple) way to make spacy-ru compatible with v2.2 of spacy?