Releases: projecte-aina/spacy
Release Corpus 3.2.9
Latest AnCora corpus for training spaCy 3.2 for Catalan, prepared by the Text Mining Unit of the Barcelona Supercomputing Center from the UD version.
- Added IOB-NER labels in the last column
- Normalized lemmas
- Added "SpaceAfter=No" after verbs followed by a clitic and before apostrophes
- Modified some column 1 forms so that they match the text form
- Removed multi-word token lines
- Some minor fixes
- Created new splits from the UD corpus that increase the size of the training set.
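The changes listed above can be illustrated with a minimal sketch of reading one sentence in this kind of format. It assumes the ten standard CoNLL-U columns plus the IOB-NER label appended as an eleventh column; the release notes only say the label is "in the last column", so that layout is an assumption. The function names and the sample sentence, including its lemmas and labels, are invented for illustration and are not taken from the corpus.

```python
def read_sentence(lines):
    """Parse CoNLL-U-style token lines into a list of token dicts."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        token_id = cols[0]
        # In CoNLL-U, multi-word tokens have range IDs such as "5-6" and
        # empty nodes have decimal IDs such as "5.1"; skip both.
        if "-" in token_id or "." in token_id:
            continue
        tokens.append({
            "form": cols[1],
            "lemma": cols[2],
            # Assumption: MISC is still the tenth column (index 9).
            "space_after": "SpaceAfter=No" not in cols[9].split("|"),
            # Assumption: the IOB-NER label is the last column.
            "ner": cols[-1],
        })
    return tokens


def detokenize(tokens):
    """Rebuild the surface text from forms and SpaceAfter information."""
    return "".join(
        t["form"] + (" " if t["space_after"] else "") for t in tokens
    ).strip()


# Invented sample: a verb + clitic, an apostrophe, and a multi-word token line.
ROWS = [
    ["1", "Dóna", "donar", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No", "O"],
    ["2", "-li", "ell", "PRON", "_", "_", "1", "obj", "_", "_", "O"],
    ["3", "l'", "el", "DET", "_", "_", "4", "det", "_", "SpaceAfter=No", "O"],
    ["4", "aigua", "aigua", "NOUN", "_", "_", "1", "obj", "_", "_", "O"],
    ["5-6", "del", "_", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["5", "de", "de", "ADP", "_", "_", "7", "case", "_", "_", "O"],
    ["6", "el", "el", "DET", "_", "_", "7", "det", "_", "_", "O"],
    ["7", "Ter", "Ter", "PROPN", "_", "_", "4", "nmod", "_", "_", "B-LOC"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

tokens = read_sentence(SAMPLE.splitlines())
text = detokenize(tokens)
```

In this sketch the "5-6 del" line is dropped, so the contraction survives only as the two independent tokens "de" and "el".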
3.2.8
20210714
Dataset built from version 2.8, with the following changes:
- Removed multi-word tokens (they now remain as two or more independent tokens). This includes preposition+article and verb+clitic
- Normalized the lemmas of prepositions and pronouns
- Added the hyphen to enclitics
Minor improvements to training data before 3.1 release
Minor improvements to training data before the 3.1 release. Multi-word tokens removed from the UD data
New model Releases
3.2.6 releases, with small errors corrected in the training datasets. Improved evaluation and better lemmatization, POS tagging and sentence segmentation.
Training Datasets from Ancora
Training datasets from AnCora, with some minor corrections to the SpaceAfter=No tag
Releases using tar.gz
Experimental releases using tar archives for serverless environments
Training Datasets from Ancora
New training datasets with improved pronoun lemmatization and tokenization
Lookup tables for lemmatizer
Lemma lookup tables to be incorporated into lemmatization
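As a rough illustration of what lookup-based lemmatization means, the sketch below maps a surface form to a lemma through a plain dictionary and falls back to the form itself when there is no entry. The table contents and the function name `lookup_lemma` are invented for this example. The format of the released tables and how they are wired into the pipeline are not described here, so this is not the project's implementation.

```python
# Hypothetical mini lookup table: surface form -> lemma.
LEMMA_TABLE = {
    "cases": "casa",
    "canten": "cantar",
    "-li": "ell",
}


def lookup_lemma(form, table):
    """Return the lemma for a form, trying the exact form and then lowercase."""
    if form in table:
        return table[form]
    lowered = form.lower()
    if lowered in table:
        return table[lowered]
    # No entry: fall back to the form unchanged.
    return form


lemmas = [lookup_lemma(w, LEMMA_TABLE) for w in ["Cases", "canten", "Barcelona"]]
```

A table lookup is fast and deterministic, but it cannot disambiguate forms that have different lemmas in different contexts.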
Fasttext ca embeddings for spacy
Fasttext embeddings from TextCat corpus, as described in https://doi.org/10.5281/zenodo.4522040
We use the 300-dimensional CBOW vectors, converted for spaCy.
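For readers unfamiliar with the input side of such a conversion, here is a minimal sketch of reading fastText's plain-text `.vec` format: a header line with the vocabulary size and dimension, then one word per line followed by its values. The toy data below uses 3 dimensions instead of 300 and is invented. The conversion procedure actually used for this release is not stated; spaCy v3 ships an `init vectors` command for this kind of task, but whether it was used here is an assumption.

```python
import io


def read_vec(handle):
    """Read fastText text-format vectors from an open text file handle."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if parts == [""]:
            continue  # tolerate blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"header announced {count} words, found {len(vectors)}")
    return dim, vectors


# Invented toy file: 2 words, 3 dimensions.
TOY_VEC = "2 3\ncasa 0.1 0.2 0.3\ngat -0.5 0.0 0.25\n"
dim, vectors = read_vec(io.StringIO(TOY_VEC))
```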
ca_core_web_lg
Base model without BERTa transformer, using only FastText embeddings