Releases: projecte-aina/spacy

Release Corpus 3.2.9

20 Oct 15:29
3a7f46f

Latest AnCora corpus for training spaCy 3.2 for Catalan, prepared by the Text Mining Unit of the Barcelona Supercomputing Center from the UD version.

  • Added IOB-NER labels in the last column
  • Normalized lemmas
  • Added "SpaceAfter=No" after a verb followed by a clitic and before apostrophes
  • Modified some forms in column 1 so that they match the text form
  • Removed multi-word token lines
  • Some minor fixes
  • Created new splits from the UD corpus that increase the size of the train set
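
The removal of multi-word token lines can be illustrated with a minimal sketch. It assumes the data is in the standard CoNLL-U layout, where a multi-word token line carries an ID range such as `1-2` and is followed by one line per syntactic word. The function name and the sample sentence are hypothetical and are not taken from the project's scripts.

```python
def drop_multiword_token_lines(conllu_text: str) -> str:
    """Return CoNLL-U text without multi-word token (ID range) lines."""
    kept = []
    for line in conllu_text.splitlines():
        # Comment lines and blank sentence separators pass through unchanged.
        if line and not line.startswith("#"):
            token_id = line.split("\t", 1)[0]
            # A range ID such as "1-2" marks a multi-word token line.
            if "-" in token_id:
                continue
        kept.append(line)
    return "\n".join(kept)


# Hypothetical sample: the contraction "del" covers the words "de" + "el".
sample = "\n".join([
    "# text = del poble",
    "1-2\tdel\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tel\tel\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tpoble\tpoble\tNOUN\t_\t_\t0\troot\t_\t_",
])
cleaned = drop_multiword_token_lines(sample)
```

After this step the contraction survives only as its component words, which then act as independent tokens.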

3.2.8

22 Jul 12:13
4f915ab

20210714
Dataset built from version 2.8, with the following changes:
  • Removed multi-word tokens (they now remain as two or more independent tokens). This includes preposition+article and verb+clitic
  • Normalized the lemmas of prepositions and pronouns
  • Added the hyphen to enclitics

Minor improvements to training data before 3.1 release

13 Jul 09:23
b3505fb

Minor improvements to training data before the 3.1 release. Removed multi-word tokens from the UD data.

New model Releases

11 Jun 12:41
b00ac54

3.2.6 releases with small errors corrected in the training datasets. Improved evaluation and better lemmatization, POS tagging and sentence segmentation.

Training Datasets from Ancora

10 Jun 10:31
480b617
Pre-release

Training datasets from AnCora, with some minor corrections for the SpaceAfter=No tag

Releases using tar.gz

02 Jun 07:10
e00a706

Experimental releases using tar archives for serverless environments

Training Datasets from Ancora

01 Jun 11:55
a5885ce

New training datasets with improved pronoun lemmatization and tokenization

Lookup tables for lemmatizer

21 May 10:35
ed8f00c

Lemmas lookup tables to incorporate into lemmatization
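
As a rough illustration of what a lemma lookup table does, the sketch below maps a surface form to a lemma through a plain dictionary and falls back to the form itself when no entry exists. The function name and the two table entries are made up for the example. They are not the released tables, and this is not spaCy's own lemmatizer code.

```python
from typing import Dict


def lookup_lemma(form: str, table: Dict[str, str]) -> str:
    """Return the lemma stored for `form`, or `form` itself if it is missing."""
    return table.get(form, form)


# Illustrative entries only (assumed for this sketch).
example_table = {"cases": "casa", "són": "ser"}

lemmas = [lookup_lemma(w, example_table) for w in ["cases", "són", "Barcelona"]]
```

The fallback keeps forms that are missing from the table, such as proper nouns, unchanged instead of failing.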

Fasttext ca embeddings for spacy

20 May 13:32
fdee914

FastText embeddings from the TextCat corpus, as described in https://doi.org/10.5281/zenodo.4522040
We are using the 300-dimensional CBOW vectors, converted for spaCy.
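
For orientation, the following sketch reads vectors in the plain-text `.vec` layout that fastText commonly exports: a header line with the word count and dimension, then one word per line followed by its values. It assumes the embeddings are available in that layout. The function name and the tiny 3-dimensional sample are hypothetical (the released vectors have 300 dimensions), and the release does not state how the conversion for spaCy was actually done.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_vec(handle: TextIO) -> Tuple[int, Dict[str, List[float]]]:
    """Parse a text .vec stream into (dimension, {word: vector})."""
    # Header line: "<number of words> <dimension>".
    _count, dim = (int(x) for x in handle.readline().split())
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        # Skip blank or malformed lines that lack exactly `dim` values.
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Hypothetical 3-dimensional sample standing in for a real file.
sample_vec = io.StringIO("2 3\ncasa 0.1 0.2 0.3\npoble 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample_vec)
```

spaCy v3 provides an `init vectors` command for turning such a file into a loadable vectors package.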

ca_core_web_lg

20 May 09:59
30f8465

Base model without BERTa transformer, using only FastText embeddings