
Release 0.2.0 Dev Branch #48

Merged
merged 25 commits on Jan 20, 2025

Commits
fd32c18
Setup initial spaCy project for v0.2.0
ljvmiranda921 Jan 4, 2025
cbf9967
Update README
ljvmiranda921 Jan 5, 2025
cd742a8
Add requirements.txt
ljvmiranda921 Jan 5, 2025
b9a2b81
Fix link for tlunified-ner
ljvmiranda921 Jan 5, 2025
0d87c42
Fix path to pretraining data
ljvmiranda921 Jan 5, 2025
9c3f649
Add meta.json file
ljvmiranda921 Jan 5, 2025
313a1d7
Add cupy-cuda113 to dependencies
ljvmiranda921 Jan 5, 2025
9d03a4a
Add other necessary dependencies
ljvmiranda921 Jan 5, 2025
5528e89
Fix cuda version for spaCy
ljvmiranda921 Jan 5, 2025
5f12d85
Some cache hotfixes
ljvmiranda921 Jan 6, 2025
edac1c5
Train using GPU
ljvmiranda921 Jan 6, 2025
16b7751
Add stubs for evaluation
ljvmiranda921 Jan 18, 2025
b2736e5
Add artifacts for other NER test datasets
ljvmiranda921 Jan 18, 2025
cda7ca3
Add conversion script for spaCy files (#52)
ljvmiranda921 Jan 19, 2025
dcdef13
Add evaluation workflows
ljvmiranda921 Jan 19, 2025
b8d198f
Add some fixes
ljvmiranda921 Jan 19, 2025
835a332
Fix incorrect sizes
ljvmiranda921 Jan 19, 2025
2d2e532
Do not ignore evaluations
ljvmiranda921 Jan 19, 2025
0586d66
Save all evaluations
Jan 19, 2025
c7c6f53
Add script to report results (#53)
ljvmiranda921 Jan 19, 2025
3c35e16
Remove unnecessary functions
ljvmiranda921 Jan 19, 2025
1c18941
Update website and blog for releases (#54)
ljvmiranda921 Jan 20, 2025
4e96763
Update loaders to include 0.2.0 models
ljvmiranda921 Jan 20, 2025
0147239
Bump version to 0.2.0
ljvmiranda921 Jan 20, 2025
2964367
[ci skip] Update release date
ljvmiranda921 Jan 20, 2025
2 changes: 1 addition & 1 deletion calamancy/__init__.py
@@ -1,4 +1,4 @@
__version__ = "0.1.4"
__version__ = "0.2.0"

from .inference import EntityRecognizer, Parser, Tagger
from .loaders import get_latest_version, load, models
3 changes: 3 additions & 0 deletions calamancy/loaders.py
@@ -18,6 +18,9 @@ def _get_models_url() -> Dict[str, str]:
tracked and the download functions below work as expected.
"""
return {
"tl_calamancy_md-0.2.0": f"https://huggingface.co/ljvmiranda921/tl_calamancy_md/resolve/{GIT_REF}/tl_calamancy_md-any-py3-none-any.whl",
"tl_calamancy_lg-0.2.0": f"https://huggingface.co/ljvmiranda921/tl_calamancy_lg/resolve/{GIT_REF}/tl_calamancy_lg-any-py3-none-any.whl",
"tl_calamancy_trf-0.2.0": f"https://huggingface.co/ljvmiranda921/tl_calamancy_trf/resolve/{GIT_REF}/tl_calamancy_trf-any-py3-none-any.whl",
"tl_calamancy_md-0.1.0": f"https://huggingface.co/ljvmiranda921/tl_calamancy_md-0.1.0/resolve/{GIT_REF}/tl_calamancy_md-0.1.0-py3-none-any.whl",
"tl_calamancy_lg-0.1.0": f"https://huggingface.co/ljvmiranda921/tl_calamancy_lg-0.1.0/resolve/{GIT_REF}/tl_calamancy_lg-0.1.0-py3-none-any.whl",
"tl_calamancy_trf-0.1.0": f"https://huggingface.co/ljvmiranda921/tl_calamancy_trf-0.1.0/resolve/{GIT_REF}/tl_calamancy_trf-0.1.0-py3-none-any.whl",
2 changes: 2 additions & 0 deletions models/v0.2.0/.gitignore
@@ -0,0 +1,2 @@
packages
models
124 changes: 124 additions & 0 deletions models/v0.2.0/README.md
@@ -0,0 +1,124 @@
<!-- WEASEL: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 Weasel Project: Release v0.2.0

This is a spaCy project that trains the v0.2.0 models for calamanCy.
Here are some of the major changes in this release:

- **Included a trainable lemmatizer in the pipeline**: instead of a rules-based
  lemmatizer, we now use the [neural edit-tree
  lemmatizer](https://explosion.ai/blog/edit-tree-lemmatizer) (see the short usage sketch after this list).
- **Trained on UD-NewsCrawl**: this is a major update, as we now train our
  parser, tagger, and morphologizer components on the larger
  [UD-NewsCrawl](https://huggingface.co/datasets/UD-Filipino/UD_Tagalog-NewsCrawl)
  treebank. Our training data has grown from 150+ to roughly 15,000 annotated
  sentences. From this point forward, we will use the UD-TRG and UD-Ugnayan
  treebanks as test sets (as intended).
- **Better evaluations**: aside from evaluating our dependency parser and POS tagger on UD-TRG and UD-Ugnayan, we now also use Universal NER ([Mayhew et al., 2023](https://arxiv.org/abs/2311.09122)) as the test set for the NER component.
- **Improved base model for tl_calamancy_trf**: based on internal evaluations, we now use [mDeBERTa-v3 (base)](https://huggingface.co/microsoft/mdeberta-v3-base) as the source of context-sensitive vectors for tl_calamancy_trf.
- **Simpler pipelines, no more pretraining**: we found that pretraining offers only marginal gains (0-1%) relative to the effort and compute it requires, so we removed it from the calamanCy recipe to make the whole pipeline easier to train.
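
As a quick illustration of the updated pipeline, the minimal sketch below loads the medium model and prints the outputs of the new components. The versioned model name follows the keys registered in `calamancy/loaders.py`, and the sample sentence is only an illustrative example:

```python
import calamancy

# Download (if needed) and load the medium pipeline by its versioned name.
nlp = calamancy.load("tl_calamancy_md-0.2.0")

doc = nlp("Pumunta si Juan sa Maynila kahapon.")  # example sentence only
for token in doc:
    # lemma_ comes from the neural edit-tree lemmatizer; pos_, morph, and dep_
    # come from the tagger, morphologizer, and parser trained on UD-NewsCrawl.
    print(token.text, token.lemma_, token.pos_, token.morph, token.dep_)

print(doc.ents)  # entities from the NER component
```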

The namespaces for the latest models remain the same.
The legacy models will have an explicit version number in their HuggingFace repositories.
Please see [this HuggingFace collection](https://huggingface.co/collections/ljvmiranda921/calamancy-models-for-tagalog-nlp-65629cc46ef2a1d0f9605c87) for more information.
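
To load a model programmatically, you can use the loader helpers exported from `calamancy`. The snippet below is a minimal sketch: the function names match the imports in `calamancy/__init__.py`, though exact return values may differ slightly.

```python
import calamancy

# List all registered model names, including the new 0.2.0 entries.
print(list(calamancy.models()))

# Resolve the latest released version for a namespace.
print(calamancy.get_latest_version("tl_calamancy_md"))

# Pin an explicit version when you need a legacy model.
nlp_legacy = calamancy.load("tl_calamancy_md-0.1.0")
```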

## Set-up

You can use this project to replicate the pipelines shipped with this release.
First, install the required dependencies:

```sh
pip install -r requirements.txt
```

Then run the set-up commands:

```sh
python -m spacy project assets
python -m spacy project run setup
```

This step downloads all assets and prepares the datasets and binaries needed for
training. For example, if you want to train `tl_calamancy_md`, run the following command:

```sh
MODEL=tl_calamancy_md scripts/train.sh
```


## Model information

The table below shows an overview of the calamanCy models in this project. For more information,
I suggest checking the [language pipeline metadata](https://spacy.io/api/language#meta).


| Model | Pipelines | Description |
|-----------------------------|---------------------------------------------|--------------------------------------------------------------------------------------------------------------|
| tl_calamancy_md (214 MB) | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Using floret vectors (50k keys) |
| tl_calamancy_lg (482 MB)    | tok2vec, tagger, trainable_lemmatizer, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Using fastText vectors (714k keys) |
| tl_calamancy_trf (1.7 GB)   | transformer, tagger, trainable_lemmatizer, morphologizer, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses mdeberta-v3-base for context-sensitive vectors. |
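
To double-check the table above against an installed pipeline, you can inspect spaCy's pipeline metadata directly. This is a minimal sketch; the exact fields depend on the packaged `meta.json`:

```python
import calamancy

nlp = calamancy.load("tl_calamancy_md-0.2.0")

# Language.meta exposes the packaged meta.json.
print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])
print(nlp.pipe_names)              # should match the "Pipelines" column above
print(nlp.meta.get("labels", {}))  # per-component labels, e.g. NER entity types
```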


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `setup-finetuning-data` | Prepare the Tagalog corpora used for training various spaCy components |
| `setup-fasttext-vectors` | Make fastText vectors spaCy compatible |
| `build-floret` | Build floret binary for training fastText / floret vectors |
| `train-vectors-md` | Train medium-sized word vectors (200 dims, 200k keys) using the floret binary. |
| `train-parser` | Train a trainable_lemmatizer, parser, tagger, and morphologizer using the Universal Dependencies treebanks |
| `train-parser-trf` | Train a trainable_lemmatizer, parser, tagger, and morphologizer (transformer variant) using the Universal Dependencies treebanks |
| `train-ner` | Train the NER component |
| `train-ner-trf` | Train the NER component (transformer variant) |
| `assemble` | Assemble pipelines to create a single spaCy pipeline |
| `assemble-trf` | Assemble pipelines to create a single spaCy pipeline (transformer variant) |
| `setup-eval-data` | Convert remaining test datasets |
| `evaluate-model` | Evaluate a model |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `setup` | `setup-finetuning-data` &rarr; `setup-fasttext-vectors` &rarr; `build-floret` &rarr; `train-vectors-md` |
| `tl-calamancy` | `train-parser` &rarr; `train-ner` &rarr; `assemble` |
| `tl-calamancy-trf` | `train-parser-trf` &rarr; `train-ner-trf` &rarr; `assemble-trf` |
| `evaluate` | `setup-eval-data` &rarr; `evaluate-model` |

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`weasel assets`](https://github.com/explosion/weasel/tree/main/docs/cli.md#open_file_folder-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/tlunified_raw_text.txt` | URL | Pre-converted raw text from TLUnified in JSONL format (1.1 GB). |
| `assets/corpus.tar.gz` | URL | Annotated TLUnified corpora in spaCy format with train, dev, and test splits. |
| `assets/tl_newscrawl-ud-train.conllu` | URL | Train dataset for NewsCrawl |
| `assets/tl_newscrawl-ud-dev.conllu` | URL | Dev dataset for NewsCrawl |
| `assets/tl_newscrawl-ud-test.conllu` | URL | Test dataset for NewsCrawl |
| `assets/tl_trg-ud-test.conllu` | URL | Test dataset for TRG |
| `assets/tl_ugnayan-ud-test.conllu` | URL | Test dataset for Ugnayan |
| `assets/uner_trg.iob2` | URL | Test dataset for Universal NER TRG |
| `assets/uner_ugnayan.iob2` | URL | Test dataset for Universal NER Ugnayan |
| `assets/tfnerd.txt` | URL | Test dataset for TF-NERD |
| `assets/fasttext.tl.gz` | URL | Tagalog fastText vectors from the fastText website (trained on CommonCrawl and Wikipedia). |
| `assets/floret` | Git | Floret repository for training floret and fastText models. |
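
Before evaluation, the `.iob2` and `.conllu` test files above are converted into spaCy's binary `.spacy` format (the `setup-eval-data` step). The snippet below is a minimal sketch of such a conversion for an IOB2 file, not the project's actual conversion script; it assumes one tab-separated `token<TAB>tag` pair per line with blank lines between sentences, and the output path is hypothetical:

```python
import spacy
from spacy.tokens import Doc, DocBin
from spacy.training import iob_to_biluo, biluo_tags_to_spans

nlp = spacy.blank("tl")

def read_iob2(path):
    """Yield (words, tags) per sentence from a token<TAB>tag file."""
    words, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if words:
                    yield words, tags
                words, tags = [], []
                continue
            token, tag = line.split("\t")[:2]
            words.append(token)
            tags.append(tag)
    if words:
        yield words, tags

doc_bin = DocBin()
for words, tags in read_iob2("assets/uner_trg.iob2"):
    doc = Doc(nlp.vocab, words=words)
    # Convert IOB2 tags to BILUO, then to entity spans on the Doc.
    doc.ents = biluo_tags_to_spans(doc, iob_to_biluo(tags))
    doc_bin.add(doc)
doc_bin.to_disk("uner_trg.spacy")  # hypothetical output path
```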

<!-- WEASEL: AUTO-GENERATED DOCS END (do not remove) -->
38 changes: 38 additions & 0 deletions models/v0.2.0/configs/assemble.cfg
@@ -0,0 +1,38 @@
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["tok2vec", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[initialize]
vectors = ${paths.parser_model}

[components]

[components.tok2vec]
source = ${paths.parser_model}
component = "tok2vec"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
35 changes: 35 additions & 0 deletions models/v0.2.0/configs/assemble_trf.cfg
@@ -0,0 +1,35 @@
[paths]
parser_model = null
ner_model = null

[nlp]
lang = "tl"
pipeline = ["transformer", "trainable_lemmatizer", "tagger", "morphologizer", "parser", "ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.transformer]
source = ${paths.parser_model}
component = "transformer"

[components.trainable_lemmatizer]
source = ${paths.parser_model}
component = "trainable_lemmatizer"

[components.tagger]
source = ${paths.parser_model}
component = "tagger"

[components.morphologizer]
source = ${paths.parser_model}
component = "morphologizer"

[components.parser]
source = ${paths.parser_model}
component = "parser"

[components.ner]
source = ${paths.ner_model}
component = "ner"
replace_listeners = ["model.tok2vec"]
145 changes: 145 additions & 0 deletions models/v0.2.0/configs/ner.cfg
@@ -0,0 +1,145 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "tl"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]