All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Fixed compatibility with `transformers` v4.47+.
- Fixed integrations with newer dependency versions, like `transformers` and `huggingface_hub`.
- Deprecated Python 3.8.
- Added support for BILO tagging schemes.
- Changed the error when an empty sentence is provided to the tokenizer.
- Using spaCy `nlp.pipe` now processes texts sentence-wise, just like for `nlp(...)`.
- No longer override `language` metadata from the dataset if the language was also set manually via `SpanMarkerModelCardData`.
- No longer crash on `predict` with `ValueError: Failed to concatenate on axis=1 ...` if the first sentence in a list of sentences is just one word.
- Added `SpanMarkerModel.generate_model_card()` method to get a model card string.
- Added `SpanMarkerModelCardData` that should be passed to `SpanMarkerModel.from_pretrained` with additional information like `language`, `license`, `model_name`, `model_id`, `encoder_name`, `encoder_id`, `dataset_name`, `dataset_id`, `dataset_revision`.
- Added `transformers` `pipeline` support, e.g. `pipeline(task="span-marker", model="tomaarsen/span-marker-mbert-base-multinerd")`.
- Heavily improved the automatically generated model card.
- Evaluating outside of training now returns per-label outputs instead of only "overall" F1, precision and recall.
- Warn if the tokenizer in use distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space.
  - If so, inference with that model requires the punctuation to be split from the words.
- Improve label normalization speed.
- Allow calling `SpanMarkerModel.from_pretrained` with a pre-initialized `SpanMarkerConfig`.
- Deprecated Python 3.7.
- Fixed tokenization mismatch between training and inference for XLM-RoBERTa models: allows for normal inference of those models.
- Resolve niche bug when TrainingArguments are not provided.
- Added an `overwrite_entities` parameter to the spaCy pipeline component to allow for overwriting spaCy entities.
- Added `.pipe()` method to spaCy integration to allow for batched inference.
- Stop overwriting spaCy entities by default.
- Allow for immutable `TrainingArguments` from newer `transformers` releases.
- Resolved broken license information.
- Fix crash in spaCy inference when using subsequent whitespace.
- Added support for using `span_marker` spaCy pipeline component without importing SpanMarker.
- Added support for `load_in_8bit=True` and `device_map="auto"`.
- Added `trained_with_document_context` to the SpanMarkerConfig.
  - Added warnings if a model is trained with document-context and evaluated/inferenced without, or vice versa.
- Added `spaCy` integration via `nlp.add_pipe("span_marker")`. See the SpanMarker with spaCy documentation for information.
- Heavily improved computational efficiency of sample spreading, resulting in notably faster inference speeds.
- Disable progress bar for inference by default, and add `show_progress_bar` parameter to `SpanMarkerModel.predict`.
- Fixed evaluation method failing when the testing dataset contains two adjacent and identical sentences.
- Add missing space in model card template.
- Return nested list if input is a singular list of sentences or a dataset with one sample.
- Added support for document-level context in training, evaluation and inference.
  - Use it by supplying `document_id` and `sentence_id` columns to the Trainer datasets.
  - Tune it by supplying `max_prev_context` and `max_next_context` to the `SpanMarkerConfig` via `SpanMarkerModel.from_pretrained(..., max_prev_context=3)`.
- Added batch inference support via `SpanMarkerModel.predict(..., batch_size=4)`.
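
As a rough illustration of the document-level context entry above, the sketch below attaches `document_id` and `sentence_id` values to per-sentence rows before they would be turned into a Trainer dataset. Only the two column names come from this changelog; the helper `add_context_ids`, the `tokens` column name, and the sample sentences are assumptions made for the example, and label columns are left out for brevity.

```python
from typing import Dict, List


def add_context_ids(documents: List[List[List[str]]]) -> List[Dict[str, object]]:
    """Flatten documents into one row per sentence, tagged with context IDs.

    `documents` is a list of documents, each a list of tokenized sentences.
    This is a hypothetical helper, not part of SpanMarker.
    """
    rows = []
    for document_id, sentences in enumerate(documents):
        for sentence_id, tokens in enumerate(sentences):
            rows.append(
                {
                    "tokens": tokens,  # assumed column name for the words
                    "document_id": document_id,  # which document the sentence is from
                    "sentence_id": sentence_id,  # position of the sentence in that document
                }
            )
    return rows


documents = [
    [["Paris", "is", "large", "."], ["It", "is", "in", "France", "."]],
    [["Amsterdam", "is", "rainy", "."]],
]
rows = add_context_ids(documents)
```

Each sentence keeps its own row, so the identifiers are the only thing linking a sentence to its neighbours in the same document.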
- Ensure models are in evaluation mode when using `SpanMarkerModel.predict`.
- Removed the `allow_overlapping` optional keyword from `SpanMarkerModel.predict`.
- Fixed critical issue with incorrect predictions at inputs that require multiple samples.
- Added a warning for entities that are ignored/skipped due to the maximum entity length or maximum model input length.
- Added info-level logs displaying the detected labeling scheme (IOB/IOB2, BIOES, BILOU, none).
- Added a warning suggesting to use `model.cuda()` when predictions are performed on a CPU while CUDA is available.
- Added `try_cuda` method to `SpanMarkerModel` which tries to place the model on CUDA and does nothing if that fails.
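
To make the labeling schemes named in the log entry above concrete, here is a minimal sketch of how a scheme could be guessed from a dataset's label names by looking at their tag prefixes. It is an illustration only and not SpanMarker's actual detection logic; the function name `guess_labeling_scheme` and the sample labels are hypothetical.

```python
from typing import Iterable


def guess_labeling_scheme(labels: Iterable[str]) -> str:
    """Guess the tagging scheme from label names such as "B-PER" or "O".

    Hypothetical helper: it only inspects the part before the first hyphen.
    """
    prefixes = {label.split("-", 1)[0] for label in labels if "-" in label}
    if not prefixes:
        return "none"  # plain labels like "PER", no scheme prefixes
    if prefixes <= {"B", "I"}:
        return "IOB/IOB2"  # Begin, Inside (plus "O" for Outside)
    if prefixes <= {"B", "I", "E", "S"}:
        return "BIOES"  # adds End and Single
    if prefixes <= {"B", "I", "L", "U"}:
        return "BILOU"  # adds Last and Unit
    return "none"  # hyphenated labels that are not scheme tags


iob = guess_labeling_scheme(["O", "B-PER", "I-PER"])
bioes = guess_labeling_scheme(["O", "B-PER", "I-PER", "E-PER", "S-PER"])
bilou = guess_labeling_scheme(["O", "B-PER", "I-PER", "L-PER", "U-PER"])
plain = guess_labeling_scheme(["O", "PER", "ORG"])
```

A label set that only ever uses `B-` and `I-` tags is reported as IOB/IOB2 here, even if the dataset was nominally annotated with a richer scheme.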
- Updated where in the input IDs the span markers are stored, resulting in a 40% training and inference speed increase.
- Updated default `marker_max_length` in SpanMarkerConfig from 256 to 128.
- Updated default `entity_max_length` in SpanMarkerConfig from 16 to 8.
- Add support for `datasets<2.6.0`.
- Add warning if a `<v1.0.0` model is loaded using `v1.0.0` or newer.
- Propagate `SpanMarkerModel.from_pretrained` kwargs to the encoder's `AutoModel.from_pretrained`.
- Ignore `UndefinedMetricWarning` when evaluation f1 is 0.
- Improved model card generation.
- Resolved tricky issue causing models to learn to never predict the last token as an entity (Closes #1).
- Fixed label normalization for BILOU datasets.
- Correctly propagate `SpanMarkerModel.from_pretrained` kwargs to the config initialisation.
- Save `span_marker_version` in config files from now on.
- `SpanMarkerModel.save_pretrained` and `SpanMarkerModel.push_to_hub` now also include the tokenizer and a simple model card.
- Added missing docstrings.
- Updated how entity span indices are returned for `SpanMarkerModel.predict`.
- Prevent incorrect labels when loading a model trained with a schemed (e.g. IOB, BIOES) dataset.
- Fix several bugs with loading finetuned SpanMarker models.
- Add missing methods to `SpanMarkerTokenizer`.
- Fix endless recursion bug when providing a `compute_metrics` to the Trainer.
- Prevent crash when `args` is not supplied to the Trainer.
- Prevent crash on evaluation when using `fp16=True` as a Training Argument.
- Implement initial working version.