Skip to content

Latest commit

 

History

History
234 lines (139 loc) · 7.13 KB

CHANGELOG.md

File metadata and controls

234 lines (139 loc) · 7.13 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[1.7.0]

Fixed

  • Fixed compatibility with transformers v4.47+.

[1.6.0]

Fixed

  • Fixed integrations with newer dependency versions, like transformers and huggingface_hub.

Deprecated

  • Deprecated Python 3.8.

[1.5.0]

Added

  • Added support for BILO tagging schemes.

Changed

  • Changed the error when an empty sentence is provided to the tokenizer.
  • Using spaCy nlp.pipe now processes texts sentence-wise, just like for nlp(...).

Fixed

  • No longer override language metadata from the dataset if the language was also set manually via SpanMarkerModelCardData.
  • No longer crash on predict with ValueError: Failed to concatenate on axis=1 ... if the first sentence in a list of sentences is just one word.

[1.4.0]

Added

  • Added SpanMarkerModel.generate_model_card() method to get a model card string.
  • Added SpanMarkerModelCardData that should be passed to SpanMarkerModel.from_pretrained with additional information like
    • language, license, model_name, model_id, encoder_name, encoder_id, dataset_name, dataset_id, dataset_revision.
  • Added transformers pipeline support, e.g. pipeline(task="span-marker", model="tomaarsen/span-marker-mbert-base-multinerd").

Changed

  • Heavily improved automatic model card generated.
  • Evaluating outside of training now returns per-label outputs instead of only "overall" F1, precision and recall.
  • Warn if the used tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space.
    • If so, then inference of that model will require the punctuation to be split from the words.
  • Improve label normalization speed.
  • Allow you to call SpanMarkerModel.from_pretrained with a pre-initialized SpanMarkerConfig.

Deprecated

  • Deprecated Python 3.7.

Fixed

  • Fixed tokenization mismatch between training and inference for XLM-RoBERTa models: allows for normal inference of those models.
  • Resolve niche bug when TrainingArguments are not provided.

[1.3.0]

Added

  • Added an overwrite_entities parameter to the spaCy pipeline component to allow for overwriting spaCy entities.
  • Added .pipe() method to spaCy integration to allow for batched inference.

Changed

  • Stop overwriting spaCy entities by default.

[1.2.5]

Fixed

  • Allow for immutable TrainingArguments from newer transformers release.

[1.2.4]

Fixed

  • Resolved broken license information.

[1.2.3]

Fixed

  • Fix crash in spaCy inference when using subsequent whitespace.

[1.2.2]

Added

  • Added support for using span_marker spaCy pipeline component without importing SpanMarker.

[1.2.1]

Added

  • Added support for load_in_8bit=True and device_map="auto".

[1.2.0]

Added

  • Added trained_with_document_context to the SpanMarkerConfig.
    • Added warnings if a model is trained with document-context and evaluated/inferenced without, or vice versa.
  • Added spaCy integration via nlp.add_pipe("span_marker"). See the SpanMarker with spaCy documentation for information.

Changed

  • Heavily improved computational efficiency of sample spreading, resulting in notably faster inference speeds.
  • Disable progress bar for inference by default, and add show_progress_bar parameter to SpanMarkerModel.predict.

Fixed

  • Fixed evaluation method failing when the testing dataset contains two adjacent and identical sentences.

[1.1.1]

Fixed

  • Add missing space in model card template.
  • Return nested list if input is a singular list of sentences or a dataset with one sample.

[1.1.0]

Added

  • Added support for document-level context in training, evaluation and inference.
    • Use it by supplying document_id and sentence_id columns to the Trainer datasets.
    • Tune it by supplying max_prev_context and max_next_context to the SpanMarkerConfig via SpanMarkerModel.from_pretrained(..., max_prev_context=3).
  • Added batch inference support via SpanMarkerModel.predict(..., batch_size=4).

Changed

  • Ensure models are in evaluation mode when using SpanMarkerModel.predict.

Deprecated

  • Removed the allow_overlapping optional keyword from SpanMarkerModel.predict

[1.0.1]

Fixed

  • Fixed critical issue with incorrect predictions at inputs that require multiple samples.

[1.0.0]

Added

  • Added a warning for entities that are ignored/skipped due to the maximum entity length or maximum model input length.
  • Added info-level logs displaying the detected labeling scheme (IOB/IOB2, BIOES, BILOU, none).
  • Added a warning suggesting to use model.cuda() when predictions are performed on a CPU while CUDA is available.
  • Added try_cuda method to SpanMarkerModel which tries to place the model on CUDA and does nothing if that fails.

Changed

  • Updated where in the input IDs the span markers are stored, results in 40% training and inferencing speed increase.
  • Updated default marker_max_length in SpanMarkerConfig from 256 to 128.
  • Updated default entity_max_length in SpanMarkerConfig from 16 to 8.
  • Add support for datasets<2.6.0.
  • Add warning if a <v1.0.0 model is loaded using v1.0.0 or newer.
  • Propagate SpanMarkerModel.from_pretrained kwargs to the encoder its AutoModel.from_pretrained.
  • Ignore UndefinedMetricWarning when evaluation f1 is 0.
  • Improved model card generation.

Fixed

  • Resolved tricky issue causing models to learn to never predict the last token as an entity (Closes #1).
  • Fixed label normalization for BILOU datasets.

[0.2.2] - 2023-04-13

Fixed

  • Correctly propagate SpanMarkerModel.from_pretrained kwargs to the config initialisation.

[0.2.1] - 2023-04-07

Added

  • Save span_marker_version in config files from now on.

Changed

  • SpanMarkerModel.save_pretrained and SpanMarkerModel.push_to_hub now also pushes the tokenizer and a simple model card.

[0.2.0] - 2023-04-06

Added

  • Added missing docstrings.

Changed

  • Updated how entity span indices are returned for SpanMarkerModel.predict.

Fixed

  • Prevent incorrect labels when loading a model trained with a schemed (e.g. IOB, BIOES) dataset.
  • Fix several bugs with loading finetuned SpanMarker models.
  • Add missing methods to SpanMarkerTokenizer.
  • Fix endless recursion bug when providing a compute_metrics to the Trainer.

[0.1.1] - 2023-03-31

Fixed

  • Prevent crash when args not supplied to Trainer.
  • Prevent crash on evaluation when using fp16=True as a Training Argument.

[0.1.0] - 2023-03-30

Added

  • Implement initial working version.