65 lines (38 loc) · 2.71 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

Fixed

Build issue due to leading ./ in included file paths (#7)

v0.2.0

Paper published in ACS JCIM: Tokenization for Molecular Foundation Models

Added

Started a changelog (#2)
Added a release pipeline (#6)

Changed

Bumped PyO3, tokenizers and dict_derive dependencies (#2)
Switched to uv for CI/pre-commit workflows (#2)

Breaking

Increased minimum python version to 3.9 (#2)

Fixed

Mark version as dynamic in pyproject (#2)
The vocab for SmirkSelfiesFast can now be set by passing a vocab_file (#3)
The default unknown token for the rust SmirkTokenzier is now [UNK] matching the python default (#3)

Removed

Renamed SmirkSelfiesFast vocab parameter to vocab_file (#3)
Default for --split-structure is now True for smirk.cli and train_gpe (#3)
Moved GPE training from a method (SmirkTokenizerFast.train) to a function (smirk.train_gpe) (#3)

v0.1.1 - 2024-12-09

Preprint v2 posted: arXiv:2409.15370v2

Added

Added support for post-processing templates to SmirkTokenizerFast (#1)
Registered smirk with transformer's AutoTokenizer (#1)
Added vocab, convert_ids_to_tokens and convert_tokens_to_ids methods (#1)
Added support for truncating and padding during tokenization (#1)

Fixed

Fixed CI to install test dependencies (#1)

v0.1.0 - 2024-09-11

Preprint posted: arXiv:2409.15370v1

Added

Initial tagged version of smirk