All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Build issue due to leading `./` in included file paths (#7)
Paper published in ACS JCIM: Tokenization for Molecular Foundation Models
- Bumped PyO3, tokenizers and dict_derive dependencies (#2)
- Switched to uv for CI/pre-commit workflows (#2)
- Increased minimum Python version to 3.9 (#2)
- Mark version as dynamic in pyproject (#2)
- The vocab for `SmirkSelfiesFast` can now be set by passing a `vocab_file` (#3)
- The default unknown token for the Rust `SmirkTokenizer` is now `[UNK]`, matching the Python default (#3)
- Renamed `SmirkSelfiesFast` `vocab` parameter to `vocab_file` (#3)
- Default for `--split-structure` is now `True` for `smirk.cli` and `train_gpe` (#3)
- Moved GPE training from a method (`SmirkTokenizerFast.train`) to a function (`smirk.train_gpe`) (#3)
v0.1.1 - 2024-12-09
Preprint v2 posted: arXiv:2409.15370v2
- Added support for post-processing templates to `SmirkTokenizerFast` (#1)
- Registered smirk with transformers' `AutoTokenizer` (#1)
- Added `vocab`, `convert_ids_to_tokens` and `convert_tokens_to_ids` methods (#1)
- Added support for truncating and padding during tokenization (#1)
- Fixed CI to install test dependencies (#1)
v0.1.0 - 2024-09-11
Preprint posted: arXiv:2409.15370v1
- Initial tagged version of smirk