Releases: Natooz/MidiTok
v3.0.6.post1 PerTok patch
v3.0.6 Minor fixes and `Bar`/`Position` support for PerTok
What's Changed
- Fixing build target + using hatch in github actions by @Natooz in #223
- Bump codecov/codecov-action from 5.3.1 to 5.4.0 by @dependabot in #226
- Bugfix when loading trained `PerTok` tokenizer by @HuwCheston in #229
- Bump codecov/codecov-action from 5.4.0 to 5.4.2 by @dependabot in #230
- Fix error in PerTok decoding (MIDI conversion) when `use_sustain_pedals=True` by @mimbres in #232
- Update split.py by @FilippoGalli001 in #233
- Bump codecov/codecov-action from 5.4.2 to 5.4.3 by @dependabot in #235
- fix: making pitch range upper bound inclusive by @Natooz in #239
- fixing pitch_intervals_max_time_dist type hint + rounding float values by @Natooz in #240
- PerTok: Position tokens instead of Timeshift by @JLenzy in #236
- CI fixes by @Natooz in #241
New Contributors
- @HuwCheston made their first contribution in #229
- @mimbres made their first contribution in #232
- @FilippoGalli001 made their first contribution in #233
Full Changelog: v3.0.5...v3.0.6
v3.0.5.post1
v3.0.5 Bugfixes
What's Changed
- fix import HfHubHTTPError with latest hf hub package update by @Natooz in #199
- MDTK_200: implemented `add_trailing_bars` by @Mintas in #204
- Remove refs to split_midis_for_training in doc by @Zaka in #205
- Catching exception when decoding velocity values in MIDILike by @Natooz in #210
- Update example notebook reference by @emmanuel-ferdman in #216
- bugfix training initial alphabet by @Natooz in #220
- Add a parameter augment_copy to the augment_score function by @pstrepetov in #221
New Contributors
- @Mintas made their first contribution in #204
- @Zaka made their first contribution in #205
- @emmanuel-ferdman made their first contribution in #216
- @pstrepetov made their first contribution in #221
Full Changelog: v3.0.4...v3.0.5
v3.0.4 PerTok tokenizer and Attribute Controls
This release introduces the PerTok tokenizer by Lemonaide AI, attribute control tokens and minor fixes.
Highlights
PerTok: Performance Tokenizer
(associated paper to be released)
Developed by Julian Lenz (@JLenzy) at Lemonaide AI, PerTok captures expressive timing in symbolic scores while keeping sequence lengths competitively low. It achieves this by dividing time differences into macro and micro categories, introducing a new MicroTime token type: subtle deviations from the quantized beat are represented with these MicroTime tokens.
Furthermore, PerTok lets you encode an unlimited number of note subdivisions by allowing multiple, overlapping values within the `beat_res` parameter of the `TokenizerConfig`.
The micro timing tokens will be extended to all tokenizers in a future update.
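The macro/micro decomposition can be illustrated with a short sketch (plain Python; the function below is hypothetical and not part of MidiTok's API): an onset in ticks is split into the nearest quantized grid position and a small signed deviation, which a MicroTime-style token would encode.

```python
def split_timing(onset_ticks: int, grid_ticks: int) -> tuple[int, int]:
    """Decompose an onset into a quantized (macro) part and a signed
    micro deviation, in the spirit of PerTok's MicroTime tokens.

    Hypothetical sketch: MidiTok's real implementation differs.
    """
    # Nearest grid position (macro timing)
    macro = round(onset_ticks / grid_ticks) * grid_ticks
    # Signed deviation from the grid (micro timing)
    micro = onset_ticks - macro
    return macro, micro

# A note played 7 ticks ahead of a grid position (sixteenth grid of 120 ticks)
macro, micro = split_timing(233, 120)
print(macro, micro)  # 240 -7
```

Decoding simply sums the two parts back, so expressive timing survives the round-trip while the vocabulary stays small.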
Attribute Control tokens
Attribute controls are additional tokens that allow models to be controlled during inference: they are used during training to condition a model to predict music with specific features.
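Conceptually, this amounts to prepending attribute tokens to a track's token sequence during training, so the same tokens can be given as a prompt at inference. A minimal sketch (the helper and token names here are hypothetical, not MidiTok's actual attribute-control API):

```python
def add_attribute_controls(track_tokens: list[str], controls: dict[str, int]) -> list[str]:
    """Prepend attribute-control tokens (e.g. note density, polyphony)
    to a track's token sequence. Hypothetical sketch of the idea."""
    control_tokens = [
        f"ACTrackNoteDensity_{controls['note_density']}",
        f"ACTrackPolyphony_{controls['polyphony']}",
    ]
    return control_tokens + track_tokens

seq = add_attribute_controls(
    ["Bar_None", "Position_0", "Pitch_60"],
    {"note_density": 8, "polyphony": 2},
)
print(seq[:2])  # ['ACTrackNoteDensity_8', 'ACTrackPolyphony_2']
```

At inference time, supplying the same control tokens as a prompt steers the model toward generating music with the requested features.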
What's Changed
- updates to Example_HuggingFace_Mistral_Transformer.ipynb by @briane412 in #164
- `_model_name` is now a protected property by @Natooz in #165
- Fixing docs for tokenizer training by @Natooz in #167
- Default `continuing_subword_prefix` when splitting token sequences by @Natooz in #168
- Small bug fix in MIDI pretokenization by @shenranwang in #170
- Adding `no_preprocess_score` argument when tokenizing by @Natooz in #172
- `TokSequence` summable, `concatenate_track_sequences` arg for MMM by @Natooz in #173
- Docs update by @Natooz in #175
- Fixing split methods for empty files (no tracks and/or no notes) by @Natooz in #177
- Logo now with white outer stroke by @Natooz in #180
- Attribute controls feature by @helloWorld199 in #181
- Better distinction between `one_token_stream` and `config.one_token_stream_for_programs` by @Natooz in #182
- Making sure MMM token sequences are not concatenated when splitting them per bar/beat in tokenizer_training_iterator.py by @Natooz in #183
- rST Documentation fixes by @scottclowe in #184
- Bump actions/stale from 5.1.1 to 9.0.0 by @dependabot in #185
- Bump actions/download-artifact from 3 to 4 by @dependabot in #186
- Bump codecov/codecov-action from 3.1.0 to 4.5.0 by @dependabot in #187
- Bump actions/upload-artifact from 3 to 4 by @dependabot in #188
- Fixing bugs caused by changes from symusic v0.5.0 by @Natooz in #192
- `use_velocities` and `use_duration` configuration parameters by @Natooz in #193
- Collator now handles decoder input ids (seq2seq models) by @Natooz in #194
- PerTok Tokenizer by @JLenzy in #191
New Contributors
- @briane412 made their first contribution in #164
- @helloWorld199 made their first contribution in #181
- @scottclowe made their first contribution in #184
- @dependabot made their first contribution in #185
Full Changelog: v3.0.3...v3.0.4
v3.0.3 Training with WordPiece and Unigram + abc files support
Highlights
- Support for abc files, which can be loaded and dumped with symusic similarly to MIDI files;
- The tokenizers can now also be trained with the WordPiece and Unigram algorithms!
- Tokenizer training and token ids encoding can now be performed "bar-wise" or "beat-wise", meaning the tokenizer can learn new tokens from successions of base tokens strictly within bars or beats. This is set by the `encode_ids_split` attribute of the tokenizer config;
- symusic v0.4.3 or higher is now required to comply with the usage of the `clip` method;
- Better handling of file loading errors in `DatasetMIDI` and `DataCollator`;
- Introducing a new `filter_dataset` method to clean a dataset of MIDI/abc files before using it;
- The `MMM` tokenizer has been cleaned up and is now fully modular: it now works on top of other tokenizations (`REMI`, `TSD` and `MIDILike`) to allow more flexibility and interoperability;
- `TokSequence` objects can now be sliced and concatenated (e.g. `seq3 = seq1[:50] + seq2[50:]`);
- `TokSequence` objects tokenized from a tokenizer can now be split per bar or beat subsequences;
- Minor fixes, code improvements and cleaning;
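The new slicing and concatenation behaviour can be emulated with a minimal stand-in class (a sketch for illustration, not miditok's actual `TokSequence` implementation):

```python
from dataclasses import dataclass, field


@dataclass
class MiniTokSequence:
    """Minimal stand-in for miditok's TokSequence, supporting
    slicing and concatenation like `seq3 = seq1[:50] + seq2[50:]`."""

    ids: list[int] = field(default_factory=list)

    def __getitem__(self, idx):
        # Slicing returns a new sequence; an int index returns a single id
        if isinstance(idx, slice):
            return MiniTokSequence(self.ids[idx])
        return self.ids[idx]

    def __add__(self, other: "MiniTokSequence") -> "MiniTokSequence":
        return MiniTokSequence(self.ids + other.ids)

    def __len__(self) -> int:
        return len(self.ids)


seq1 = MiniTokSequence(list(range(100)))
seq2 = MiniTokSequence(list(range(100, 200)))
seq3 = seq1[:50] + seq2[50:]
print(len(seq3))  # 100
```

The real `TokSequence` also keeps the `tokens`, `bytes` and `events` attributes in sync when sliced, which this sketch omits.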
Methods renaming
A few methods and properties were previously named after "bpe" and "midi". To align with the more general usages of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names.
Methods renamed with deprecation warning:
- `midi_to_tokens` --> `encode`;
- `tokens_to_midi` --> `decode`;
- `learn_bpe` --> `train`;
- `apply_bpe` --> `encode_token_ids`;
- `decode_bpe` --> `decode_token_ids`;
- `ids_bpe_encoded` --> `are_ids_encoded`;
- `vocab_bpe` --> `vocab_model`;
- `tokenize_midi_dataset` --> `tokenize_dataset`;
Methods renamed without deprecation warning (fewer usages, reduces code messiness):
- `MIDITokenizer` --> `MusicTokenizer`;
- `augment_midi` --> `augment_score`;
- `augment_midi_dataset` --> `augment_dataset`;
- `augment_midi_multiple_offsets` --> `augment_score_multiple_offsets`;
- `split_midis_for_training` --> `split_files_for_training`;
- `split_midi_per_note_density` --> `split_score_per_note_density`;
- `get_midi_programs` --> `get_score_programs`;
- `merge_midis` --> `merge_scores`;
- `get_midi_ticks_per_beat` --> `get_score_ticks_per_beat`;
- `split_midi_per_ticks` --> `split_score_per_ticks`;
- `split_midi_per_beats` --> `split_score_per_beats`;
- `split_midi_per_tracks` --> `split_score_per_tracks`;
- `concat_midis` --> `concat_scores`;
Protected internal methods (no deprecation warning, advanced usages):
- `MIDITokenizer._tokens_to_midi` --> `MusicTokenizer._tokens_to_score`;
- `MIDITokenizer._midi_to_tokens` --> `MusicTokenizer._score_to_tokens`;
- `MIDITokenizer._create_midi_events` --> `MusicTokenizer._create_global_events`
There are no other compatibility issues besides these renamings.
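The renamings with deprecation warnings follow the usual alias pattern: the old name forwards to the new one and emits a `DeprecationWarning`. A sketch of the pattern (not the actual MidiTok source):

```python
import warnings


class MusicTokenizer:
    def encode(self, score):
        """New name: convert a score into token sequences."""
        return f"tokens({score})"  # placeholder body for the sketch

    def midi_to_tokens(self, score):
        """Deprecated alias kept for backward compatibility."""
        warnings.warn(
            "midi_to_tokens is deprecated, use encode instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.encode(score)


tok = MusicTokenizer()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = tok.midi_to_tokens("song.mid")
print(result)  # tokens(song.mid)
```

Existing code calling the old names keeps working, while the warning points users to the new API.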
Full Changelog: v3.0.2...v3.0.3
v3.0.2 New data loading and preprocessing methods
Tldr
This new version introduces a new `DatasetMIDI` class to use when training PyTorch models. It builds upon the previously named `DatasetTok` class, with a pre-tokenizing option and better handling of BOS and EOS tokens.
A new `miditok.pytorch_data.split_midis_for_training` method dynamically chunks MIDIs into smaller parts that approximately match the desired token sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.
A few new utils methods have been created for these features, e.g. to split, concatenate or merge `symusic.Score` objects.
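The splitting idea can be sketched as follows (plain Python with made-up numbers; the real method operates on `symusic.Score` objects and estimates token counts from note densities):

```python
def chunk_bars_by_token_budget(tokens_per_bar: list[int], target_len: int) -> list[list[int]]:
    """Group consecutive bar indices into chunks whose estimated token
    counts approximate the desired sequence length. Conceptual sketch only."""
    chunks, current, current_len = [], [], 0
    for i, n_tokens in enumerate(tokens_per_bar):
        # Start a new chunk when adding this bar would overshoot the target
        if current and current_len + n_tokens > target_len:
            chunks.append(current)
            current, current_len = [], 0
        current.append(i)
        current_len += n_tokens
    if current:
        chunks.append(current)
    return chunks


# Estimated token counts per bar of a file, chunked for ~128-token sequences
print(chunk_bars_by_token_budget([60, 50, 40, 70, 30, 90], 128))
# [[0, 1], [2, 3], [4, 5]]
```

Splitting at bar boundaries keeps each chunk musically coherent, while the per-bar budget keeps sequence lengths close to the model's context size.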
Thanks @Kinyugo for the discussions and tests that guided the development of the features! (#147)
The update also brings a few minor fixes, and the docs have a new theme!
What's Changed
- Fix token_paths to files_paths, and config to model_config by @sunsetsobserver in #145
- Fix issues in Octuple with multiple different-beat time signatures by @ilya16 in #146
- Pitch interval decoding: discarding notes outside the tokenizer pitch range by @Natooz in #149
- Fixing `save_pretrained` to comply with huggingface_hub v0.21 by @Natooz in #150
- Ability to overwrite `_create_durations_tuples` in init by @JLenzy in #153
- Refactor of PyTorch data loading classes and methods by @Natooz and @Kinyugo in #148
- The docs have a new theme! Using the furo theme.
New Contributors
- @sunsetsobserver made their first contribution in #145
- @JLenzy made their first contribution in #153
Full Changelog: v3.0.1...v3.0.2
V3.0.1 PitchDrum and minor fixes
What's Changed
- `use_pitchdrum_tokens` option to use dedicated `PitchDrum` tokens for drum tracks
- Fixing time signature preprocessing (time division mismatch) in #132 (#131 @EterDelta)
- Fixing data augmentation example and considering all midi extensions in #136 (#135 @oiabtt)
- Decoding: automatically making sure to decode BPE then completing `tokens` in #138 (#137 @oiabtt)
- `load_tokens` now returning `TokSequence` in #139 (#137 @oiabtt)
- Convert chord maps back to tuples from lists when loading a tokenizer from a saved configuration by @shenranwang in #141
- Can now use `MIDITokenizer.from_pretrained` similarly to `AutoTokenizer` in the Hugging Face transformers library in #142 (discussed in #127 @oiabtt)
New Contributors
- @shenranwang made their first contribution in #141
Full Changelog: v3.0.0...v3.0.1
V3.0 Switch to Symusic - performance boost
Switch to symusic
This major version marks the switch from the miditoolkit MIDI reading/writing library to symusic, and a large optimisation of the MIDI preprocessing steps.
Symusic is a MIDI reading/writing library written in C++ with Python bindings, offering unmatched speeds, up to 500 times faster than native Python libraries. It is based on minimidi. The two libraries are created and maintained by @Yikai-Liao and @lzqlzzq, who have done amazing work, which is still ongoing as many useful features are on the roadmap! 🫶
Tokenizers from previous versions are compatible with this new version, but there might be some timing variations if you compare how MIDIs are tokenized and how tokens are decoded.
Performance boost
These changes result in much faster MIDI loading/writing and tokenization! The overall tokenization (loading a MIDI and tokenizing it) is between 5 and 12 times faster depending on the tokenizer and data. You can find other benchmarks here.
This huge speed gain makes it possible to discard the previously recommended step of pre-tokenizing MIDI files as JSON tokens, and to directly tokenize MIDIs on the fly while training/using a model! We updated the usage examples in the docs accordingly; the code is now simpler.
Other major changes
- When using time signatures, time tokens are now computed in ticks per beat, as opposed to ticks per quarter note as done previously. This change is in line with the definition of time and duration tokens, which was not handled following the MIDI norm for note values other than the quarter note until now (#124);
- Adding new ruff rules and their fixes to comply, increasing the code quality in #115;
- MidiTok still supports `miditoolkit.MidiFile` objects, but they will be converted on the fly to `symusic.Score` objects and a deprecation warning will be thrown;
- The token-level data augmentation methods have been removed in favour of better data augmentation operating directly on MIDIs, which is much faster, simplifies the process and now handles durations;
- The docs are fixed;
- The tokenization test workflows have been unified and considerably simplified, leading to more robust test assertions. We also increased the number of test cases and configurations while decreasing the test time.
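On the first point above: following the MIDI norm, the beat value is given by the time signature's denominator, so the number of ticks in a beat depends on the current time signature rather than always equalling the ticks per quarter note. A small sketch of the arithmetic (hypothetical helper, not MidiTok's API):

```python
def ticks_per_beat(ticks_per_quarter: int, ts_denominator: int) -> int:
    """Ticks in one beat for a time signature with the given denominator.
    The beat is the denominator's note value, e.g. an eighth note in 6/8.
    Hypothetical helper for illustration."""
    # A whole note spans 4 quarters; the denominator divides it into beats
    return ticks_per_quarter * 4 // ts_denominator


print(ticks_per_beat(480, 4))  # 480 (quarter-note beat, e.g. 4/4)
print(ticks_per_beat(480, 8))  # 240 (eighth-note beat, e.g. 6/8)
```

Computing time tokens from this per-beat value keeps durations consistent with the notated meter in non-quarter-note time signatures.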
Other minor changes
- Setting special token values in `TokenizerConfig` in #114
- Update README.md by @kalyani2003 in #120
- Readthedocs preview action for PRs in #125
New Contributors
- @kalyani2003 made their first contribution in #120
Full Changelog: v2.1.8...v3.0.0
v2.1.8 Pitch Intervals & minor fixes
This new version brings a new additional token type: pitch intervals. It allows representing pitch intervals for simultaneous and successive notes. You can read more details about how it works in the docs.
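As a sketch of the idea (illustrative only, not MidiTok's implementation), a successive-note pitch interval is the signed difference from the previous note's pitch, which an interval token can encode in place of an absolute pitch:

```python
def pitch_intervals(pitches: list[int]) -> list[int]:
    """Signed intervals between successive note pitches, as could be
    encoded by pitch-interval tokens. Conceptual sketch only."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


# C4, E4, G4, C5 as MIDI pitch numbers
print(pitch_intervals([60, 64, 67, 72]))  # [4, 3, 5]
```

Encoding intervals rather than absolute pitches can help a model generalize melodic patterns across transpositions.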
We greatly improved the tests and CI workflow, and made a few minor bug fixes and improvements along the way.
This new version also drops support for Python 3.7, and now requires Python 3.8 and newer. You can read more about the decision and how to make it retro-compatible in the docs.
We encourage you to update to the latest miditoolkit version, which also features some fixes and improvements. The most notable ones are a cleanup of the dependencies and compatibility with recent numpy versions!
What's Changed
- Typos fixes in docs by @eltociear (#89), @gfggithubleet (#91 and #93), @shresthasurav (#94), @THEFZNKHAN (#98 and #99)
- Fixing a bug when learning bpe without special tokens by @Natooz in #92
- Switch lint/isort/format to Ruff by @akx in #105
- Adding pitch interval option by @Natooz in #103
- Switching to pyproject.toml and hatch packaging by @Natooz in #106
- Fix data augment by @parneyw in #109
- dealing with empty midi file by @feiyuehchen in #110
- Better tests + minor improvements by @Natooz in #108
New Contributors
- @eltociear made their first contribution in #89
- @gfggithubleet made their first contribution in #91
- @shresthasurav made their first contribution in #94
- @THEFZNKHAN made their first contribution in #98
- @akx made their first contribution in #105
- @parneyw made their first contribution in #109
- @feiyuehchen made their first contribution in #110
Full Changelog: v2.1.7...v2.1.8