Releases: tsproisl/SoMaJo
v1.10.3
- New option -v/--version to output version information.
- Explicitly specify input encoding as UTF-8.
v1.10.2
Fixed a bug where SoMaJo could not be installed if the regex package was not already installed.
v1.10.0
- Treat emoji sequences that render as a single grapheme as a single token. This includes flags and sequences containing modifiers and zero-width joiners.
- Recognize underscores used for "underlining" and split them off.
- Added a few Unicode formatting characters to the “nasty” characters.
- Replaced POSIX character classes with built-ins or Unicode properties.
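
To illustrate what treating an emoji sequence as a single token means, here is a minimal sketch in plain Python. It is not SoMaJo's implementation (SoMaJo's actual rules are not shown in these notes); the function name group_emoji_sequences and its helpers are hypothetical, and the logic is deliberately simplified: it pairs regional indicator symbols into flags and attaches skin-tone modifiers, variation selector 16 and zero-width-joiner continuations to the preceding character, without checking that the base character is really an emoji.

```python
# Simplified sketch, NOT SoMaJo's code: keep flags, modifier sequences and
# zero-width-joiner (ZWJ) sequences together as one unit.

ZWJ = "\u200d"   # zero-width joiner
VS16 = "\ufe0f"  # variation selector 16 (emoji presentation)


def is_regional_indicator(ch):
    """True for the 26 regional indicator symbols used to build flags."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def is_skin_tone_modifier(ch):
    """True for the five Fitzpatrick skin-tone modifiers."""
    return "\U0001F3FB" <= ch <= "\U0001F3FF"


def group_emoji_sequences(text):
    """Split text into units, one per code point unless a sequence is found."""
    units = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        # A flag is a pair of regional indicator symbols.
        if is_regional_indicator(ch) and i + 1 < n and is_regional_indicator(text[i + 1]):
            units.append(text[i:i + 2])
            i += 2
            continue
        j = i + 1
        while j < n:
            nxt = text[j]
            if nxt == VS16 or is_skin_tone_modifier(nxt):
                j += 1  # modifier attaches to the preceding character
            elif nxt == ZWJ and j + 1 < n:
                j += 2  # the joiner glues the following character on
            else:
                break
        units.append(text[i:j])
        i = j
    return units


if __name__ == "__main__":
    family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man, ZWJ, woman, ZWJ, girl
    print(group_emoji_sequences("Hi " + family))
```

In this sketch the three-person family sequence stays together as one unit of five code points, whereas a naive per-character split would produce five separate pieces.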
v1.9.0
Added a new method Tokenizer.tokenize_file for easy tokenization of files from Python.
v1.8.3
This is a bugfix release (see CHANGES.txt).
v1.8.2
This release fixes two bugs (cf. CHANGES.txt for details).
v1.8.1
Bugfix release, see CHANGES.txt.
v1.8.0
- SoMaJo can now tokenize English text.
- Minor improvements to tokenization.
v1.7.0
SoMaJo now has full XML support; see CHANGES.txt for details.
v1.6.0
Some small improvements, see CHANGES.txt.